WO2019028710A1 - Candidate set support calculation method based on graph structure data and application thereof

Publication number
WO2019028710A1
WO2019028710A1 PCT/CN2017/096672 CN2017096672W WO2019028710A1 WO 2019028710 A1 WO2019028710 A1 WO 2019028710A1 CN 2017096672 W CN2017096672 W CN 2017096672W WO 2019028710 A1 WO2019028710 A1 WO 2019028710A1
Authority
WO
WIPO (PCT)
Prior art keywords
candidate set
support
structure data
determining
candidate
Prior art date
Application number
PCT/CN2017/096672
Other languages
English (en)
French (fr)
Inventor
钟叶青
张锐
陈文光
Original Assignee
深圳清华大学研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳清华大学研究院
Priority to CN201780094550.6A (CN111316257A)
Priority to PCT/CN2017/096672 (WO2019028710A1)
Publication of WO2019028710A1
Priority to US16/718,305 (US10776372B2)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9024 Graphs; Linked lists

Definitions

  • The present invention relates generally to the field of association rules, and more particularly to a candidate set support calculation method based on graph structure data, a frequent item set determination method using the same, and a frequent item set determination method based on a pre-order tree.
  • Association rule mining is one of the most important techniques in data mining.
  • Association rules were first proposed to solve the supermarket sales problem. A large supermarket collects many transaction records, and its managers hope to extract valuable information from these records to help develop sales strategies.
  • A transaction record contains a collection of elements, each of which, in the supermarket example, represents a commodity.
  • The calculation of association rules is generally divided into two steps: 1. find all the frequent itemsets; 2. find all the association rules based on the frequent itemsets found in the first step. Since the second step is very simple, the cost of computing association rules basically depends on the first step.
  • The Apriori algorithm based on the pre-order tree is one of the most famous and widely used algorithms for calculating frequent itemsets. To calculate the frequent item sets with k elements, it goes through two steps: 1. generate the k-candidate sets from the (k-1)-frequent item sets; 2. scan the database to get the support of the k-candidate sets and, in turn, obtain the k-frequent item sets.
  • The algorithm uses a pre-order tree to store frequent itemsets, and each node of the k-th layer of the pre-order tree represents a k-frequent itemset.
  • As k increases, the depth of the pre-order tree increases.
  • When the tree is deep, the algorithm spends too much time in the second step of each loop: scanning the database to get the support of the candidate sets. This is because, when the database is used to recursively traverse the pre-order tree, the increase in the depth of the pre-order tree cannot be avoided, so the number of database traversals grows, which in turn consumes a lot of time.
  • In view of the above defects and deficiencies in the prior art, the object of the present invention is to provide a novel and improved candidate set support calculation method based on graph structure data, a frequent item set determination method using the same, and a frequent item set determination method based on a pre-order tree.
  • A candidate set support calculation method based on graph structure data is provided, comprising: converting data in a database into graph structure data, wherein each point in the graph structure data represents a record in the database, and the edge between any two points represents the intersection of the sets of elements of the two records corresponding to the two points; obtaining a candidate set from the database; obtaining the connected component corresponding to the candidate set in the graph structure data; determining the number of points included in the connected component; and determining the number of points as the support of the candidate set.
  • In the above method, the step of obtaining the candidate set from the database specifically includes: obtaining a k-candidate set by the Apriori algorithm based on the pre-order tree, where k is an integer greater than 1.
  • In the above method, the step of obtaining a candidate set from the database specifically includes: generating a candidate set with a hash table by using the DHP method.
  • In the above method, the step of determining the number of points included in the connected component specifically includes: calculating the number of points in the connected component by using an iterative algorithm based on label transfer.
  • A frequent item set determining method is provided, including: obtaining the support of a candidate set using the candidate set support calculation method based on graph structure data as described above; determining whether the support of the candidate set is greater than a predetermined threshold; and, in a case where the support of the candidate set is greater than the predetermined threshold, determining the candidate set to be a frequent item set.
  • A pre-order tree-based frequent item set determining method is provided, including: determining a first time taken to recursively traverse a pre-order tree of depth k using the database; determining the number of k-candidate sets, and taking the product of that number and the time needed to calculate a connected component on the graph using the graph calculation method as a second time; comparing the first time with the second time; in the case where the first time is less than the second time, determining the support of the candidate set using database recursive calculation; and in the case where the first time is greater than the second time, obtaining the support of the candidate set using the candidate set support calculation method based on graph structure data as described above.
  • The above method further includes: determining whether the support of the obtained candidate set is greater than a predetermined threshold; and, in a case where the support of the obtained candidate set is greater than the predetermined threshold, determining that the candidate set is a frequent item set.
  • Before determining the first time, the above method further includes: obtaining a k-candidate set by the Apriori algorithm based on the pre-order tree, where k is an integer greater than 1.
  • Alternatively, the above method further includes: generating a candidate set with a hash table by using the DHP method.
  • A candidate set support calculation device based on graph structure data is provided, comprising a memory and a processor, wherein the memory stores computer-executable instructions which, when executed by the processor, are operable to perform the candidate set support calculation method based on graph structure data as described above.
  • A frequent item set determining apparatus is provided, comprising a memory and a processor, wherein the memory stores computer-executable instructions which, when executed by the processor, are operable to perform the following method: obtaining the support of a candidate set using the candidate set support calculation method based on graph structure data as described above; determining whether the support of the candidate set is greater than a predetermined threshold; and, in the case where the support of the candidate set is greater than the predetermined threshold, determining that the candidate set is a frequent item set.
  • A pre-order tree-based frequent item set determining apparatus is provided, comprising a memory and a processor, wherein the memory stores computer-executable instructions which, when executed by the processor, are operable to perform the pre-order tree-based frequent item set determination method as described above.
  • A computer readable storage medium is provided, having stored thereon computer executable instructions which, when executed by a computing device, are operable to perform the candidate set support calculation method based on graph structure data as described above.
  • A computer readable storage medium is provided, having stored thereon computer executable instructions which, when executed by a computing device, are operable to perform the following method: obtaining the support of a candidate set using the candidate set support calculation method based on graph structure data as described above; determining whether the support of the candidate set is greater than a predetermined threshold; and, if the support of the candidate set is greater than the predetermined threshold, determining that the candidate set is a frequent item set.
  • A computer readable storage medium is provided, having stored thereon computer executable instructions which, when executed by a computing device, are operable to perform the pre-order tree-based frequent item set determination method as described above.
  • The candidate set support calculation method based on graph structure data and the frequent item set determining method provided by the embodiments of the present invention avoid the excessive time consumed in obtaining the support of candidate sets by scanning the database, and improve the efficiency of the algorithm while preserving its results.
  • The pre-order tree-based frequent item set determination method models and analyzes the calculation process of the Apriori algorithm based on the pre-order tree and implements the Apriori algorithm using the graph calculation method, thereby optimizing the pre-order tree-based Apriori algorithm and improving its efficiency when the tree is deep.
  • FIG. 1 is a diagram showing an example of a process of converting database records into graph structure data according to an embodiment of the present invention;
  • FIG. 2A and FIG. 2B are schematic diagrams showing the connected components corresponding to the candidate sets shown in FIG. 1;
  • FIG. 3 is a schematic flowchart showing a method for calculating candidate set support based on graph structure data according to an embodiment of the present invention;
  • FIG. 4 is a schematic flowchart showing a frequent item set determining method according to an embodiment of the present invention;
  • FIG. 5 is a schematic flowchart of an optimized pre-order tree-based frequent item set determining method according to an embodiment of the present invention;
  • FIG. 6 is a schematic diagram of an example of an optimized pre-order tree-based Apriori algorithm according to an embodiment of the present invention;
  • FIG. 7 is a schematic diagram of another example of an optimized frequent item set determining algorithm according to an embodiment of the present invention;
  • FIG. 8 is a pseudo code diagram of an algorithm for calculating k-candidate set support using a graph calculation method, in accordance with an embodiment of the present invention;
  • FIG. 9 is a pseudo code diagram of an optimization method of an Apriori algorithm based on a pre-order tree according to an embodiment of the present invention.
  • The term “a” is understood to mean “at least one” or “one or more”; that is, in one embodiment the number of an element may be one, and in other embodiments the number of that element may be more than one. The term “a” cannot be construed as limiting the quantity.
  • Although ordinal numbers such as “first”, “second”, etc. will be used to describe various components, those components are not limited by these terms. The terms are only used to distinguish one component from another. For example, a first component could be termed a second component, and likewise a second component could be termed a first component, without departing from the teachings of the inventive concept.
  • the term "and/or" used herein includes any and all combinations of one or more of the associated listed items.
  • When the pre-order tree is deep, the Apriori algorithm based on the pre-order tree takes a lot of time scanning the database to obtain the support of the candidate sets, which leads to inefficiency. Therefore, in the pre-order tree-based frequent item set determining method according to the embodiment of the present invention, the support of the candidate set is calculated by the graph calculation method, and the frequent item sets are determined from it. In this way, the efficiency of the algorithm can be improved while preserving its results; that is, the Apriori algorithm based on the pre-order tree is optimized.
  • The Apriori algorithm based on the pre-order tree uses a pre-order tree to store frequent itemsets and candidate itemsets. At the beginning of the algorithm, the 1-frequent item sets are obtained by scanning the database once; the algorithm then uses the (k-1)-frequent item sets to generate the k-candidate sets, and uses the database to recursively traverse the pre-order tree to get the support of the k-candidate sets, which in turn yields the k-frequent item sets.
  • The step of generating a k-candidate set from the (k-1)-frequent item sets specifically generates one k-candidate set from two (k-1)-frequent itemsets, and is divided into two stages:
  • Generation stage: suppose there are two (k-1)-frequent itemsets p and q whose first k-2 elements are the same. Then these k-2 shared elements, together with the last element of p and the last element of q, k elements in total, make up a k-candidate set.
  • Verification stage: for each generated k-candidate set, it is necessary to ensure that all of its subsets of order k-1 are (k-1)-frequent item sets. This is because, for a k-frequent item set, all of its subsets of order k-1 are also frequent itemsets. Therefore, if a generated k-candidate set has a subset of order k-1 that is not a (k-1)-frequent item set, that k-candidate set should be removed from the candidate sets.
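The two stages above can be sketched as follows. This is a minimal illustration and not the patent's own pseudo code: the function name `generate_candidates` and the representation of itemsets as sorted tuples of item names are assumptions made for the example.

```python
from itertools import combinations


def generate_candidates(frequent_prev):
    """Build k-candidate sets from (k-1)-frequent itemsets.

    Each itemset is assumed to be a sorted tuple of item names
    (a representation chosen for this sketch only).
    """
    prev = sorted(set(frequent_prev))
    prev_set = set(prev)
    candidates = []
    for i, p in enumerate(prev):
        for q in prev[i + 1:]:
            # Generation stage: p and q must share their first k-2 elements.
            if p[:-1] != q[:-1]:
                continue
            candidate = p + (q[-1],)
            # Verification stage: every subset of order k-1 must be frequent.
            if all(sub in prev_set for sub in combinations(candidate, len(p))):
                candidates.append(candidate)
    return candidates


# Hypothetical 2-frequent itemsets, used only to exercise the sketch.
example_frequent_2 = [("I1", "I3"), ("I2", "I3"), ("I2", "I5"), ("I3", "I5")]
example_candidates_3 = generate_candidates(example_frequent_2)
```

With the hypothetical input above, only ("I2", "I3") and ("I2", "I5") share a first element, and all three 2-element subsets of ("I2", "I3", "I5") are present, so that is the single 3-candidate set produced.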
  • Alternatively, the DHP method can be used to generate candidate sets with a hash table.
  • A hash function is first defined; then a hash value is calculated for each item set, and each scanned item set is placed in the corresponding bucket of the hash table and counted.
  • Only when a bucket's count is greater than a preset threshold are all the item sets corresponding to that bucket's hash value candidate sets.
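A rough sketch of this bucket-counting idea is given below. The text does not specify the hash function, the number of buckets, or how item sets are enumerated, so `bucket_of` (CRC32 of the joined item names), `num_buckets`, and the enumeration of all k-element subsets of each record are assumptions for illustration only.

```python
import zlib
from collections import defaultdict
from itertools import combinations


def bucket_of(itemset, num_buckets):
    # Stand-in hash function (assumption): CRC32 of the joined item names.
    return zlib.crc32("|".join(itemset).encode("utf-8")) % num_buckets


def dhp_candidates(records, k, threshold, num_buckets=7):
    """Keep the k-item sets whose hash bucket count exceeds the threshold."""
    counts = defaultdict(int)   # bucket index -> number of item sets hashed to it
    members = defaultdict(set)  # bucket index -> distinct item sets seen in it
    for record in records:
        for itemset in combinations(sorted(record), k):
            bucket = bucket_of(itemset, num_buckets)
            counts[bucket] += 1
            members[bucket].add(itemset)
    return sorted(
        itemset
        for bucket, itemsets in members.items()
        if counts[bucket] > threshold
        for itemset in itemsets
    )


# Hypothetical records, used only to exercise the sketch.
example_records = [
    {"I1", "I3", "I4"},
    {"I2", "I3", "I5"},
    {"I1", "I2", "I3", "I5"},
    {"I2", "I5"},
]
example_dhp_2 = dhp_candidates(example_records, k=2, threshold=2)
```

Because a bucket's count is at least the count of any item set hashed into it, an item set that really occurs more often than the threshold is never filtered out; hash collisions can only let extra candidates through, which the later support calculation removes.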
  • In the existing approach, the support of the candidate set is obtained by scanning the database.
  • In the embodiment of the present invention, by contrast, the graph calculation method is used to calculate the support of the candidate set in order to determine the frequent item sets.
  • FIG. 1 is a diagram showing an example of a process of converting a database record into graph structure data according to an embodiment of the present invention.
  • The support (count) of a k-candidate set is the number of records in the database that contain this k-candidate set.
  • Each record corresponds to a point in the graph. If two records have a non-empty intersection, there is an edge between the two corresponding points in the graph, and the attribute of the edge is the intersection of the two records.
  • Define an edge in the graph whose attribute contains the k-candidate set as an active edge and every other edge as an inactive edge (inactive edges are treated as absent in this calculation). Then the points corresponding to all the records containing the k-candidate set constitute a connected component, and the points corresponding to the records that do not contain the k-candidate set are isolated points (without edges). Therefore, the support of this k-candidate set is the number of points included in this connected component. That is, in the embodiment of the present invention, by establishing the relationship between the graph data structure and item set support, the number of points in the connected component in the graph structure data is determined to be the support of the corresponding candidate set.
  • Thus, the support of the candidate set can be determined from the number of points included in the connected component corresponding to the candidate set, in order to determine whether the candidate set is a frequent item set.
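The conversion described above can be sketched as follows. The function names `build_graph` and `active_edges` and the record contents are assumptions for illustration; the records are hypothetical and are not taken from FIG. 1.

```python
from itertools import combinations


def build_graph(records):
    """Convert records into graph structure data.

    records maps a record id to its set of items. Returns (points, edges),
    where edges maps a pair of record ids to the intersection of the two
    records; pairs with an empty intersection get no edge.
    """
    points = sorted(records)
    edges = {}
    for u, v in combinations(points, 2):
        shared = frozenset(records[u] & records[v])
        if shared:
            edges[(u, v)] = shared
    return points, edges


def active_edges(edges, candidate):
    """Return the edges whose attribute contains the whole candidate set."""
    wanted = frozenset(candidate)
    return [pair for pair, shared in edges.items() if wanted <= shared]


# Hypothetical records, used only to exercise the sketch.
example_records = {
    "T1": {"I1", "I3", "I4"},
    "T2": {"I2", "I3", "I5"},
    "T3": {"I1", "I2", "I3", "I5"},
    "T4": {"I2", "I5"},
}
example_points, example_edges = build_graph(example_records)
example_active = active_edges(example_edges, {"I2", "I5"})
```

The graph is built once and reused for every candidate set; only the choice of active edges changes from one candidate to the next.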
  • The label-based connected component algorithm is executed as follows: for each vertex in ordinary graph structure data, a label is set (with its id as the initial value), and then iteration starts.
  • One iteration: traverse all the edges in the graph, compare the label values of the two endpoints of each edge, and set the label value of the point with the larger label value to the label value of the point with the smaller label value.
  • Iteration end condition: no point's label value changes during an iteration.
  • In the embodiment of the present invention, the converted graph structure data is special: for a fixed k-candidate set, only the edges containing the k-candidate set are valid, so one connected component and the remaining isolated points are obtained. Therefore, for a k-candidate set, executing the connected component algorithm yields only one connected component, and this connected component corresponds to that k-candidate set.
  • FIG. 2A shows the connected component of the candidate set {I2, I3, I5} in FIG. 1. The connected component includes two points, T2 and T3, so the support of the candidate set {I2, I3, I5} is 2.
  • FIG. 2B shows the connected component of the candidate set {I2, I5} in FIG. 1. The connected component includes three points, T2, T3, and T4, so the support of the candidate set {I2, I5} is 3.
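The label-based iteration restricted to active edges can be sketched as below. The function name and the records are assumptions: the records are hypothetical and were chosen only so that the results agree with the supports stated for FIG. 2A and FIG. 2B. The fallback for a candidate with no active edge is also an assumption, since the text does not discuss a candidate contained in fewer than two records.

```python
from collections import Counter


def support_by_label_propagation(records, candidate):
    """Support of a candidate set as the size of its connected component."""
    wanted = frozenset(candidate)
    ids = sorted(records)
    # Active edges: pairs of records whose intersection contains the candidate.
    edges = [
        (u, v)
        for i, u in enumerate(ids)
        for v in ids[i + 1:]
        if wanted <= (records[u] & records[v])
    ]
    if not edges:
        # Assumption: with no active edge there is no component to count,
        # so fall back to a direct check (the result is 0 or 1).
        return sum(1 for items in records.values() if wanted <= items)
    # Each point starts with a label derived from its position (its "id").
    label = {point: index for index, point in enumerate(ids)}
    changed = True
    while changed:  # stop once an iteration changes no label
        changed = False
        for u, v in edges:
            low = min(label[u], label[v])
            for point in (u, v):
                if label[point] != low:
                    label[point] = low
                    changed = True
    # Isolated points keep unique labels, so the largest label group is
    # the single connected component of the candidate set.
    return max(Counter(label.values()).values())


# Hypothetical records, used only to exercise the sketch.
example_records = {
    "T1": {"I1", "I3", "I4"},
    "T2": {"I2", "I3", "I5"},
    "T3": {"I1", "I2", "I3", "I5"},
    "T4": {"I2", "I5"},
}
support_235 = support_by_label_propagation(example_records, {"I2", "I3", "I5"})
support_25 = support_by_label_propagation(example_records, {"I2", "I5"})
```

For these records the sketch yields 2 for {I2, I3, I5} (points T2 and T3) and 3 for {I2, I5} (points T2, T3, and T4). For brevity the active edges are recomputed from the records here; a fuller implementation would presumably reuse the edge attributes of a graph built once.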
  • FIG. 2A and FIG. 2B are schematic diagrams showing connected components corresponding to the candidate set shown in FIG. 1.
  • According to an embodiment of the present invention, a candidate set support calculation method based on graph structure data is provided, including: converting data in a database into graph structure data, wherein each point in the graph structure data represents a record in the database, and the edge between any two points represents the intersection of the sets of elements of the two records corresponding to the two points; obtaining a candidate set from the database; obtaining the connected component corresponding to the candidate set in the graph structure data; determining the number of points included in the connected component; and determining the number of points as the support of the candidate set.
  • FIG. 3 is a schematic flowchart showing a candidate set support calculation method based on graph structure data according to an embodiment of the present invention.
  • As shown in FIG. 3, the candidate set support calculation method based on graph structure data according to the embodiment of the present invention includes: S101, converting data in the database into graph structure data, wherein each point in the graph structure data represents a record in the database, and an edge between any two points represents the intersection of the sets of elements of the two records corresponding to the two points; S102, obtaining a candidate set from the database; S103, obtaining, in the graph structure data obtained in step S101, the connected component corresponding to the candidate set obtained in step S102; S104, determining the number of points included in the connected component; and S105, determining the number of points as the support of the candidate set.
  • In the above method, the step of obtaining the candidate set from the database specifically includes: obtaining a k-candidate set by the Apriori algorithm based on the pre-order tree, where k is an integer greater than 1.
  • In the above method, the step of obtaining a candidate set from the database specifically includes: generating a candidate set with a hash table by using the DHP method.
  • In the above method, the step of determining the number of points included in the connected component specifically includes: calculating the number of points in the connected component by using an iterative algorithm based on label transfer.
  • After the support of the candidate set has been obtained by using the candidate set support calculation method based on graph structure data according to the embodiment of the present invention, it may, on the one hand, be used to determine whether the candidate set is a frequent item set, as described more specifically below. On the other hand, the obtained support can also be applied to subsequent processes in association rule calculation, and is not limited to determining whether the candidate set is a frequent item set.
  • According to an embodiment of the present invention, a frequent item set determining method is provided, comprising: obtaining the support of a candidate set using the candidate set support calculation method based on graph structure data as described above; determining whether the support of the candidate set is greater than a predetermined threshold; and, if the support of the candidate set is greater than the predetermined threshold, determining the candidate set to be a frequent item set.
  • FIG. 4 is a schematic flowchart showing a frequent item set determining method according to an embodiment of the present invention.
  • As shown in FIG. 4, the frequent item set determining method according to the embodiment of the present invention includes: S201, obtaining the support of the candidate set by using the candidate set support calculation method based on graph structure data as described above; S202, determining whether the support of the candidate set is greater than a predetermined threshold; and S203, in the case where the support of the candidate set is greater than the predetermined threshold, determining that the candidate set is a frequent item set.
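The threshold test of steps S202 and S203 amounts to a strict comparison. A small sketch follows, assuming the supports have already been obtained by the graph-based method (S201) and are held in a dictionary; the name `frequent_item_sets` and the sample values are illustrative only.

```python
def frequent_item_sets(supports, threshold):
    """Keep the candidate sets whose support is strictly greater than threshold.

    supports maps a candidate set (here a tuple of item names) to its support.
    """
    return {
        candidate: support
        for candidate, support in supports.items()
        if support > threshold
    }


# Hypothetical supports, used only to exercise the sketch.
example_supports = {("I2", "I3", "I5"): 2, ("I2", "I5"): 3}
example_frequent = frequent_item_sets(example_supports, threshold=2)
```

Note that the comparison is "greater than", as stated in the method, so a candidate whose support equals the threshold is not kept.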
  • By using the candidate set support calculation method based on graph structure data, the frequent item set determination method avoids the excessive time consumed in obtaining the support of candidate sets by scanning the database, and improves the efficiency of the algorithm while preserving its results.
  • In the existing Apriori algorithm based on the pre-order tree, the 1-frequent item sets are obtained at the beginning of the algorithm by scanning the database once; the algorithm then generates the k-candidate sets from the (k-1)-frequent item sets, uses the database to recursively traverse the pre-order tree to obtain the support of the k-candidate sets, and thereby obtains the k-frequent item sets.
  • The existing Apriori algorithm based on the pre-order tree needs to use the database to recursively traverse the pre-order tree in order to calculate the counts of the candidate sets. As the value of k becomes larger, the efficiency of the Apriori algorithm becomes lower.
  • The embodiment of the present invention therefore further proposes an optimization model of the Apriori algorithm based on the pre-order tree, in which the database is recursively traversed using the existing Apriori algorithm while the value of k is relatively small, and the graph calculation method is used once the value of k is large enough.
  • The number of k-candidate sets is known.
  • The time needed to calculate a connected component on a graph using the graph calculation method is also known. Therefore, the time for calculating the k-frequent item sets using the graph calculation method can be estimated.
  • The Apriori algorithm based on the pre-order tree uses the database to recursively traverse the pre-order tree of depth k.
  • The time of one traversal of depth k can be estimated as: the time of a traversal of depth k-1 + (the time of a traversal of depth k-1 - the time of a traversal of depth k-2).
  • The number of traversals can be estimated by combining the records of the database with the value of k. By multiplying these two quantities, the time the Apriori algorithm needs to calculate the k-frequent item sets can be roughly estimated.
  • The value of k at the switching point is determined by comparing the two times.
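The estimates above can be written out as a small cost model. All function names, parameter names, and the numbers in the example are assumptions for illustration; in particular, the text does not say how the number of traversals is derived from the records and k, so it is taken here as an input.

```python
def estimate_traversal_time(time_k_minus_1, time_k_minus_2):
    """Linear extrapolation of the time of one traversal of depth k."""
    return time_k_minus_1 + (time_k_minus_1 - time_k_minus_2)


def estimate_database_time(time_k_minus_1, time_k_minus_2, num_traversals):
    """First time: estimated time per traversal times the number of traversals."""
    per_traversal = estimate_traversal_time(time_k_minus_1, time_k_minus_2)
    return per_traversal * num_traversals


def estimate_graph_time(num_candidates, time_per_component):
    """Second time: number of k-candidate sets times the time per component."""
    return num_candidates * time_per_component


def choose_method(first_time, second_time):
    # The graph method is used only when the first time is strictly greater.
    # The text does not cover equal times; this sketch then stays with the
    # database traversal (assumption).
    return "graph" if first_time > second_time else "database"


# Hypothetical timings in arbitrary units, used only to exercise the sketch.
example_first = estimate_database_time(4.0, 3.0, num_traversals=1000)
example_second = estimate_graph_time(num_candidates=40, time_per_component=100.0)
example_choice = choose_method(example_first, example_second)
```

With these hypothetical numbers the extrapolated traversal time is 5.0 per traversal, the first time is 5000.0, the second time is 4000.0, and the sketch selects the graph method.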
  • The pre-order tree-based frequent item set determining method comprises: determining a first time taken to recursively traverse the pre-order tree of depth k using the database; determining the number of k-candidate sets, and taking the product of that number and the time needed to calculate a connected component on the graph using the graph calculation method as a second time; comparing the first time with the second time; when the first time is less than the second time, determining the support of the candidate set using database recursive calculation; and, in the case where the first time is greater than the second time, obtaining the support of the candidate set using the candidate set support calculation method based on graph structure data as described above.
  • FIG. 5 is a schematic flowchart of an optimized pre-order tree-based frequent item set determining method according to an embodiment of the present invention.
  • As shown in FIG. 5, an optimized pre-order tree-based frequent item set determining method includes: S301, determining a first time taken to recursively traverse a pre-order tree of depth k using the database; S302, determining the number of k-candidate sets, and taking the product of that number and the time needed to calculate a connected component on the graph using the graph calculation method as a second time; S303, comparing the first time with the second time, that is, determining whether the first time is greater than the second time; S304, when the first time is less than the second time, determining the support of the candidate set using database recursive calculation; and S305, in the case where the first time is greater than the second time, obtaining the support of the candidate set using the candidate set support calculation method based on graph structure data as described above.
  • The above method further includes: determining whether the support of the obtained candidate set is greater than a predetermined threshold; and, in the case where the support of the obtained candidate set is greater than the predetermined threshold, determining that the candidate set is a frequent item set.
  • In the above pre-order tree-based frequent item set determining method, before determining the first time, the method further comprises: obtaining a k-candidate set by the Apriori algorithm based on the pre-order tree, where k is an integer greater than 1.
  • Alternatively, the above method further includes: generating a candidate set with a hash table by using the DHP method.
  • The embodiment of the present invention thus further provides an optimized method for determining frequent item sets based on a pre-order tree.
  • In this method, the Apriori algorithm is implemented based on the graph calculation method.
  • The method optimizes the Apriori algorithm based on the pre-order tree and improves its efficiency when the tree is deep.
  • FIG. 6 is a schematic diagram of an example of an optimized pre-order tree-based Apriori algorithm, in accordance with an embodiment of the present invention.
  • When the algorithm starts, the Apriori algorithm based on the pre-order tree is first run to generate candidate sets, and then it is evaluated whether the graph calculation method is better. If the conclusion is no, the Apriori algorithm based on the pre-order tree continues to run to calculate the support of the candidate sets, and the candidate sets whose support is greater than the preset threshold are taken as frequent item sets.
  • If the conclusion is yes, the method based on graph structure data is used to calculate the support of the candidate sets, and the candidate sets whose support is greater than the predetermined threshold are taken as frequent item sets. Then it is judged whether the set of candidate sets whose support has not yet been calculated is empty. If it is not empty, there are still candidate sets whose support has not been calculated, and their support is calculated based on the graph calculation method. Finally, if that set is empty, the support of all candidate sets has been calculated, the frequent item sets among the candidate sets are thereby determined, and the algorithm ends.
  • FIG. 7 is a schematic diagram of another example of an optimized frequent item set determination algorithm in accordance with an embodiment of the present invention. As shown in FIG. 7, compared with the optimized pre-order tree-based Apriori algorithm according to an embodiment of the present invention shown in FIG. 6, the difference lies in the method of obtaining the candidate sets. That is, in the example shown in FIG. 7, instead of using the Apriori algorithm based on the pre-order tree, other methods for generating candidate sets are used to obtain the candidate sets. Thereafter, the procedure of the example shown in FIG. 7 is the same as that of the example shown in FIG. 6 described earlier, and will not be described again here.
  • The FIM_CC(G, attribute_set) function performs a connected component algorithm on the graph data. In this calculation, an edge is considered valid only when it contains attribute_set.
  • The connected component algorithm uses label passing: the Fe(edge, attribute_set) function performs the specific label transfer, and the Fv(vertex, new_label) function modifies the label of a point.
  • The FIM_GC(D, T, I, G) function counts the support of each candidate set, calculates the frequent itemsets, and writes the results to the pre-order tree.
  • The FIM_GC(D, T, I, G) function ends when no candidate set is generated.
  • FIG. 9 is a pseudo code diagram of an optimization method of an Apriori algorithm based on a pre-order tree according to an embodiment of the present invention.
  • The ANG(D, T, G) function combines the Apriori and graph calculation methods, using Apriori at the beginning of the algorithm.
  • When the estimated time of the graph calculation is smaller than that of Apriori, the algorithm switches to graph calculation.
  • A candidate set is a frequent item set only when its support is greater than the preset threshold; otherwise it is not (marked -1).
  • The algorithm ends when the generated set of candidate sets is empty.
  • The candidate set support calculation method based on graph structure data and the frequent item set determining method provided by the embodiments of the present invention avoid the excessive time consumed in obtaining the support of candidate sets by scanning the database, and improve the efficiency of the algorithm while preserving its results.
  • The pre-order tree-based frequent item set determining method models and analyzes the calculation process of the Apriori algorithm based on the pre-order tree and implements the Apriori algorithm using the graph calculation method, thereby optimizing the pre-order tree-based Apriori algorithm and improving its efficiency when the tree is deep.
  • a candidate set support calculation device based on graph structure data comprising a memory and a processor, wherein the computer stores executable instructions, wherein the computer executable instructions are controlled When executed, the program is operable to perform the candidate set support calculation method based on the graph structure data as described above.
  • the candidate set support calculation method based on the graph structure data according to the embodiment of the present invention has been described above, and will not be described again in order to avoid redundancy.
  • a frequent itemset determining apparatus includes a memory and a processor, wherein the memory stores computer executable instructions which, when executed by the controller, are operable to perform the following method: obtaining the support of the candidate set by using the candidate set support calculation method based on the graph structure data as described above; determining whether the support of the candidate set is greater than a predetermined threshold; and, where the support of the candidate set is greater than the predetermined threshold, determining that the candidate set is a frequent itemset.
  • the candidate set support calculation method based on the graph structure data according to the embodiment of the present invention has been described above, and will not be described again in order to avoid redundancy.
  • a pre-order tree-based frequent itemset determining apparatus includes a memory and a processor, wherein the memory stores computer executable instructions which, when executed by the controller, are operable to perform the pre-order tree-based frequent itemset determination method as described above.
  • the pre-order tree-based frequent item set determining method according to the embodiment of the present invention has been described above, and will not be described again in order to avoid redundancy.
  • a computer readable storage medium having computer executable instructions stored thereon which, when executed by a computing device, are operable to perform the candidate set support calculation method based on the graph structure data as described above.
  • the candidate set support calculation method based on the graph structure data according to the embodiment of the present invention has been described above, and will not be described again in order to avoid redundancy.
  • a computer readable storage medium having computer executable instructions stored thereon which, when executed by a computing device, are operable to perform the following method: obtaining the support of the candidate set by using the candidate set support calculation method based on the graph structure data as described above; determining whether the support of the candidate set is greater than a predetermined threshold; and, where the support of the candidate set is greater than the predetermined threshold, determining that the candidate set is a frequent itemset.
  • the candidate set support calculation method based on the graph structure data according to the embodiment of the present invention has been described above, and will not be described again in order to avoid redundancy.
  • a computer readable storage medium having computer executable instructions stored thereon which, when executed by a computing device, are operable to perform the pre-order tree-based frequent itemset determination method as described above.
  • the pre-order tree-based frequent itemset determination method according to the embodiment of the present invention has been described above, and will not be described again in order to avoid redundancy.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

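Provided are a candidate set support calculation method based on graph structure data, a frequent itemset determination method using the same, and a pre-order tree-based frequent itemset determination method. The candidate set support calculation method based on graph structure data comprises: converting data in a database into graph structure data (S101); obtaining a candidate set from the database (S102); obtaining the connected component in the graph structure data that corresponds to the candidate set (S103); determining the number of vertices contained in the connected component (S104); and determining that number of vertices as the support of the candidate set (S105). The provided candidate set support calculation method and the frequent itemset determination method using it avoid the excessive time spent obtaining candidate set support by scanning the database, improving the efficiency of the algorithm while preserving its results.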

Description

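Method for computing support of itemset candidate based on graph structure data and application thereof
Technical Field
The present invention relates generally to the field of association rules, and more particularly to a candidate set support calculation method based on graph structure data, a frequent itemset determination method using the same, and a pre-order tree-based frequent itemset determination method.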
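Background
The goal of data mining is to extract previously unknown and useful information from a large database, and association rules are one of the most important techniques in data mining. Association rules were first proposed to solve a supermarket sales problem. A large supermarket can collect many transaction records, and the supermarket manager hopes to find useful information in these records to help formulate sales strategies.
A transaction record contains a set of elements; in the supermarket example, each element represents a product. Let I denote the set of all elements, and let X and Y each denote a set of some elements. An association rule is then an implication of the form X→Y, where X is the antecedent of the rule, Y is the consequent of the rule, and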
Figure PCTCN2017096672-appb-000001
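Computing association rules generally involves two steps: 1. find all frequent itemsets; 2. find all association rules based on the frequent itemsets found in the first step. Because the second step is very simple, the cost of association rule mining depends mainly on the first step.
The pre-order tree-based Apriori algorithm is one of the best-known and most widely used algorithms for computing frequent itemsets. Computing the frequent itemsets with k elements takes two steps: 1. generate the k-candidate sets from the (k-1)-frequent itemsets; 2. scan the database to obtain the support of the k-candidate sets and thereby obtain the k-frequent itemsets. The algorithm stores the frequent itemsets in a pre-order tree, in which each node at level k represents a set of k-frequent itemsets.
However, as the depth of the algorithm increases, the depth of the pre-order tree increases, and the algorithm spends too much time in the second step of each loop, namely scanning the database to obtain the support of the candidate sets. This is because, when the pre-order tree is traversed recursively using the database, the increase in tree depth unavoidably causes the number of database traversals to grow too much, which in turn consumes a large amount of time.
Therefore, an improved method for calculating candidate set support and a corresponding method for determining frequent itemsets are needed.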
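Summary of the Invention
The object of the present invention is to address the above defects and deficiencies in the prior art by providing a novel and improved candidate set support calculation method based on graph structure data, a frequent itemset determination method using the same, and a pre-order tree-based frequent itemset determination method.
According to one aspect of the present invention, a candidate set support calculation method based on graph structure data is provided, comprising: converting data in a database into graph structure data, wherein each vertex in the graph structure data represents one record in the database, and an edge between any two vertices represents the intersection of the element sets of the two records corresponding to the two vertices; obtaining a candidate set from the database; obtaining the connected component in the graph structure data that corresponds to the candidate set; determining the number of vertices contained in the connected component; and determining the number of vertices as the support of the candidate set.
In the above candidate set support calculation method based on graph structure data, the step of obtaining a candidate set from the database specifically comprises: obtaining k-candidate sets by the pre-order tree-based Apriori algorithm, where k is an integer greater than 1.
Alternatively, in the above candidate set support calculation method based on graph structure data, the step of obtaining a candidate set from the database specifically comprises: generating candidate sets with a hash table using the DHP method.
In the above candidate set support calculation method based on graph structure data, the step of determining the number of vertices contained in the connected component specifically comprises: calculating the number of vertices in the connected component in a single iteration using a label propagation-based connected component algorithm.
According to another aspect of the present invention, a frequent itemset determination method is provided, comprising: obtaining the support of a candidate set using the candidate set support calculation method based on graph structure data as described above; determining whether the support of the candidate set is greater than a predetermined threshold; and, where the support of the candidate set is greater than the predetermined threshold, determining that the candidate set is a frequent itemset.
According to yet another aspect of the present invention, a pre-order tree-based frequent itemset determination method is provided, comprising: determining a first time taken to recursively traverse a pre-order tree of depth k using the database; determining the number of k-candidate sets, and taking as a second time the number of k-candidate sets multiplied by the time taken to calculate one connected component on the graph using the graph computation method; comparing the first time with the second time; where the first time is less than the second time, determining the support of the candidate sets by recursive computation using the database; and, where the first time is greater than the second time, obtaining the support of the candidate sets using the candidate set support calculation method based on graph structure data as described above.
The above pre-order tree-based frequent itemset determination method further comprises: determining whether the obtained support of a candidate set is greater than a predetermined threshold; and, where the obtained support of the candidate set is greater than the predetermined threshold, determining that the candidate set is a frequent itemset.
The above pre-order tree-based frequent itemset determination method further comprises, before the first time is determined: obtaining k-candidate sets by the pre-order tree-based Apriori algorithm, where k is an integer greater than 1.
The above pre-order tree-based frequent itemset determination method further comprises, before the first time is determined: generating candidate sets with a hash table using the DHP method.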
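According to a further aspect of the present invention, a candidate set support calculation apparatus based on graph structure data is provided, comprising a memory and a processor, the memory storing computer-executable instructions which, when executed by the controller, are operable to perform the candidate set support calculation method based on graph structure data as described above.
According to a further aspect of the present invention, a frequent itemset determination apparatus is provided, comprising a memory and a processor, the memory storing computer-executable instructions which, when executed by the controller, are operable to perform the following method: obtaining the support of a candidate set using the candidate set support calculation method based on graph structure data as described above; determining whether the support of the candidate set is greater than a predetermined threshold; and, where the support of the candidate set is greater than the predetermined threshold, determining that the candidate set is a frequent itemset.
According to a further aspect of the present invention, a pre-order tree-based frequent itemset determination apparatus is provided, comprising a memory and a processor, the memory storing computer-executable instructions which, when executed by the controller, are operable to perform the pre-order tree-based frequent itemset determination method as described above.
According to a further aspect of the present invention, a computer-readable storage medium is provided, having stored thereon computer-executable instructions which, when executed by a computing device, are operable to perform the candidate set support calculation method based on graph structure data as described above.
According to a further aspect of the present invention, a computer-readable storage medium is provided, having stored thereon computer-executable instructions which, when executed by a computing device, are operable to perform the following method: obtaining the support of a candidate set using the candidate set support calculation method based on graph structure data as described above; determining whether the support of the candidate set is greater than a predetermined threshold; and, where the support of the candidate set is greater than the predetermined threshold, determining that the candidate set is a frequent itemset.
According to a further aspect of the present invention, a computer-readable storage medium is provided, having stored thereon computer-executable instructions which, when executed by a computing device, are operable to perform the pre-order tree-based frequent itemset determination method as described above.
The candidate set support calculation method based on graph structure data and the frequent itemset determination method using it, as provided by the embodiments of the present invention, avoid the excessive time spent obtaining candidate set support by scanning the database, improving the efficiency of the algorithm while preserving its results.
In addition, the pre-order tree-based frequent itemset determination algorithm provided by the embodiments of the present invention models and analyzes the computation process of the pre-order tree-based Apriori algorithm and optimizes that algorithm by implementing it with the graph computation method, thereby improving the efficiency of the pre-order tree-based Apriori algorithm at greater depths.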
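Brief Description of the Drawings
FIG. 1 is an example diagram of the process of converting database records into graph structure data according to an embodiment of the present invention;
FIG. 2A and FIG. 2B are schematic diagrams showing the connected components corresponding to the candidate sets shown in FIG. 1;
FIG. 3 is a schematic flowchart of the candidate set support calculation method based on graph structure data according to an embodiment of the present invention;
FIG. 4 is a schematic flowchart of the frequent itemset determination method according to an embodiment of the present invention;
FIG. 5 is a schematic flowchart of the optimized pre-order tree-based frequent itemset determination method according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of an example of the optimized pre-order tree-based Apriori algorithm according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of another example of the optimized frequent itemset determination algorithm according to an embodiment of the present invention;
FIG. 8 is a pseudocode diagram of the algorithm that calculates k-candidate set support using the graph computation method according to an embodiment of the present invention;
FIG. 9 is a pseudocode diagram of the optimization method of the pre-order tree-based Apriori algorithm according to an embodiment of the present invention.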
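Detailed Description
The following description is provided to disclose the present invention so that those skilled in the art can practice it. The preferred embodiments in the following description are given only as examples, and other obvious variations will occur to those skilled in the art. The basic principles of the invention defined in the following description may be applied to other embodiments, variations, improvements, equivalents, and other technical solutions that do not depart from the spirit and scope of the invention.
The terms and words used in the following specification and claims are not limited to their literal meanings, but are used by the inventors only to enable a clear and consistent understanding of the invention. Accordingly, it should be apparent to those skilled in the art that the following description of various embodiments of the invention is provided for illustration only and not for the purpose of limiting the invention as defined by the appended claims and their equivalents.
It is to be understood that the term "a" should be read as "at least one" or "one or more"; that is, in one embodiment the number of an element may be one, while in another embodiment the number of that element may be more than one, and the term "a" is not to be construed as a limitation on quantity.
Although ordinal terms such as "first" and "second" will be used to describe various components, those components are not limited by these terms. The terms are used only to distinguish one component from another. For example, a first component could be termed a second component, and likewise a second component could be termed a first component, without departing from the teaching of the inventive concept. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
The terminology used herein is for the purpose of describing various embodiments only and is not intended to be limiting. As used herein, singular forms are intended to include plural forms as well, unless the context clearly indicates otherwise. It will further be understood that the terms "comprising" and/or "having", when used in this specification, specify the presence of the stated features, numbers, steps, operations, components, elements, or combinations thereof, but do not preclude the presence or addition of one or more other features, numbers, steps, operations, components, elements, or groups thereof.
Terms used herein, including technical and scientific terms, have the same meanings as commonly understood by those skilled in the art, unless they are defined differently. It should be understood that terms defined in commonly used dictionaries have meanings consistent with their meanings in the prior art.
The invention is described in further detail below with reference to the accompanying drawings and specific embodiments: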
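As described above, the pre-order tree-based Apriori algorithm is inefficient at greater depths because scanning the database to obtain the support of the candidate sets takes a great deal of time. Therefore, in the pre-order tree-based frequent itemset determination method according to an embodiment of the present invention, the support of the candidate sets is calculated by a graph computation method in order to determine the frequent itemsets. In this way, the efficiency of the algorithm can be improved while its results are preserved; that is, the pre-order tree-based Apriori algorithm is optimized.
The pre-order tree-based Apriori algorithm uses a pre-order tree to store the frequent itemsets and the candidate frequent itemsets. At the start of the algorithm, the 1-frequent itemsets are obtained by scanning the database once; the algorithm then uses the (k-1)-frequent itemsets to generate the k-candidate sets, and recursively traverses the pre-order tree using the database to obtain the support of the k-candidate sets and thereby the k-frequent itemsets.
Specifically, in the Apriori algorithm, the step of generating k-candidate sets from (k-1)-frequent itemsets produces one k-candidate set from two (k-1)-frequent itemsets, and is divided into two phases:
Generation phase: suppose there are two (k-1)-frequent itemsets p and q whose first k-2 elements are identical. These k-2 shared elements, together with the one remaining element from each of p and q, give k elements in total that form a k-candidate set.
Verification phase: for the k-candidate set so generated, it must be ensured that all of its (k-1)-element subsets are in the collection of (k-1)-frequent itemsets. This is because, for a k-frequent itemset, all of its (k-1)-element subsets are certain to be frequent itemsets as well. So if the generated k-candidate set has a (k-1)-element subset that is not in the collection of (k-1)-frequent itemsets, the k-candidate set should be removed from the candidate collection.
For example, suppose the collection of 3-frequent itemsets is already known: {{1,2,3},{1,2,4},{1,3,4},{1,3,5},{2,3,4}}. The generation phase can produce two 4-candidate sets: {1,2,3,4} (generated from {1,2,3} and {1,2,4}) and {1,3,4,5} (generated from {1,3,4} and {1,3,5}).
In the verification phase, because the 3-element subset {3,4,5} of {1,3,4,5} is not in the collection of 3-frequent itemsets, {1,3,4,5} must be removed from the 4-candidate sets. Thus only one 4-candidate set is produced, namely {1,2,3,4}.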
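The generation and verification phases can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the function name `apriori_gen` and the choice to represent itemsets as sorted tuples are assumptions made here, and the pre-order tree storage is left out.

```python
from itertools import combinations

def apriori_gen(frequent_prev):
    """Build k-candidate sets from (k-1)-frequent itemsets (hypothetical helper)."""
    # Represent every itemset as a sorted tuple so prefixes can be compared.
    prev = sorted(tuple(sorted(s)) for s in frequent_prev)
    prev_set = set(prev)
    candidates = []
    for i, p in enumerate(prev):
        for q in prev[i + 1:]:
            # Generation phase: p and q must share their first k-2 elements.
            if p[:-1] != q[:-1]:
                continue
            cand = p + (q[-1],)
            # Verification phase: every (k-1)-element subset must be frequent.
            if all(sub in prev_set for sub in combinations(cand, len(cand) - 1)):
                candidates.append(cand)
    return candidates

# The 3-frequent itemsets from the example in the text.
three_frequent = [{1, 2, 3}, {1, 2, 4}, {1, 3, 4}, {1, 3, 5}, {2, 3, 4}]
four_candidates = apriori_gen(three_frequent)
```

On the example collection, the join step produces (1, 2, 3, 4) and (1, 3, 4, 5), and the subset check discards the latter, matching the result stated above.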
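Although the above describes how candidate sets are obtained in the pre-order tree-based Apriori algorithm, the candidate set support calculation method based on graph structure data according to an embodiment of the present invention, and the frequent itemset determination method using it, are not limited to obtaining candidate sets with the pre-order tree-based Apriori algorithm; other algorithms may also be used to obtain candidate sets.
For example, the DHP method (Direct Hashing and Pruning) may be used to generate candidate sets with a hash table. Specifically, a hash function is first defined; a hash value is then calculated for each itemset, and each scanned itemset is placed into the corresponding bucket of the hash table and counted. Only when the count of a bucket is greater than a preset threshold are all the itemsets corresponding to that bucket's hash value candidate sets.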
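A rough sketch of the bucket-counting idea is given below. The text does not specify the hash function, the table size, or the records, so the function `dhp_candidates`, the hash "sum of item ids modulo the table size", and the four sample records are all assumptions chosen only to make the example runnable.

```python
from itertools import combinations

def dhp_candidates(records, k, num_buckets, min_count):
    """Hash every k-itemset of every record into a bucket and keep the
    itemsets whose bucket count exceeds min_count (hypothetical helper)."""
    buckets = [0] * num_buckets
    seen = set()

    def bucket_of(itemset):
        # Assumed hash function: sum of the integer item ids modulo table size.
        return sum(itemset) % num_buckets

    for items in records:
        for itemset in combinations(sorted(items), k):
            buckets[bucket_of(itemset)] += 1
            seen.add(itemset)
    # An itemset is a candidate only if its bucket count is above the threshold.
    return sorted(s for s in seen if buckets[bucket_of(s)] > min_count)

# Assumed sample records with integer item ids.
sample_records = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
two_candidates = dhp_candidates(sample_records, k=2, num_buckets=7, min_count=1)
```

Because different itemsets can share a bucket, some infrequent itemsets may pass the filter (here (1, 4) and (3, 4) do), which is why the support of each candidate still has to be determined afterwards.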
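Once the candidate sets have been obtained, their support must be determined in order to determine the frequent itemsets.
As described above, in the pre-order tree-based Apriori algorithm the support of a candidate set is obtained by scanning the database. In the candidate set support calculation method based on graph structure data according to an embodiment of the present invention, and the frequent itemset determination method using it, by contrast, a graph computation method is used to calculate the support of the candidate sets and thereby determine the frequent itemsets.
Specifically, determining frequent itemsets with the graph computation method first requires converting the data into graph structure data. A record in the database contains its own id and a set of elements. The data can therefore be converted into graph structure data according to the following rules: 1. each vertex in the graph represents one record in the database, and the id of the record is the id of the vertex; 2. there is an edge between two vertices in the graph if and only if the intersection of the element sets of the corresponding records is non-empty, and this intersection is the attribute of the edge, as shown in FIG. 1. Here, FIG. 1 is an example diagram of the process of converting database records into graph structure data according to an embodiment of the present invention.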
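The two conversion rules can be sketched as follows. The function name `build_graph` and the dictionary representation are assumptions made for illustration. The four records are also assumed: they are not taken from FIG. 1, and were chosen only so that they agree with the supports the text later states for {I2,I3,I5} and {I2,I5}.

```python
from itertools import combinations

def build_graph(records):
    """records maps a record id to its element set. Returns a dict mapping
    each vertex pair (u, v) to the edge attribute, i.e. the non-empty
    intersection of the two records (hypothetical representation)."""
    edges = {}
    for u, v in combinations(sorted(records), 2):
        common = records[u] & records[v]
        if common:  # Rule 2: an edge exists only for a non-empty intersection.
            edges[(u, v)] = common
    return edges

# Assumed records (rule 1: one vertex per record, keyed by the record id).
records = {
    "T1": {"I1", "I3", "I4"},
    "T2": {"I2", "I3", "I5"},
    "T3": {"I1", "I2", "I3", "I5"},
    "T4": {"I2", "I5"},
}
edges = build_graph(records)
```

With these records, T1 and T4 share no element and therefore get no edge, while the edge between T2 and T3 carries the attribute {I2, I3, I5}.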
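For a given k-candidate set, its count is the number of records in the database that contain the k-candidate set. In the graph structure, records correspond to vertices; if two records have a non-empty intersection, there is an edge between the two corresponding vertices in the graph, and the attribute of the edge is the intersection of the two records.
Therefore, if an edge in the graph whose attribute contains the k-candidate set is defined as an active edge, and every other edge as an inactive edge (treated as absent in this computation), then the vertices corresponding to all records that contain the k-candidate set form one connected component, and the vertices corresponding to records that do not contain the k-candidate set are all isolated vertices (with no edges). The support of the k-candidate set is therefore the number of vertices contained in this connected component in this computation. That is, by studying the relationship between the graph data structure and itemset support, the embodiment of the present invention establishes that the number of vertices in the connected component of the above graph structure data is the support of the corresponding candidate set.
Thus, in the candidate set support calculation method based on graph structure data according to an embodiment of the present invention, and the frequent itemset determination method using it, the support of a candidate set can be determined from the number of vertices contained in the connected component corresponding to the candidate set, and hence whether the candidate set is a frequent itemset.
Therefore, determining frequent itemsets with the graph computation method only requires computing a number of connected components (depending on the number of candidate sets) and then counting the vertices in each. Note that when the graph computation method is used to calculate frequent itemsets, the connected component sought is in fact a special one: every pair of vertices in it is joined by an edge. Consequently, when a label propagation-based Connected Components algorithm is used, a single iteration suffices to calculate the number of vertices in the connected component.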
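The label propagation-based connected component algorithm proceeds as follows: for each vertex in an ordinary graph structure, a label is set (initially the vertex's id), and iteration then begins.
One iteration: traverse all edges in the graph; for each edge, compare the label values of its two endpoints and set the label of the endpoint with the larger label to the label of the endpoint with the smaller label.
Termination condition: no vertex's label changes during an iteration.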
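For a general graph, the procedure just described can be sketched as below. The function name `connected_component_labels` and the small test graph are assumptions made for illustration; the loop follows the stated iteration and termination rules.

```python
def connected_component_labels(vertices, edges):
    """Label propagation over an edge list (hypothetical helper).
    Every vertex starts with its own id as label; vertices that end up
    with the same label belong to the same connected component."""
    labels = {v: v for v in vertices}
    changed = True
    while changed:  # Termination: an iteration in which no label changed.
        changed = False
        for u, v in edges:
            # Give the endpoint with the larger label the smaller label.
            low = min(labels[u], labels[v])
            for w in (u, v):
                if labels[w] != low:
                    labels[w] = low
                    changed = True
    return labels

# Assumed toy graph with two components: {1, 2, 3} and {4, 5}.
labels = connected_component_labels([1, 2, 3, 4, 5], [(1, 2), (2, 3), (4, 5)])
```

On a general graph the number of iterations depends on the graph's shape and on the edge order; the single-iteration property claimed in the text relies on the component being fully connected.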
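In addition, as shown in FIG. 1, in the candidate set support calculation method based on graph structure data according to an embodiment of the present invention, the graph structure data obtained by conversion is special: for a fixed k-candidate set, only the edges that contain the k-candidate set are valid, which yields one connected component and the remaining isolated vertices. So for a k-candidate set, running the connected component algorithm yields only one connected component, and this connected component corresponds to that k-candidate set.
For example, FIG. 2A shows the connected component of the candidate set {I2,I3,I5} from FIG. 1; this connected component contains the two vertices T2 and T3, so the support of the candidate set {I2,I3,I5} is 2. As a further example, FIG. 2B shows the connected component of the candidate set {I2,I5} from FIG. 1; this connected component contains the three vertices T2, T3, and T4, so the support of the candidate set {I2,I5} is 3. Here, FIG. 2A and FIG. 2B are schematic diagrams showing the connected components corresponding to the candidate sets shown in FIG. 1.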
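The two supports above can be reproduced with a single pass of label propagation over the active edges. In this sketch the function name `support_from_component` is an assumption, and the edge attributes are assumed values chosen to agree with the supports stated in the text; they are not read from FIG. 1.

```python
def support_from_component(edges, candidate):
    """Count the vertices of the component formed by the active edges,
    i.e. the edges whose attribute contains the candidate (hypothetical helper)."""
    wanted = set(candidate)
    active = [pair for pair, common in edges.items() if wanted <= common]
    labels = {v: v for pair in active for v in pair}
    # Single pass: the active edges form a fully connected component, so
    # every vertex meets the minimum-label vertex directly.
    for u, v in active:
        low = min(labels[u], labels[v])
        labels[u] = labels[v] = low
    if not labels:
        # No active edge. Note: a candidate held by exactly one record also
        # lands here, so this sketch reports 0 rather than 1 in that case.
        return 0
    root = min(labels.values())
    return sum(1 for label in labels.values() if label == root)

# Assumed edge attributes (pairwise intersections of four assumed records).
edges = {
    ("T1", "T2"): {"I3"},
    ("T1", "T3"): {"I1", "I3"},
    ("T2", "T3"): {"I2", "I3", "I5"},
    ("T2", "T4"): {"I2", "I5"},
    ("T3", "T4"): {"I2", "I5"},
}
support_235 = support_from_component(edges, {"I2", "I3", "I5"})
support_25 = support_from_component(edges, {"I2", "I5"})
```

The single-record case noted in the comment only affects the outcome if the frequency threshold is below 1, since a support of 1 and a reported 0 are both rejected by any threshold of 1 or more.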
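In summary, according to one aspect of the embodiments of the present invention, a candidate set support calculation method based on graph structure data is provided, comprising: converting data in a database into graph structure data, wherein each vertex in the graph structure data represents one record in the database, and an edge between any two vertices represents the intersection of the element sets of the two records corresponding to the two vertices; obtaining a candidate set from the database; obtaining the connected component in the graph structure data that corresponds to the candidate set; determining the number of vertices contained in the connected component; and determining the number of vertices as the support of the candidate set.
FIG. 3 is a schematic flowchart of the candidate set support calculation method based on graph structure data according to an embodiment of the present invention. As shown in FIG. 3, the method comprises: S101, converting data in a database into graph structure data, wherein each vertex in the graph structure data represents one record in the database, and an edge between any two vertices represents the intersection of the element sets of the two records corresponding to the two vertices; S102, obtaining a candidate set from the database; S103, obtaining the connected component, in the graph structure data obtained in step S101, that corresponds to the candidate set obtained in step S102; S104, determining the number of vertices contained in the connected component; and S105, determining the number of vertices as the support of the candidate set.
In the above candidate set support calculation method based on graph structure data, the step of obtaining a candidate set from the database specifically comprises: obtaining k-candidate sets by the pre-order tree-based Apriori algorithm, where k is an integer greater than 1.
Alternatively, in the above candidate set support calculation method based on graph structure data, the step of obtaining a candidate set from the database specifically comprises: generating candidate sets with a hash table using the DHP method.
The specific processes of obtaining candidate sets by the pre-order tree-based Apriori algorithm and by the DHP method have been described above and are not repeated here.
In the above candidate set support calculation method based on graph structure data, the step of determining the number of vertices contained in the connected component specifically comprises: calculating the number of vertices in the connected component in a single iteration using a label propagation-based connected component algorithm.
Thus, the candidate set support calculation method based on graph structure data according to an embodiment of the present invention avoids the excessive time spent obtaining candidate set support by scanning the database, improving the efficiency of the algorithm while preserving its results.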
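After the support of a candidate set has been obtained with the candidate set support calculation method based on graph structure data according to an embodiment of the present invention, it can, on the one hand, be used to determine whether the candidate set is a frequent itemset, as described in more detail below. On the other hand, the obtained support can also be used in subsequent stages of association rule mining and is not limited to determining whether a candidate set is a frequent itemset.
Accordingly, according to another aspect of the embodiments of the present invention, a frequent itemset determination method is provided, comprising: obtaining the support of a candidate set using the candidate set support calculation method based on graph structure data as described above; determining whether the support of the candidate set is greater than a predetermined threshold; and, where the support of the candidate set is greater than the predetermined threshold, determining that the candidate set is a frequent itemset.
FIG. 4 is a schematic flowchart of the frequent itemset determination method according to an embodiment of the present invention. As shown in FIG. 4, the method comprises: S201, obtaining the support of a candidate set using the candidate set support calculation method based on graph structure data as described above; S202, determining whether the support of the candidate set is greater than a predetermined threshold; and S203, where the support of the candidate set is greater than the predetermined threshold, determining that the candidate set is a frequent itemset.
That is, after the support of the candidate sets has been calculated by the above method, a candidate set whose support is greater than the predetermined threshold is determined to be a frequent itemset. By using the candidate set support calculation method based on graph structure data, the frequent itemset determination method according to an embodiment of the present invention thus avoids the excessive time spent obtaining candidate set support by scanning the database, improving the efficiency of the algorithm while preserving its results.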
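Steps S202 and S203 amount to a strict comparison against the threshold. A minimal sketch follows, with the function name `select_frequent`, the dictionary of supports, and the threshold value all assumed for illustration.

```python
def select_frequent(supports, threshold):
    """Keep the candidate sets whose support is strictly greater than the
    predetermined threshold (hypothetical helper for S202/S203)."""
    return {cand: s for cand, s in supports.items() if s > threshold}

# Supports as stated for FIG. 2A and FIG. 2B; the threshold 2 is assumed.
supports = {("I2", "I3", "I5"): 2, ("I2", "I5"): 3}
frequent = select_frequent(supports, threshold=2)
```

The comparison is strict because the text requires the support to be greater than the threshold, so a candidate whose support equals the threshold is not selected.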
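As described above, in the existing pre-order tree-based Apriori algorithm, the 1-frequent itemsets are obtained at the start by scanning the database once; the algorithm then generates the k-candidate sets from the (k-1)-frequent itemsets and recursively traverses the pre-order tree using the database to obtain the support of the k-candidate sets and thereby the k-frequent itemsets. In this algorithm, when k is small the collection of candidate sets is very large, so using the graph computation method to calculate frequent itemsets is relatively inefficient. As k increases, however, the size of the collection of candidate sets shrinks sharply, and the efficiency of calculating frequent itemsets with the graph computation method rises accordingly. Meanwhile, the existing pre-order tree-based Apriori algorithm must recursively traverse the pre-order tree using the database to count the candidate sets, and its efficiency falls as k grows.
Based on the above, the embodiment of the present invention further proposes an optimized model of the pre-order tree-based Apriori algorithm, in which the database recursive traversal of the existing Apriori algorithm is used when k is small, and the graph computation method is used once k is sufficiently large.
First, the determination of the switching value of k is explained.
During the computation, the number of k-candidate sets can be known. The time taken by the graph computation method to calculate one connected component on a graph is also known. The time needed to calculate the k-frequent itemsets with the graph computation method can therefore be estimated.
For k-candidate sets, the pre-order tree-based Apriori algorithm recursively traverses a pre-order tree of depth k using the database. The time of one traversal of depth k can be estimated as: the time of one traversal of depth k-1 + (the time of one traversal of depth k-1 - the time of one traversal of depth k-2). The number of traversals can be estimated by computing the number of combinations of the database records with respect to k. Multiplying these two figures gives a rough estimate of the time the Apriori algorithm needs to calculate the k-frequent itemsets.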
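The Apriori time estimate can be sketched as below. The per-traversal extrapolation follows the formula in the text. The traversal count is only loosely specified there, so this sketch assumes one reading: each record of n elements contributes C(n, k) traversals. All function names and sample numbers are hypothetical.

```python
from math import comb

def estimate_traversal_time(t_prev, t_prev2):
    """Time of one depth-k traversal, extrapolated linearly from the measured
    times at depth k-1 (t_prev) and depth k-2 (t_prev2)."""
    return t_prev + (t_prev - t_prev2)

def estimate_traversal_count(record_sizes, k):
    """ASSUMPTION: one traversal per k-element combination of each record."""
    return sum(comb(n, k) for n in record_sizes)

def estimate_apriori_time(t_prev, t_prev2, record_sizes, k):
    """Rough Apriori cost at level k: per-traversal time times traversal count."""
    return estimate_traversal_time(t_prev, t_prev2) * estimate_traversal_count(record_sizes, k)

# Assumed measurements: 5.0 time units at depth 2, 3.0 at depth 1,
# and four records holding 3, 3, 4 and 2 elements.
first_time = estimate_apriori_time(5.0, 3.0, [3, 3, 4, 2], k=3)
```

Other readings of "the number of combinations of the database records with respect to k" are possible, in which case only `estimate_traversal_count` would need to change.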
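The switching value of k is then determined by comparing these two times.
In other words, a pre-order tree-based frequent itemset determination method is obtained by optimizing the pre-order tree-based Apriori algorithm.
Thus, according to yet another aspect of the embodiments of the present invention, a pre-order tree-based frequent itemset determination method is proposed, comprising: determining a first time taken to recursively traverse a pre-order tree of depth k using the database; determining the number of k-candidate sets, and taking as a second time the number of k-candidate sets multiplied by the time taken to calculate one connected component on the graph using the graph computation method; comparing the first time with the second time; where the first time is less than the second time, determining the support of the candidate sets by recursive computation using the database; and, where the first time is greater than the second time, obtaining the support of the candidate sets using the candidate set support calculation method based on graph structure data as described above.
FIG. 5 is a schematic flowchart of the optimized pre-order tree-based frequent itemset determination method according to an embodiment of the present invention. As shown in FIG. 5, the method comprises: S301, determining a first time taken to recursively traverse a pre-order tree of depth k using the database; S302, determining the number of k-candidate sets, and taking as a second time the number of k-candidate sets multiplied by the time taken to calculate one connected component on the graph using the graph computation method; S303, comparing the first time with the second time, that is, determining whether the first time is greater than the second time; S304, where the first time is less than the second time, determining the support of the candidate sets by recursive computation using the database; and S305, where the first time is greater than the second time, obtaining the support of the candidate sets using the candidate set support calculation method based on graph structure data as described above.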
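The comparison in S302 to S305 can be sketched as a small decision function. The function name `choose_support_method` and the returned strings are assumptions; the text does not say what happens when the two times are equal, so falling back to the database traversal in that case is also an assumption.

```python
def choose_support_method(first_time, num_candidates, time_per_component):
    """Return which support calculation to use at level k (hypothetical helper).
    first_time: estimated time of the recursive database traversal (S301).
    second_time: number of k-candidate sets times the time of one
    connected-component computation on the graph (S302)."""
    second_time = num_candidates * time_per_component
    if first_time > second_time:
        return "graph"      # S305: graph-based support calculation
    return "database"       # S304; a tie also lands here by assumption

# Assumed figures for illustration only.
choice_many = choose_support_method(42.0, num_candidates=100, time_per_component=1.5)
choice_few = choose_support_method(42.0, num_candidates=4, time_per_component=1.5)
```

With many candidates the second time dominates and the database traversal is kept; once the candidate count has fallen far enough, the graph method is selected.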
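Likewise, the above pre-order tree-based frequent itemset determination method further comprises: determining whether the obtained support of a candidate set is greater than a predetermined threshold; and, where the obtained support of the candidate set is greater than the predetermined threshold, determining that the candidate set is a frequent itemset.
Furthermore, the above pre-order tree-based frequent itemset determination method further comprises, before the first time is determined: obtaining k-candidate sets by the pre-order tree-based Apriori algorithm, where k is an integer greater than 1.
Alternatively, the above pre-order tree-based frequent itemset determination method further comprises, before the first time is determined: generating candidate sets with a hash table using the DHP method.
Accordingly, the embodiment of the present invention further provides an optimized pre-order tree-based frequent itemset determination method which models and analyzes the computation process of the pre-order tree-based Apriori algorithm and optimizes that algorithm by implementing it with the graph computation method, thereby improving the efficiency of the pre-order tree-based Apriori algorithm at greater depths.
FIG. 6 is a schematic diagram of an example of the optimized pre-order tree-based Apriori algorithm according to an embodiment of the present invention. As shown in FIG. 6, after the algorithm starts, the pre-order tree-based Apriori algorithm is first run to generate candidate sets, and it is then evaluated whether the graph computation method is better. If not, the pre-order tree-based Apriori algorithm continues to run and calculates the support of the candidate sets, yielding the candidate sets whose support is greater than the predetermined threshold as frequent itemsets. Conversely, if the graph computation method is better, the support of the candidate sets is calculated by the above method based on graph structure data, yielding the candidate sets whose support is greater than the predetermined threshold as frequent itemsets. It is then determined whether the collection of candidate sets whose support has not yet been calculated is empty; if not, candidate sets with uncalculated support remain, and their support is calculated using the graph computation method. Finally, if the collection of candidate sets with uncalculated support is empty, the support of all candidate sets has been calculated and the frequent itemsets among them have been determined, and the algorithm ends.
FIG. 7 is a schematic diagram of another example of the optimized frequent itemset determination algorithm according to an embodiment of the present invention. As shown in FIG. 7, the difference from the optimized pre-order tree-based Apriori algorithm shown in FIG. 6 lies in how the candidate sets are obtained. That is, in the example shown in FIG. 7, the candidate sets are obtained not by the pre-order tree-based Apriori algorithm but by another candidate set generation method. Thereafter, the process of the example shown in FIG. 7 is the same as that of the example shown in FIG. 6 described above and is not repeated here.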
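FIG. 8 is a pseudocode diagram of the algorithm that calculates k-candidate set support using the graph computation method according to an embodiment of the present invention. As shown in FIG. 8, the FIM_CC(G, attribute_set) function runs the connected component algorithm once on the graph data; in this computation an edge is treated as valid only if it contains attribute_set. The connected component algorithm uses label propagation: the Fe(edge, attribute_set) function performs the actual label propagation, and the Fv(vertex, new_label) function updates a vertex's label. In addition, the FIM_GC(D, T, I, G) function counts the support of each candidate set, calculates the frequent itemsets, and writes the results to the pre-order tree. The FIM_GC(D, T, I, G) function ends when no candidate set is generated.
FIG. 9 is a pseudocode diagram of the optimization method of the pre-order tree-based Apriori algorithm according to an embodiment of the present invention. As shown in FIG. 9, the ANG(D, T, G) function combines Apriori with the graph computation method: it uses Apriori in the early stage of the algorithm and switches to graph computation when the estimated time of graph computation is less than that of Apriori. A candidate set is retained only when its support is greater than the preset threshold; otherwise it is not (it is marked -1). The algorithm ends when the generated collection of candidate sets is empty.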
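The overall hybrid loop might look like the following sketch. It is not the pseudocode of FIG. 9, which is not reproduced in the text: every function name, the `prefer_graph` callback standing in for the time comparison, the plain dictionary used instead of a pre-order tree, and the four sample records are assumptions made here.

```python
from itertools import combinations

def build_graph(records):
    """Edge (u, v) carries the non-empty intersection of records u and v."""
    edges = {}
    for u, v in combinations(sorted(records), 2):
        common = records[u] & records[v]
        if common:
            edges[(u, v)] = common
    return edges

def scan_support(records, candidate):
    """Database-scan support: number of records containing the candidate."""
    return sum(1 for items in records.values() if set(candidate) <= items)

def graph_support(edges, candidate):
    """Graph support: vertices touched by edges whose attribute contains the
    candidate. A candidate held by a single record yields 0 here, which is
    harmless as long as the threshold is at least 1."""
    wanted = set(candidate)
    return len({v for pair, common in edges.items() if wanted <= common for v in pair})

def generate_candidates(level):
    """Join (k-1)-itemsets sharing a (k-2)-prefix, then prune by subsets."""
    prev = sorted(level)
    known = set(prev)
    out = []
    for i, p in enumerate(prev):
        for q in prev[i + 1:]:
            if p[:-1] == q[:-1]:
                cand = p + (q[-1],)
                if all(s in known for s in combinations(cand, len(cand) - 1)):
                    out.append(cand)
    return out

def hybrid_mine(records, threshold, prefer_graph):
    """prefer_graph(k, num_candidates) -> bool is a hypothetical cost model."""
    edges = build_graph(records)
    frequent, level = {}, []
    for item in sorted({i for items in records.values() for i in items}):
        s = scan_support(records, (item,))  # 1-itemsets: one database scan
        if s > threshold:
            frequent[(item,)] = s
            level.append((item,))
    k, graph_mode = 2, False
    while True:
        candidates = generate_candidates(level)
        if not candidates:  # Ends when no candidate set is generated.
            return frequent
        # Once graph computation is judged cheaper, stay in graph mode.
        graph_mode = graph_mode or prefer_graph(k, len(candidates))
        level = []
        for cand in candidates:
            s = graph_support(edges, cand) if graph_mode else scan_support(records, cand)
            if s > threshold:
                frequent[cand] = s
                level.append(cand)
        k += 1

# Assumed records; switch to the graph method from k = 3 onward.
records = {"T1": {"I1", "I3", "I4"}, "T2": {"I2", "I3", "I5"},
           "T3": {"I1", "I2", "I3", "I5"}, "T4": {"I2", "I5"}}
hybrid = hybrid_mine(records, 1, lambda k, n: k >= 3)
scan_only = hybrid_mine(records, 1, lambda k, n: False)
```

In practice `prefer_graph` would wrap the time comparison described earlier; here it is a fixed rule so that the example stays deterministic.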
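The candidate set support calculation method based on graph structure data and the frequent itemset determination method using it, as provided by the embodiments of the present invention, avoid the excessive time spent obtaining candidate set support by scanning the database, improving the efficiency of the algorithm while preserving its results.
In addition, the pre-order tree-based frequent itemset determination method provided by the embodiments of the present invention models and analyzes the computation process of the pre-order tree-based Apriori algorithm and optimizes that algorithm by implementing it with the graph computation method, thereby improving the efficiency of the pre-order tree-based Apriori algorithm at greater depths.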
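According to a further aspect of the embodiments of the present invention, a candidate set support calculation apparatus based on graph structure data is provided, comprising a memory and a processor, the memory storing computer-executable instructions which, when executed by the controller, are operable to perform the candidate set support calculation method based on graph structure data as described above. The candidate set support calculation method based on graph structure data according to an embodiment of the present invention has been described above and is not repeated here, to avoid redundancy.
According to a further aspect of the embodiments of the present invention, a frequent itemset determination apparatus is provided, comprising a memory and a processor, the memory storing computer-executable instructions which, when executed by the controller, are operable to perform the following method: obtaining the support of a candidate set using the candidate set support calculation method based on graph structure data as described above; determining whether the support of the candidate set is greater than a predetermined threshold; and, where the support of the candidate set is greater than the predetermined threshold, determining that the candidate set is a frequent itemset. Likewise, the candidate set support calculation method based on graph structure data according to an embodiment of the present invention has been described above and is not repeated here, to avoid redundancy.
According to a further aspect of the embodiments of the present invention, a pre-order tree-based frequent itemset determination apparatus is provided, comprising a memory and a processor, the memory storing computer-executable instructions which, when executed by the controller, are operable to perform the pre-order tree-based frequent itemset determination method as described above. The pre-order tree-based frequent itemset determination method according to an embodiment of the present invention has been described above and is not repeated here, to avoid redundancy.
According to a further aspect of the embodiments of the present invention, a computer-readable storage medium is provided, having stored thereon computer-executable instructions which, when executed by a computing device, are operable to perform the candidate set support calculation method based on graph structure data as described above. The candidate set support calculation method based on graph structure data according to an embodiment of the present invention has been described above and is not repeated here, to avoid redundancy.
According to a further aspect of the embodiments of the present invention, a computer-readable storage medium is provided, having stored thereon computer-executable instructions which, when executed by a computing device, are operable to perform the following method: obtaining the support of a candidate set using the candidate set support calculation method based on graph structure data as described above; determining whether the support of the candidate set is greater than a predetermined threshold; and, where the support of the candidate set is greater than the predetermined threshold, determining that the candidate set is a frequent itemset. Likewise, the candidate set support calculation method based on graph structure data according to an embodiment of the present invention has been described above and is not repeated here, to avoid redundancy.
According to a further aspect of the embodiments of the present invention, a computer-readable storage medium is provided, having stored thereon computer-executable instructions which, when executed by a computing device, are operable to perform the pre-order tree-based frequent itemset determination method as described above. The pre-order tree-based frequent itemset determination method according to an embodiment of the present invention has been described above and is not repeated here, to avoid redundancy.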
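Those skilled in the art should understand that the embodiments of the present invention described above and shown in the drawings are given only as examples and do not limit the invention. The objects of the present invention have been fully and effectively achieved. The functional and structural principles of the invention have been shown and explained in the embodiments, and the embodiments of the invention may be varied or modified in any way without departing from those principles.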

Claims (15)

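  1. A candidate set support calculation method based on graph structure data, comprising:
    converting data in a database into graph structure data, wherein each vertex in the graph structure data represents one record in the database, and an edge between any two vertices represents the intersection of the element sets of the two records corresponding to the two vertices;
    obtaining a candidate set from the database;
    obtaining the connected component in the graph structure data that corresponds to the candidate set;
    determining the number of vertices contained in the connected component; and
    determining the number of vertices as the support of the candidate set.
  2. The candidate set support calculation method based on graph structure data according to claim 1, wherein the step of obtaining a candidate set from the database specifically comprises:
    obtaining k-candidate sets by a pre-order tree-based Apriori algorithm, where k is an integer greater than 1.
  3. The candidate set support calculation method based on graph structure data according to claim 1, wherein the step of obtaining a candidate set from the database specifically comprises:
    generating candidate sets with a hash table using the DHP method.
  4. The candidate set support calculation method based on graph structure data according to any one of claims 1 to 3, wherein the step of determining the number of vertices contained in the connected component specifically comprises:
    calculating the number of vertices in the connected component in a single iteration using a label propagation-based connected component algorithm.
  5. A frequent itemset determination method, comprising:
    obtaining the support of a candidate set using the candidate set support calculation method based on graph structure data according to any one of claims 1 to 4;
    determining whether the support of the candidate set is greater than a predetermined threshold; and
    where the support of the candidate set is greater than the predetermined threshold, determining that the candidate set is a frequent itemset.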
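  6. A pre-order tree-based frequent itemset determination method, comprising:
    determining a first time taken to recursively traverse a pre-order tree of depth k using a database;
    determining the number of k-candidate sets, and taking as a second time the number of k-candidate sets multiplied by the time taken to calculate one connected component on a graph using a graph computation method;
    comparing the first time with the second time;
    where the first time is less than the second time, determining the support of the candidate sets by recursive computation using the database; and
    where the first time is greater than the second time, obtaining the support of the candidate sets using the candidate set support calculation method based on graph structure data according to any one of claims 1 to 4.
  7. The pre-order tree-based frequent itemset determination method according to claim 6, further comprising:
    determining whether the obtained support of a candidate set is greater than a predetermined threshold; and
    where the obtained support of the candidate set is greater than the predetermined threshold, determining that the candidate set is a frequent itemset.
  8. The pre-order tree-based frequent itemset determination method according to claim 6 or 7, further comprising, before the first time is determined:
    obtaining k-candidate sets by a pre-order tree-based Apriori algorithm, where k is an integer greater than 1.
  9. The pre-order tree-based frequent itemset determination method according to claim 6 or 7, further comprising, before the first time is determined:
    generating candidate sets with a hash table using the DHP method.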
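  10. A candidate set support calculation apparatus based on graph structure data, comprising a memory and a processor, the memory storing computer-executable instructions which, when executed by the controller, are operable to perform the method according to any one of claims 1-4.
  11. A frequent itemset determination apparatus, comprising a memory and a processor, the memory storing computer-executable instructions which, when executed by the controller, are operable to perform the following method:
    obtaining the support of a candidate set using the candidate set support calculation method based on graph structure data according to any one of claims 1 to 4;
    determining whether the support of the candidate set is greater than a predetermined threshold; and
    where the support of the candidate set is greater than the predetermined threshold, determining that the candidate set is a frequent itemset.
  12. A pre-order tree-based frequent itemset determination apparatus, comprising a memory and a processor, the memory storing computer-executable instructions which, when executed by the controller, are operable to perform the method according to any one of claims 6-9.
  13. A computer-readable storage medium having stored thereon computer-executable instructions which, when executed by a computing device, are operable to perform the method according to any one of claims 1-4.
  14. A computer-readable storage medium having stored thereon computer-executable instructions which, when executed by a computing device, are operable to perform the following method:
    obtaining the support of a candidate set using the candidate set support calculation method based on graph structure data according to any one of claims 1 to 4;
    determining whether the support of the candidate set is greater than a predetermined threshold; and
    where the support of the candidate set is greater than the predetermined threshold, determining that the candidate set is a frequent itemset.
  15. A computer-readable storage medium having stored thereon computer-executable instructions which, when executed by a computing device, are operable to perform the method according to any one of claims 6-9.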
PCT/CN2017/096672 2017-08-09 2017-08-09 Method for computing support of itemset candidate based on graph structure data and application thereof WO2019028710A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
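CN201780094550.6A CN111316257A (zh) 2017-08-09 2017-08-09 Method for computing support of itemset candidate based on graph structure data and application thereof
PCT/CN2017/096672 WO2019028710A1 (zh) 2017-08-09 2017-08-09 Method for computing support of itemset candidate based on graph structure data and application thereof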
US16/718,305 US10776372B2 (en) 2017-08-09 2019-12-18 Method for computing support of itemset candidate based on graph structure data and application thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2017/096672 WO2019028710A1 (zh) 2017-08-09 2017-08-09 Method for computing support of itemset candidate based on graph structure data and application thereof

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/718,305 Continuation US10776372B2 (en) 2017-08-09 2019-12-18 Method for computing support of itemset candidate based on graph structure data and application thereof

Publications (1)

Publication Number Publication Date
WO2019028710A1 true WO2019028710A1 (zh) 2019-02-14

Family

ID=65273032

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/096672 WO2019028710A1 (zh) 2017-08-09 2017-08-09 Method for computing support of itemset candidate based on graph structure data and application thereof

Country Status (3)

Country Link
US (1) US10776372B2 (zh)
CN (1) CN111316257A (zh)
WO (1) WO2019028710A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117971921A (zh) * 2024-01-15 2024-05-03 兵器装备集团财务有限责任公司 Method and system for detecting abnormal customer operations based on the Apriori algorithm

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114881223B (zh) * 2020-12-18 2023-04-18 北京百度网讯科技有限公司 Deep learning model conversion method and apparatus, electronic device, and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110078189A1 (en) * 2009-09-30 2011-03-31 Francesco Bonchi Network graph evolution rule generation
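CN102096719A (zh) * 2011-02-18 2011-06-15 中国科学院计算技术研究所 Graph-based storage pattern mining method
CN102609528A (zh) * 2012-02-14 2012-07-25 云南大学 Frequent pattern associative classification method based on a probabilistic graphical model
CN103778151A (zh) * 2012-10-23 2014-05-07 阿里巴巴集团控股有限公司 Method and apparatus for identifying a feature group, and search method and apparatus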

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8655911B2 (en) * 2003-08-18 2014-02-18 Oracle International Corporation Expressing frequent itemset counting operations
US8631043B2 (en) * 2009-12-09 2014-01-14 Alcatel Lucent Method and apparatus for generating a shape graph from a binary trie
US9298693B2 (en) * 2011-12-16 2016-03-29 Microsoft Technology Licensing, Llc Rule-based generation of candidate string transformations
US10769426B2 (en) * 2015-09-30 2020-09-08 Microsoft Technology Licensing, Llc Inferring attributes of organizations using member graph
US9569729B1 (en) * 2016-07-20 2017-02-14 Chenope, Inc. Analytical system and method for assessing certain characteristics of organizations

Also Published As

Publication number Publication date
CN111316257A (zh) 2020-06-19
US20200125562A1 (en) 2020-04-23
US10776372B2 (en) 2020-09-15

Similar Documents

Publication Publication Date Title
US11604796B2 (en) Unified optimization of iterative analytical query processing
US10929294B2 (en) Using caching techniques to improve graph embedding performance
Jin et al. Efficient querying of large process model repositories
US10169059B2 (en) Analysis support method, analysis supporting device, and recording medium
US9576072B2 (en) Database calculation using parallel-computation in a directed acyclic graph
TWI730043B (zh) 關聯分析方法和裝置
US9262501B2 (en) Method, apparatus, and computer-readable medium for optimized data subsetting
JP6387399B2 (ja) データ操作のための、メモリ及びストレージ空間の管理
KR102195103B1 (ko) 프로그램 컴파일 방법
CN105989015B (zh) 一种数据库扩容方法和装置以及访问数据库的方法和装置
Lin et al. High-utility sequential pattern mining with multiple minimum utility thresholds
US11107187B2 (en) Graph upscaling method for preserving graph properties
CN106599122B (zh) 一种基于垂直分解的并行频繁闭序列挖掘方法
WO2019028710A1 (zh) 基于图结构数据的候选项集支持度计算方法及其应用
Kumar et al. Scalable performance tuning of hadoop mapreduce: a noisy gradient approach
US8392393B2 (en) Graph searching
Pan et al. Symbolic techniques in satisfiability solving
WO2016177027A1 (zh) 批量数据查询方法和装置
Niedermayer et al. Similarity search on uncertain spatio-temporal data
KR102517741B1 (ko) 최적해 도출 장치 및 최적해 도출 방법
JP6034240B2 (ja) 分析方法、分析装置および分析プログラム
JP2013127750A (ja) パーティション分割装置及び方法及びプログラム
JP6005583B2 (ja) 検索装置、検索方法および検索プログラム
WO2015045091A1 (ja) ベイジアンネットワークの構造学習におけるスーパーストラクチャ抽出のための方法及びプログラム
CN113536052B (zh) 一种基于k边连通分量在大型网络中搜索个性化影响力社区的方法

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 25/06/2020)

122 Ep: pct application non-entry in european phase

Ref document number: 17921267

Country of ref document: EP

Kind code of ref document: A1