CN114490835B - High-utility item set mining method and device, electronic equipment and medium - Google Patents


Publication number
CN114490835B
Authority
CN
China
Prior art keywords
item
items
utility
average
item set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210389910.8A
Other languages
Chinese (zh)
Other versions
CN114490835A (en)
Inventor
郭世明
陈国华
魏红强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhuhai Dahengqin Technology Development Co Ltd
Original Assignee
Zhuhai Dahengqin Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhuhai Dahengqin Technology Development Co Ltd
Priority to CN202210389910.8A
Publication of CN114490835A
Application granted
Publication of CN114490835B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Fuzzy Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The embodiment of the invention provides a high-utility item set mining method, a high-utility item set mining apparatus, electronic equipment and a readable storage medium, wherein the method comprises the following steps: acquiring an item set to be mined in an original database, and calculating an average utility upper bound and a minimum utility threshold lower bound of the item set by adopting a preset formula to obtain a corresponding calculation result; determining a pruning strategy for the item set search space according to the calculation result; and determining a high-utility item set from the item set according to the pruning strategy. The embodiment of the invention provides tighter upper bounds on the average utility of an item set and lower bounds on the minimum utility threshold of an item set, and establishes new pruning strategies based on these bounds, so that more item sets that cannot be high-utility item sets are pruned from the search space and the high-utility item sets can be mined quickly.

Description

High-utility item set mining method and device, electronic equipment and medium
Technical Field
The present invention relates to the field of data mining, and in particular, to a high utility item set mining method, a global tree structure construction method, an item set mining method, a high utility item set mining apparatus, an electronic device, and a computer-readable storage medium.
Background
Existing multi-utility threshold high-utility item set mining algorithms generally use a vertical data representation to store the information in the database. During mining, the vertical data representations of multi-item sets are created recursively from the vertical data representations of 1-item sets, which are constructed from the original database. The average utility of an item set and the upper bound of its average utility are calculated from the vertical data representation corresponding to the item set. These algorithms suffer from three disadvantages:
1. the upper bound of the average utility of an item set is calculated based on the horizontal data representation, and an upper bound obtained in this way is not sufficiently tight;
2. in the construction of the vertical data representation, the identical-transaction merging strategy is not applied (if two transactions contain the same items, the two transactions can be merged into one new transaction), so the sizes of the original database and the projection databases cannot be reduced;
3. the vertical data representation of a multi-item set is constructed by a join operation on the vertical data representations of two specific subsets of the item set, and the join operation is very time-consuming.
Disclosure of Invention
In view of the above problems, embodiments of the present invention are proposed to provide a high-utility item set mining method, a global tree structure construction method, an item set mining method and a corresponding high-utility item set mining apparatus, an electronic device, and a computer-readable storage medium that overcome or at least partially solve the above problems.
In order to solve the above problems, an embodiment of the present invention discloses a high-utility item set mining method, including:
acquiring an item set to be mined in an original database, and calculating an average utility upper bound and a minimum utility threshold lower bound of the item set by adopting a preset formula to obtain a corresponding calculation result; the calculation result comprises a first average utility upper bound, a second average utility upper bound, a third average utility upper bound, a fourth average utility upper bound, a first minimum utility threshold lower bound, a second minimum utility threshold lower bound and a third minimum utility threshold lower bound; the first and second average utility upper bounds are determined based on the average utility of the item set in the original database;
determining a pruning strategy for the item set search space according to the calculation result;
determining a high-utility item set from the item set according to the pruning strategy.
Optionally, the pruning strategy includes a first pruning strategy, a second pruning strategy, and a third pruning strategy, and the determining the high-utility item set from the item set according to the pruning strategy includes:
constructing a global tree structure corresponding to the original database by adopting the first pruning strategy;
determining an item set mining algorithm for calculating average utility using the second pruning strategy and the third pruning strategy;
performing item set mining based on the global tree structure and the item set mining algorithm to determine the high-utility item set from the item set.
Optionally,
the first average utility upper bound is calculated by the following formula:
$vaub_1(X) = \max\{u(i_{c1}, TS(X)), \ldots, u(i_{ch}, TS(X)),\ u(X, TS(X))/|X|\}$;
the second average utility upper bound is calculated by the following formula:
$vaub(X) = \max\{u(i_{cf}, TS(X)), \ldots, u(i_{ch}, TS(X)),\ u(X, TS(X))/|X|\}$;
the third average utility upper bound is calculated by the following formula:
$ivaub(X) = \max\{u(i_{cl}, PB(X)), \ldots, u(i_{ch}, PB(X)),\ u(X, PB(X))/|X|\}$;
the fourth average utility upper bound is calculated by the following formula:
$lvaub(X) = au(X) + \max\{u(i_{cl}, PB(X)), \ldots, u(i_{ch}, PB(X))\} \times MN / (|X| + MN)$.
Optionally,
the first minimum utility threshold lower bound is calculated by the following formula:
$matlb_1(X) = \min\{mau(X), mau(i_{c1}), mau(i_{c2}), \ldots, mau(i_{ch})\}$;
the second minimum utility threshold lower bound is calculated by the following formula:
$matlb(X) = \min\{mau(X), mau(i_{cf}), mau(i_{c2}), \ldots, mau(i_{ch})\}$;
the third minimum utility threshold lower bound is calculated by the following formula:
$imatlb(X) = \min\{mau(X), mau(i_{cl}), mau(i_{c2}), \ldots, mau(i_{ch})\}$.
Optionally, the first average utility upper bound has a superset inverse monotonic property; the second average utility upper bound has a bidirectional extension inverse monotonic property; the third average utility upper bound has an item set extension inverse monotonic property; the fourth average utility upper bound has a depth pruning condition.
Optionally, the first minimum utility threshold lower bound has a superset inverse monotonic property; the second minimum utility threshold lower bound has a bidirectional extension inverse monotonic property; the third minimum utility threshold lower bound has an item set extension inverse monotonic property.
Optionally, the determining, according to the calculation result, a pruning strategy for the item set search space includes:
in the first pruning strategy, determining the maximum of the first, second, third, and fourth average utility upper bounds, and determining the minimum of the first, second, and third minimum utility threshold lower bounds;
if the maximum value is less than the minimum value, removing the item set and the supersets of the item set from the original database.
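The comparison in the first pruning strategy can be sketched as follows. This is a minimal illustration only: the function name `prune_by_first_strategy` and the numeric values are hypothetical, and the bounds are assumed to have been computed elsewhere by the preset formulas.

```python
from typing import Sequence


def prune_by_first_strategy(upper_bounds: Sequence[float],
                            lower_bounds: Sequence[float]) -> bool:
    """Return True when the item set (and its supersets) may be removed.

    upper_bounds: the four average utility upper bounds of the item set.
    lower_bounds: the three minimum utility threshold lower bounds.
    Both sequences are assumed to be precomputed; this helper only
    performs the max-versus-min comparison described in the text.
    """
    return max(upper_bounds) < min(lower_bounds)


# Hypothetical values, chosen only to exercise both outcomes.
print(prune_by_first_strategy([10, 12, 8, 9], [15, 13, 14]))  # largest upper bound 12 < 13
print(prune_by_first_strategy([10, 12, 8, 9], [15, 11, 14]))  # 12 is not below 11
```

Comparing the largest upper bound against the smallest lower bound is the most conservative test: it succeeds only when every upper bound lies below every lower bound.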
Optionally, the determining a pruning strategy for the item set search space according to the calculation result includes:
in the second pruning strategy, if the third average utility upper bound is less than the third minimum utility threshold lower bound, or the fourth average utility upper bound is less than the third minimum utility threshold lower bound, deleting from the global tree structure the subtree structure generated by the item set extension of the item set.
Optionally, the determining a pruning strategy for the item set search space according to the calculation result includes:
in the third pruning strategy, determining potential extension items of the item set, and determining a transaction set comprising the potential extension items of the item set in a projection database of the item set;
in the transaction set, if the second average utility upper bound is less than the second minimum utility threshold lower bound, removing potential extension items of the item set from a potential extension item set of the item set.
The embodiment of the invention also discloses a global tree structure construction method, which is applied to the high-utility item set mining method, and the method comprises the following steps:
pruning the item set in the original database according to the first pruning strategy, and sorting the pruned item set to obtain corresponding total order information;
constructing a head table of the global tree structure by adopting the total order information;
constructing a prefix tree and a utility array of the global tree structure;
and constructing the global tree structure according to the prefix tree, the head table and the utility array.
The embodiment of the invention also discloses an item set mining method, which is applied to the high-utility item set mining method and comprises the following steps:
traversing the global tree structure, and under the condition that the second pruning strategy is met, performing item set pruning on the original database by adopting the second pruning strategy;
traversing a projection database corresponding to the global tree structure, and performing item set pruning on the projection database by adopting a third pruning strategy under the condition that the third pruning strategy is met;
and constructing a conditional global tree structure based on the pruned original database and the projection database, and mining the high-utility item set based on the conditional global tree structure.
The embodiment of the invention also discloses a high-utility item set mining apparatus, which comprises:
a calculation module, configured to acquire an item set to be mined in an original database, and calculate an average utility upper bound and a minimum utility threshold lower bound of the item set by adopting a preset formula to obtain a corresponding calculation result; the calculation result comprises a first average utility upper bound, a second average utility upper bound, a third average utility upper bound, a fourth average utility upper bound, a first minimum utility threshold lower bound, a second minimum utility threshold lower bound and a third minimum utility threshold lower bound; the first average utility upper bound and the second average utility upper bound are determined based on the average utility of the item set in the original database;
a first determining module, configured to determine a pruning strategy for the item set search space according to the calculation result;
a second determining module, configured to determine a high-utility item set from the item set according to the pruning strategy.
Optionally, the pruning strategy includes a first pruning strategy, a second pruning strategy and a third pruning strategy, and the second determining module includes:
the construction submodule is used for constructing a global tree structure corresponding to the original database by adopting the first pruning strategy;
a first determining submodule, configured to determine an item set mining algorithm for calculating an average utility by using the second pruning strategy and the third pruning strategy;
a mining submodule, configured to perform item set mining based on the global tree structure and the item set mining algorithm to determine the high-utility item set from the item set.
Optionally,
the first average utility upper bound is calculated by the following formula:
$vaub_1(X) = \max\{u(i_{c1}, TS(X)), \ldots, u(i_{ch}, TS(X)),\ u(X, TS(X))/|X|\}$;
the second average utility upper bound is calculated by the following formula:
$vaub(X) = \max\{u(i_{cf}, TS(X)), \ldots, u(i_{ch}, TS(X)),\ u(X, TS(X))/|X|\}$;
the third average utility upper bound is calculated by the following formula:
$ivaub(X) = \max\{u(i_{cl}, PB(X)), \ldots, u(i_{ch}, PB(X)),\ u(X, PB(X))/|X|\}$;
the fourth average utility upper bound is calculated by the following formula:
$lvaub(X) = au(X) + \max\{u(i_{cl}, PB(X)), \ldots, u(i_{ch}, PB(X))\} \times MN / (|X| + MN)$.
Optionally,
the first minimum utility threshold lower bound is calculated by the following formula:
$matlb_1(X) = \min\{mau(X), mau(i_{c1}), mau(i_{c2}), \ldots, mau(i_{ch})\}$;
the second minimum utility threshold lower bound is calculated by the following formula:
$matlb(X) = \min\{mau(X), mau(i_{cf}), mau(i_{c2}), \ldots, mau(i_{ch})\}$;
the third minimum utility threshold lower bound is calculated by the following formula:
$imatlb(X) = \min\{mau(X), mau(i_{cl}), mau(i_{c2}), \ldots, mau(i_{ch})\}$.
Optionally, the first average utility upper bound has a superset inverse monotonic property; the second average utility upper bound has a bidirectional extension inverse monotonic property; the third average utility upper bound has an item set extension inverse monotonic property; the fourth average utility upper bound has a depth pruning condition.
Optionally, the first minimum utility threshold lower bound has a superset inverse monotonic property; the second minimum utility threshold lower bound has a bidirectional extension inverse monotonic property; the third minimum utility threshold lower bound has an item set extension inverse monotonic property.
Optionally, the first determining module includes:
a second determining submodule, configured to determine, in the first pruning strategy, the maximum of the first average utility upper bound, the second average utility upper bound, the third average utility upper bound, and the fourth average utility upper bound, and the minimum of the first minimum utility threshold lower bound, the second minimum utility threshold lower bound, and the third minimum utility threshold lower bound;
a first removing submodule, configured to remove the item set and the supersets of the item set from the original database if the maximum value is less than the minimum value.
Optionally, the first determining module includes:
a deleting submodule, configured to, in the second pruning strategy, delete from the global tree structure the subtree structure generated by the item set extension of the item set if the third average utility upper bound is less than the third minimum utility threshold lower bound, or if the fourth average utility upper bound is less than the third minimum utility threshold lower bound.
Optionally, the first determining module includes:
a third determining sub-module, configured to determine, in the third pruning strategy, potential extension items of the item set, and determine a transaction set including the potential extension items of the item set in a projection database of the item set;
a second removing submodule, configured to, in the transaction set, remove the potential extension item of the item set from the potential extension item set of the item set if the second average utility upper bound is less than the second minimum utility threshold lower bound.
The embodiment of the invention also discloses an electronic device, which comprises: a processor, a memory and a computer program stored on the memory and capable of running on the processor, which when executed by the processor implements the steps of a high utility item set mining method as described above, or implements the steps of a global tree structure building method as described above, or implements the steps of an item set mining method as described above.
The embodiment of the invention also discloses a computer readable storage medium, wherein a computer program is stored on the computer readable storage medium, and when being executed by a processor, the computer program realizes the steps of the high-utility item set mining method, or realizes the steps of the global tree structure building method, or realizes the steps of the item set mining method.
The embodiment of the invention has the following advantages:
in the embodiment of the invention, four average utility upper bounds and three minimum utility threshold lower bounds of the item set can be calculated in a preset manner, a pruning strategy for pruning the search space is determined according to the calculation result, and the high-utility item set is mined from the item set according to the pruning strategy. The method provides tighter upper bounds on the average utility of an item set and lower bounds on the minimum utility threshold of an item set, and establishes new pruning strategies based on these bounds, so that more item sets that cannot be high-utility item sets are pruned from the search space and the high-utility item sets can be mined quickly.
Drawings
FIG. 1 is a schematic diagram of the search space generated by an item set I;
FIG. 2 is a flowchart illustrating the steps of a high utility item set mining method according to an embodiment of the present invention;
FIG. 3 is a flow chart of the steps of another high utility item set mining method of an embodiment of the present invention;
FIG. 4 is a schematic diagram of a global tree structure generated from tables 2, 3 and 4;
FIG. 5 is a schematic diagram of the b-conditional AUP-tree;
FIG. 6 is a schematic diagram of the ba-conditional AUP-tree;
FIG. 7 is a schematic diagram of the b-conditional AUP-tree after updating the head table entry a;
FIG. 8 is a schematic diagram of the b-conditional AUP-tree after updating the head table entry e;
FIG. 9 is a flowchart illustrating steps of a method for constructing a global tree structure according to an embodiment of the present invention;
FIG. 10 is a flowchart of the steps of a method of item set mining, in accordance with an embodiment of the present invention;
FIG. 11 is a block diagram of a high-utility item set mining apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments that can be derived by one of ordinary skill in the art from the embodiments given herein are intended to be within the scope of the present invention.
Frequent item set mining is a fundamental research topic in data analysis; its task is to find all item sets in a transaction database whose occurrence frequency is not less than a specified threshold.
The data model adopted by frequent item set mining makes the following assumptions:
1. elements in the database have the same importance (weight), e.g. frequent item set mining in the transaction database assumes that diamond and milk have the same profit per unit;
2. the number of occurrences of a database element in each transaction is only 0 or 1 (see table 1).
Therefore, frequent item set mining cannot meet the needs of enterprise managers who pursue maximum sales profit, because the sales profit of an enterprise is influenced by two factors: the unit profit (weight) of a good and the quantity of the good sold. To this end, researchers have posed the "high-utility item set mining" problem. In high-utility item set mining, each element in the database is associated with an external utility, representing the weight of the element in the database (e.g., the unit profit of the good (see Table 3)); each element is associated with an internal utility in a transaction, representing the quantity of the element in that transaction (e.g., the purchase quantity of the good (see Table 2)). The utility of an element in a transaction is defined as the product of its external utility and its internal utility, and the utility of an item set is defined as the sum of the utilities of the elements of the item set over all transactions containing the item set.
Although high-utility item set mining can serve the goal of enterprise managers pursuing maximum profit, in real life enterprise managers are generally not interested in item sets containing too many elements, because it is very difficult to make a sales strategy for too many products at the same time. In addition, a new item set formed by combining a high-utility item set with a low-utility item set may still be a high-utility item set. Introducing the length of the item set into the utility measure filters out combinations of high-utility and low-utility item sets and lets managers focus on combinations of high-utility item sets. To do this, researchers incorporate the length of an item set into the existing utility measure and take the "average utility" as the measure of an item set, i.e., the ratio of the utility of the item set to its length.
However, average utility item set mining employs a single minimum utility threshold as the evaluation criterion for all item sets, i.e., it implicitly assumes that all items in the database have similar utility values. In real life, the utilities of items differ. To obtain item sets that contain a particular element, one can only repeatedly lower the minimum utility threshold. This causes the algorithm to return too many item sets, requiring the user to filter the results a second time. Therefore, to account for the different utility values of the items in a database, researchers have posed the problem of "multi-utility threshold high-utility item set mining", in which a high utility threshold is assigned to high-utility items and a low utility threshold is assigned to low-utility items. An item set containing high-utility items then needs to satisfy the high utility threshold, and an item set containing low-utility items needs to satisfy the low utility threshold, so that high-utility items and low-utility items are treated differently.
The multi-utility threshold high-utility item set mining has many applications in real life, such as consumer shopping behavior analysis, web click stream analysis, engineering design, and the like.
TABLE 1 Sample database for frequent item set mining
[Table 1: image in the original publication]
TABLE 2 Sample database for high-utility item set mining
[Table 2: image in the original publication]
TABLE 3 External utility of items
[Table 3: image in the original publication]
Existing multi-utility threshold high-utility item set mining algorithms generally use a vertical data representation to store the information in the database. During mining, the vertical data representations of multi-item sets are created recursively from the vertical data representations of 1-item sets, which are constructed from the original database. The average utility of an item set and the upper bound of its average utility are calculated from the vertical data representation corresponding to the item set. These algorithms suffer from three disadvantages:
1. the upper bound of the average utility of an item set is calculated based on the horizontal data representation, and an upper bound obtained in this way is not sufficiently tight;
2. in the construction of the vertical data representation, the identical-transaction merging strategy is not applied (if two transactions contain the same items, the two transactions can be merged into one new transaction), so the sizes of the original database and the projection databases cannot be reduced;
3. the vertical data representation of a multi-item set is constructed by a join operation on the vertical data representations of two specific subsets of the item set, and the join operation is very time-consuming.
Mining multi-utility threshold high-utility item sets requires solving two problems:
The first problem is that the search space of item sets is huge. Assuming that the database contains m items, the search space contains $2^m - 1$ item sets, i.e., the number of item sets in the search space is exponential in the number of items in the database. If the utility value of every item set in the search space were calculated, the running time of the algorithm would be intolerable, because the number of item sets grows geometrically as the number of items increases. Therefore, effectively pruning the search space is the key to improving the performance of existing algorithms. In frequent item set mining, the occurrence frequency of an item set satisfies the downward closure property: 1) no superset of an infrequent item set can be a frequent item set; 2) every subset of a frequent item set is a frequent item set. Thus, the occurrence frequency of an item set can be used to prune the search space. That is, during the traversal of the search space, if an item set is found not to be frequent, its supersets can be ignored without losing any frequent item set. However, in multi-utility threshold high-utility item set mining, the utility of an item set does not satisfy the downward closure property, i.e., for any item set, the utility of a subset may be less than that of the item set, and the utility of a superset may be greater than that of the item set. Therefore, the supersets of an item set cannot be pruned from the search space based on the utility value of the item set alone, and how to prune the search space effectively becomes the first problem to be solved in multi-utility threshold high-utility item set mining.
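The two points in the preceding paragraph (the $2^m - 1$ search space and the failure of downward closure for utility) can be checked on a small example. The sketch below uses an invented two-item database; the data and the helper names `nonempty_itemsets`, `support` and `utility` are assumptions made for illustration and do not come from the tables of this document.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterator, List

# Hypothetical toy data, invented for illustration only.
EXTERNAL_UTILITY: Dict[str, int] = {"a": 3, "b": 5}
TRANSACTIONS: List[Dict[str, int]] = [{"a": 1, "b": 2}, {"a": 1}]


def nonempty_itemsets(items: List[str]) -> Iterator[FrozenSet[str]]:
    """Enumerate every non-empty item set over `items` (2^m - 1 of them)."""
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def support(itemset: FrozenSet[str]) -> int:
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in TRANSACTIONS if itemset.issubset(t))


def utility(itemset: FrozenSet[str]) -> int:
    """Sum of external * internal utility over transactions containing `itemset`."""
    return sum(EXTERNAL_UTILITY[i] * t[i]
               for t in TRANSACTIONS if itemset.issubset(t)
               for i in itemset)


a, ab = frozenset("a"), frozenset("ab")
print(len(list(nonempty_itemsets(list("abcd")))))  # 2**4 - 1 item sets
print(support(a), support(ab))   # support never grows when an item is added
print(utility(a), utility(ab))   # utility can grow when an item is added
```

In this toy database the support drops from 2 to 1 when b is added to {a}, while the utility rises from 6 to 13, which is why a utility value alone gives no safe pruning rule.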
The second problem is how to quickly calculate, during the traversal of the search space, the average utility of each item set and the upper bound of the average utility used for pruning the search space. In multi-utility threshold high-utility item set mining, a divide-and-conquer approach is typically employed to divide the original database into a set of different projection databases. The utility and the utility upper bound of each item set and of its extensions (the extensions of an item set are the item sets represented by the descendant nodes of the item set in the search space) are calculated only in the corresponding projection database. The process is implemented recursively, and it terminates when the calculated utility upper bound is less than the minimum utility threshold corresponding to the item set, or when the generated projection database is empty. In this process, two factors are key to the performance of the algorithm: the data structure used and the strategy for reducing the size of the projection database. The data structures adopted in previous research include tree structures, vertical data representations, hyperlink structures and the like. The strategy for reducing the size of the projection database is generally "identical-transaction merging": if two projected transactions contain the same items (the internal utilities of an item in the two transactions may differ), the two transactions are merged into one new transaction, in which the internal utility of each item is the sum of its internal utilities in the two original transactions. Merging does not change the average utility of any item set in the database, and it reduces the size of the projection database.
However, implementing identical-transaction merging with the above data structures incurs high time complexity. Therefore, designing an effective data structure that performs identical-transaction merging quickly, and thereby speeds up the calculation of the average utility of an item set and of its upper bound, is the key to solving the second problem of multi-utility threshold high-utility item set mining.
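The merging rule itself can be sketched in a few lines. This is a plain hash-based grouping written for illustration; it is not the data structure proposed in this document, and the function name `merge_identical_transactions` and the sample transactions are hypothetical.

```python
from typing import Dict, FrozenSet, List

Transaction = Dict[str, int]  # item -> internal utility (quantity)


def merge_identical_transactions(transactions: List[Transaction]) -> List[Transaction]:
    """Merge transactions that contain exactly the same items.

    Transactions are grouped by their set of items; within a group the
    internal utilities of each item are summed, as described in the text.
    """
    merged: Dict[FrozenSet[str], Transaction] = {}
    for transaction in transactions:
        key = frozenset(transaction)          # the item set, ignoring quantities
        bucket = merged.setdefault(key, {})
        for item, quantity in transaction.items():
            bucket[item] = bucket.get(item, 0) + quantity
    return list(merged.values())


# Hypothetical projected transactions: the first two contain the same items.
projected = [{"a": 1, "b": 2}, {"b": 3, "a": 4}, {"c": 1}]
print(merge_identical_transactions(projected))
```

Because the utility of an item set is a sum over transactions and the external utility of an item is fixed, summing the quantities of identical transactions leaves every item set utility unchanged while shrinking the database from three transactions to two in this example.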
The present invention is intended to provide a high-utility item set mining method and a corresponding high-utility item set mining apparatus, an electronic device, and a computer-readable storage medium that overcome the above-mentioned problems or at least partially solve the above-mentioned problems.
One of the core ideas of the embodiment of the invention is that four average utility upper bounds and three minimum utility threshold lower bounds of the item set can be calculated in a preset manner, a pruning strategy for pruning the search space is determined according to the calculation result, and the high-utility item set is mined from the item set according to the pruning strategy. The method provides tighter upper bounds on the average utility of an item set and lower bounds on the minimum utility threshold of an item set, and establishes new pruning strategies based on these bounds, so that more item sets that cannot be high-utility item sets are pruned from the search space and the high-utility item sets can be mined quickly.
In order to enable a person skilled in the art to better understand the present invention, the meanings of the relevant definitions are explained below:
Suppose that I = {i_1, i_2, ..., i_n} is a set of n distinct items, and that each item i_p (1 ≤ p ≤ n) is associated with a unit profit p(i_p), called its external utility. If an item set X = {i_{a1}, i_{a2}, ..., i_{ak}} consists of k distinct items, where i_{aj} ∈ I (1 ≤ j ≤ k, 1 ≤ a_j ≤ n), then k is called the length of X, and X is called a k-item set. A transaction T_d is a subset of I with a unique identifier d, called its Tid. Each item i_p in T_d (1 ≤ p ≤ v, where v is the length of T_d) is associated with a value q(i_p, T_d), called the internal utility of i_p in T_d, so T_d can be expressed as {(i_{b1}, q_1) (i_{b2}, q_2) … (i_{bv}, q_v)}. A transaction database DB = {T_1, T_2, ..., T_m} is a set of m transactions.
Definition 1 (utility of an item in a transaction): The utility u(i_p, T_d) of item i_p in transaction T_d is the product of the internal utility of i_p in T_d and the external utility of i_p, i.e., u(i_p, T_d) = p(i_p) × q(i_p, T_d).
Definition 2 (utility of an item set in a transaction): The utility u(X, T_d) of item set X in T_d is the sum of the utilities in T_d of the items of X, i.e.,
u(X, T_d) = Σ_{i_p ∈ X} u(i_p, T_d)
Definition 3 (utility of an item set in the database): The utility u(X) of item set X in the transaction database DB is the sum of the utilities of X in all transactions of DB that contain X, i.e.,
u(X) = Σ_{T_d ∈ DB ∧ X ⊆ T_d} u(X, T_d)
Definition 4 (average utility of an item set in the database): The average utility au(X) of item set X in the transaction database DB is the ratio of the utility of X in DB to its length, i.e., au(X) = u(X)/|X|.
For example, with the database of Table 2 and the external utilities of the items in Table 3, the utility of item b in T_2 is u(b, T_2) = 5 × 2 = 10; the utility of item set {bc} in T_2 is u({bc}, T_2) = u(b, T_2) + u(c, T_2) = 10 + 18 = 28; the utility of {bc} in DB is u({bc}) = u({bc}, T_2) + u({bc}, T_3) + u({bc}, T_5) = 28 + 12 + 26 = 66; and the average utility of {bc} in DB is au({bc}) = u({bc})/|{bc}| = 66/2 = 33.
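Definitions 1 to 4 can be sketched in a few lines of Python. The sketch below lists only the b and c entries of T_2, T_3 and T_5; these quantities and unit profits are inferred from the arithmetic of the worked examples in this text, not copied from Tables 2 and 3, and the function names are hypothetical.

```python
# External utilities (unit profits) and internal utilities (quantities).
# Assumption: values inferred from the worked examples, b and c only.
external = {"b": 2, "c": 3}
db = {
    "T2": {"b": 5, "c": 6},
    "T3": {"b": 3, "c": 2},
    "T5": {"b": 7, "c": 4},
}

def u_item(item, tid):
    """Definition 1: utility of an item in a transaction."""
    return external[item] * db[tid][item]

def u_itemset_in_tx(itemset, tid):
    """Definition 2: utility of an item set in a transaction."""
    return sum(u_item(i, tid) for i in itemset)

def u_itemset(itemset):
    """Definition 3: utility summed over all transactions containing X."""
    return sum(u_itemset_in_tx(itemset, tid)
               for tid, tx in db.items() if set(itemset) <= set(tx))

def au(itemset):
    """Definition 4: average utility = utility / length."""
    return u_itemset(itemset) / len(itemset)
```

With these inputs, `au("bc")` reproduces the value 33 obtained in the example above.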
Multi-utility-threshold high-utility item set mining requires a minimum utility threshold mau(i_p) to be specified for each item i_p in the database, as in Table 4.
Table 4: Minimum utility threshold specified for each item in the database
Item    a    b    c    d    e    f
mau     20   15   18   22   14   12
Definition 5 (minimum utility threshold of an item set): The minimum utility threshold of an item set is the arithmetic mean of the minimum utility thresholds of all items in the item set, i.e., mau(X) = [mau(i_{a1}) + mau(i_{a2}) + ... + mau(i_{ak})]/|X|. For example, in Table 4, mau(ab) = [mau(a) + mau(b)]/2 = (20 + 15)/2 = 17.5.
Definition 6 (high-utility item set): If the average utility of item set X in the database is not less than the minimum utility threshold of the item set (au(X) ≥ mau(X)), then X is called a high-utility item set.
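Definitions 5 and 6 amount to a mean and a comparison. A minimal sketch follows, assuming the item thresholds that the worked examples of this text use for Table 4; the names `mau_item`, `mau` and `is_high_utility` are hypothetical.

```python
# Item thresholds as used in the worked examples of this text (Table 4).
mau_item = {"a": 20, "b": 15, "c": 18, "d": 22, "e": 14, "f": 12}

def mau(itemset):
    """Definition 5: arithmetic mean of the thresholds of the items in X."""
    return sum(mau_item[i] for i in itemset) / len(itemset)

def is_high_utility(average_utility, itemset):
    """Definition 6: X is high-utility when au(X) >= mau(X)."""
    return average_utility >= mau(itemset)

# au(bc) = 33 is the value obtained in the earlier worked example.
bc_is_high_utility = is_high_utility(33, "bc")
```

Because every item set carries its own averaged threshold, the same average utility can qualify one item set and disqualify another.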
Given a transaction database DB and a table of item minimum utility thresholds (e.g., Table 4), multi-utility-threshold high-utility item set mining means finding all item sets in the database whose average utility is not less than their minimum utility threshold. Assume that the set of all items in the database is I = {i_1, i_2, ..., i_n}. Without loss of generality, the items of I are arranged in a total order ≺ (e.g., alphabetical order), i.e., i_1 ≺ i_2 ≺ ··· ≺ i_n. All item sets generated from the items of I can then be represented by a set-enumeration tree. For example, for I = {1, 2, 3, 4}, FIG. 1 shows the resulting search space; the tree structure in FIG. 1 represents all item sets generated from I. In the present invention, the items of an item set are assumed to be arranged in the total order, and an item set X = {i_{a1}, i_{a2}, ..., i_{ak}} is abbreviated as i_{a1} i_{a2} ... i_{ak}. Before the average utility upper bounds of item sets are presented, the following auxiliary concepts are described.
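A set-enumeration tree of this kind can be traversed depth-first with a short recursive generator. The sketch below is illustrative only; the function name `enumerate_itemsets` is hypothetical, and the input list is assumed to be sorted by the chosen total order.

```python
def enumerate_itemsets(items):
    """Yield every non-empty item set in depth-first set-enumeration order.

    `items` must already be sorted by the total order; each node is
    extended only with items ranked after its last item.
    """
    def expand(prefix, start):
        for idx in range(start, len(items)):
            node = prefix + (items[idx],)
            yield node
            # Children only use items that come later in the total order.
            yield from expand(node, idx + 1)
    yield from expand((), 0)

search_space = list(enumerate_itemsets([1, 2, 3, 4]))
```

For I = {1, 2, 3, 4} this visits 15 nodes, one per non-empty item set, which is the space that the pruning strategies aim to cut down.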
Definition 7 (potential extension item): If item i_p is ranked after every item of item set X in the total order, then i_p is called a potential extension item of X, and the set of all potential extension items of X is denoted PEIs(X). For example, PEIs(ab) = cdef.
Definition 8 (item set extension): For a non-empty item set X, a new item set formed from X and a subset of PEIs(X) is called an item set extension of X. For example, item sets ace and adf are item set extensions of item set a.
Definition 9 (item set extension of an item set with respect to a potential extension item): If item i_p is a potential extension item of X, the item set extensions of X that contain i_p are called the item set extensions of X with respect to i_p. For example, let X = ac and i_p = e; then the item set extensions of X with respect to i_p are acde, ace, acef and acdef, where acef is the front extension of ac with respect to e, acde is the back extension of ac with respect to e, and acdef is the bidirectional extension of ac with respect to e.
Definition 10 (projected transaction): If a transaction T_q contains item set X, the projected transaction of T_q under X, denoted T_q|X, is the set of items of T_q that belong to X or PEIs(X), i.e., T_q|X = {(i_p, q(i_p, T_q)) | i_p ∈ X ∪ PEIs(X)}. For example, let X = bc; in Table 2, T_3|bc = {(b, 10) (c, 5) (d, 3) (f, 2)}.
Definition 11 (projection database): The projection database PB(X) of the database under item set X is the set of all X-projected transactions in the database, i.e., PB(X) = {T_q|X | T_q ∈ DB ∧ X ⊆ T_q}.
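Definitions 7, 10 and 11 translate directly into code. The following sketch assumes an alphabetical total order over the items a to f. The transaction T3 mirrors the T_3|bc example above, while the quantity of item a and the whole of transaction T9 are made up for illustration; all function names are hypothetical.

```python
ORDER = "abcdef"  # total order, assumed alphabetical as in the examples

def peis(itemset):
    """Definition 7: items ranked after every item of X."""
    last = max(ORDER.index(i) for i in itemset)
    return ORDER[last + 1:]

def project_transaction(tx, itemset):
    """Definition 10: keep the items of X and PEIs(X); None if X is absent."""
    if not set(itemset) <= set(tx):
        return None
    keep = set(itemset) | set(peis(itemset))
    return {i: q for i, q in tx.items() if i in keep}

def projection_database(db, itemset):
    """Definition 11: all X-projected transactions of the database."""
    projected = (project_transaction(tx, itemset) for tx in db.values())
    return [t for t in projected if t is not None]

# Assumption: quantity of 'a' in T3 and the transaction T9 are invented.
toy_db = {
    "T3": {"a": 1, "b": 10, "c": 5, "d": 3, "f": 2},
    "T9": {"a": 2, "c": 4, "e": 1},
}
pb_bc = projection_database(toy_db, "bc")
```

Only T3 contains both b and c, so `pb_bc` holds a single projected transaction with item a dropped, since a precedes b in the total order.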
Referring to fig. 2, a flowchart illustrating steps of a high-utility item set mining method according to an embodiment of the present invention is shown, which may specifically include the following steps:
Step 201: acquire the item sets to be mined in an original database, and calculate the average utility upper bounds and the minimum utility threshold lower bounds of the item sets using preset formulas to obtain the corresponding calculation results.
The calculation results comprise a first average utility upper bound, a second average utility upper bound, a third average utility upper bound, a fourth average utility upper bound, a first minimum utility threshold lower bound, a second minimum utility threshold lower bound and a third minimum utility threshold lower bound; the first average utility upper bound and the second average utility upper bound are determined based on the average utility of the item set in the original database.
As can be seen from the definition of a high-utility item set above, if an upper bound on the average utility of item set X is less than a lower bound on mau(X), then X cannot be a high-utility item set.
The embodiment of the invention provides four item set average utility upper bounds: the first average utility upper bound (vaub_1), the second average utility upper bound (vaub), the third average utility upper bound (ivaub) and the fourth average utility upper bound (lvaub); and three item set minimum utility threshold lower bounds: the first minimum utility threshold lower bound (matlb_1), the second minimum utility threshold lower bound (matlb) and the third minimum utility threshold lower bound (imatlb). vaub_1 and vaub are calculated from the average utility of item set X in the database (u(X, TS(X))/|X|), while ivaub and lvaub are calculated on the projection database of the item set.
Step 202, determining a pruning strategy for the item set search space according to the calculation result.
In the embodiment of the invention, a new pruning strategy for pruning the search space can be determined from the four calculated item set average utility upper bounds and the three calculated item set minimum utility threshold lower bounds.
Step 203: determine the high-utility item sets from the item sets according to the pruning strategy.
After the pruning strategy is determined, it is used to mine the high-utility item sets from the item sets.
In summary, in the embodiment of the present invention, four average utility upper bounds and three minimum utility threshold lower bounds of an item set are calculated in a preset manner, a pruning strategy for pruning the search space is determined from the calculation results, and high-utility item sets are mined from the item sets according to the pruning strategy. The method provides tighter upper bounds on the average utility of item sets and lower bounds on the minimum utility thresholds of item sets, and establishes new pruning strategies from these bounds, so that more unpromising item sets in the search space can be pruned effectively and high-utility item sets can be mined quickly.
Referring to fig. 3, a flowchart illustrating steps of another high-utility item set mining method according to an embodiment of the present invention is shown, which specifically includes the following steps:
Step 301: acquire the item sets to be mined in an original database, and calculate the average utility upper bounds and the minimum utility threshold lower bounds of the item sets using preset formulas to obtain the corresponding calculation results.
The calculation results comprise a first average utility upper bound, a second average utility upper bound, a third average utility upper bound, a fourth average utility upper bound, a first minimum utility threshold lower bound, a second minimum utility threshold lower bound and a third minimum utility threshold lower bound; the first average utility upper bound and the second average utility upper bound are determined based on the average utility of the item set in the original database.
Specifically, the first average utility upper bound is calculated as follows:
vaub_1(X) = max{u(i_{c1}, TS(X)), ..., u(i_{ch}, TS(X)), u(X, TS(X))/|X|};
where X is an item set, X = i_{a1} i_{a2} ... i_{ak}, 1 ≤ k, 1 ≤ a_k; vaub_1(X) is the first average utility upper bound of X; TS(X) is the set of transactions in the database that contain X; the set of items appearing in TS(X) is denoted i_{c1} i_{c2} ... i_{ch} i_{a1} i_{a2} ... i_{ak}, 1 ≤ h, 1 ≤ c_h; |X| is the length of item set X; u(i_{c1}, TS(X)) is the utility of item i_{c1} in TS(X), and likewise for the other items; u(X, TS(X)) is the utility of item set X in TS(X).
The second average utility upper bound is calculated as follows:
vaub(X) = max{u(i_{cf}, TS(X)), ..., u(i_{ch}, TS(X)), u(X, TS(X))/|X|};
where X is an item set, X = i_{a1} i_{a2} ... i_{ak}, 1 ≤ k, 1 ≤ a_k; vaub(X) is the second average utility upper bound of X; TS(X) is the set of transactions in the database that contain X; the set of items appearing in TS(X) is denoted i_{c1} i_{c2} ... i_{ch} i_{a1} i_{a2} ... i_{ak}, 1 ≤ h, 1 ≤ c_h; item i_{cf} is the first item in i_{c1} i_{c2} ... i_{ch} that is ranked after i_{a1} in the total order; |X| is the length of item set X; u(i_{cf}, TS(X)) is the utility of item i_{cf} in TS(X), and likewise for the other items; u(X, TS(X)) is the utility of item set X in TS(X).
The third average utility upper bound is calculated as follows:
ivaub(X) = max{u(i_{cl}, PB(X)), ..., u(i_{ch}, PB(X)), u(X, PB(X))/|X|};
where X is an item set, X = i_{a1} i_{a2} ... i_{ak}, 1 ≤ k, 1 ≤ a_k; ivaub(X) is the third average utility upper bound of X; PB(X) is the set of all X-projected transactions in the database; the set of items appearing in PB(X) is denoted i_{a1} i_{a2} ... i_{ak} i_{cl} ... i_{ch}, 1 ≤ h, 1 ≤ c_h; the items i_{cj} (l ≤ j ≤ h, 1 ≤ l) are all the items ranked after i_{ak} in the total order; |X| is the length of item set X; u(i_{cl}, PB(X)) is the utility of item i_{cl} in PB(X), and likewise for the other items; u(X, PB(X)) is the utility of item set X in PB(X).
The fourth average utility upper bound is calculated as follows:
lvaub(X) = au(X) + max{u(i_{cl}, PB(X)), ..., u(i_{ch}, PB(X))} × MN / (|X| + MN);
where X is an item set, X = i_{a1} i_{a2} ... i_{ak}, 1 ≤ k, 1 ≤ a_k; lvaub(X) is the fourth average utility upper bound of X; au(X) is the average utility of X in the transaction database DB; PB(X) is the set of all X-projected transactions in the database; the set of items appearing in PB(X) is denoted i_{a1} i_{a2} ... i_{ak} i_{cl} ... i_{ch}, 1 ≤ h, 1 ≤ c_h; the items i_{cj} (l ≤ j ≤ h, 1 ≤ l) are all the items ranked after i_{ak} in the total order; |X| is the length of item set X; u(i_{cl}, PB(X)) is the utility of item i_{cl} in PB(X), and likewise for the other items; MN is the maximum number of potential extension items of X in a transaction of the projection database PB(X).
The first minimum utility threshold lower bound is calculated as follows:
matlb_1(X) = min{mau(X), mau(i_{c1}), mau(i_{c2}), ..., mau(i_{ch})};
where X is an item set, X = i_{a1} i_{a2} ... i_{al}, 1 ≤ l, 1 ≤ a_l; matlb_1(X) is the first minimum utility threshold lower bound of X; TS(X) is the set of transactions in the database that contain X; the set of items appearing in TS(X) is denoted i_{a1} i_{a2} ... i_{al} i_{c1} i_{c2} ... i_{ch}, 1 ≤ h, 1 ≤ c_h; mau(i_{c1}) is the minimum utility threshold of item i_{c1}, and likewise for the other items; mau(X) is the minimum utility threshold of X.
The second minimum utility threshold lower bound is calculated as follows:
matlb(X) = min{mau(X), mau(i_{cf}), ..., mau(i_{ch})};
where X is an item set, X = i_{a1} i_{a2} ... i_{al}, 1 ≤ l, 1 ≤ a_l; matlb(X) is the second minimum utility threshold lower bound of X; TS(X) is the set of transactions in the database that contain X; the set of items appearing in TS(X) is denoted i_{a1} i_{a2} ... i_{al} i_{c1} i_{c2} ... i_{ch}, 1 ≤ h, 1 ≤ c_h; item i_{cf} is the first item in i_{c1} i_{c2} ... i_{ch} that is ranked after i_{a1} in the total order; mau(i_{cf}) is the minimum utility threshold of item i_{cf}, and likewise for the other items; mau(X) is the minimum utility threshold of X.
The third minimum utility threshold lower bound is calculated as follows:
imatlb(X) = min{mau(X), mau(i_{cl}), ..., mau(i_{ch})};
where X is an item set, X = i_{a1} i_{a2} ... i_{al}, 1 ≤ l, 1 ≤ a_l; imatlb(X) is the third minimum utility threshold lower bound of X; TS(X) is the set of transactions in the database that contain X; the set of items appearing in TS(X) is denoted i_{a1} i_{a2} ... i_{al} i_{c1} i_{c2} ... i_{ch}, 1 ≤ h, 1 ≤ c_h; item i_{cl} is the first item in i_{c1} i_{c2} ... i_{ch} that is ranked after i_{al} in the total order; mau(i_{cl}) is the minimum utility threshold of item i_{cl}, and likewise for the other items; mau(X) is the minimum utility threshold of X.
Definition 12 (the existing average utility upper bounds aub_1 and aub, and the average utility upper bounds vaub_1 and vaub adopted by the invention): For an item set X, the set of transactions in the database that contain X is denoted TS(X), and the set of items appearing in TS(X) is denoted i_{c1} i_{c2} ... i_{ch} i_{a1} i_{a2} ... i_{ak}, where X = i_{a1} i_{a2} ... i_{ak}. The existing average utility upper bounds aub_1(X) and aub(X) are defined by formulas 1 and 2, respectively.
aub_1(X) = max{u(i_{a1}, TS(X)), ..., u(i_{ak}, TS(X)), u(i_{c1}, TS(X)), ..., u(i_{ch}, TS(X))} (1)
aub(X) = max{u(i_{a1}, TS(X)), ..., u(i_{ak}, TS(X)), u(i_{cf}, TS(X)), ..., u(i_{ch}, TS(X))} (2)
where item i_{cf} is the first item in i_{c1} i_{c2} ... i_{ch} that is ranked after i_{a1} in the total order.
In the embodiment of the present invention, the first average utility upper bound vaub_1 and the second average utility upper bound vaub are defined by formulas 3 and 4.
vaub_1(X) = max{u(i_{c1}, TS(X)), ..., u(i_{ch}, TS(X)), u(X, TS(X))/|X|} (3)
vaub(X) = max{u(i_{cf}, TS(X)), ..., u(i_{ch}, TS(X)), u(X, TS(X))/|X|} (4)
As can be seen from formulas 1 to 4, the first and second average utility upper bounds of the invention use the average utility of item set X (u(X, TS(X))/|X|) in place of the utilities in TS(X) of the individual items of X.
Definition 13 (the existing average utility upper bounds iaub and laub, and the average utility upper bound ivaub adopted by the invention): For an item set X, suppose the set of items appearing in PB(X) is denoted i_{a1} i_{a2} ... i_{ak} i_{cl} ... i_{ch}, where the items i_{cj} (l ≤ j ≤ h) are ranked after i_{ak} in the total order. The existing average utility upper bounds iaub(X) and laub(X) are defined by formulas 5 and 6, respectively.
iaub(X) = max{u(i_{cl}, PB(X)), ..., u(i_{ch}, PB(X)), u(i_{a1}, PB(X)), ..., u(i_{ak}, PB(X))} (5)
laub(X) = au(X) + max{u(i_{cl}, PB(X)), ..., u(i_{ch}, PB(X))} (6)
In the embodiment of the present invention, the third average utility upper bound ivaub is defined by formula 7.
ivaub(X) = max{u(i_{cl}, PB(X)), ..., u(i_{ch}, PB(X)), u(X, PB(X))/|X|} (7)
Definition 14 (maximum number of potential extension items of an item set in a transaction, and the average utility upper bound lvaub adopted by the invention): The number of potential extension items of item set X in a projected transaction T_q|X is denoted N(PEIs(X), T_q|X), and the maximum number of potential extension items of X in the projection database PB(X) is MN = max{N(PEIs(X), T_q|X) | T_q|X ∈ PB(X)}. The fourth average utility upper bound lvaub adopted by the invention is defined by formula 8.
lvaub(X) = au(X) + max{u(i_{cl}, PB(X)), ..., u(i_{ch}, PB(X))} × MN / (|X| + MN) (8)
For example, let X = cd; in Table 2, TS(cd) = {T_2, T_3, T_4, T_5}. Therefore, aub_1(cd) = max{u(a, TS(cd)), u(b, TS(cd)), u(e, TS(cd)), u(f, TS(cd)), u(c, TS(cd)), u(d, TS(cd))} = max{14, 30, 15, 10, 51, 20} = 51 and aub(cd) = max{u(e, TS(cd)), u(f, TS(cd)), u(c, TS(cd)), u(d, TS(cd))} = max{15, 10, 51, 20} = 51. Meanwhile, vaub_1(cd) = max{u(a, TS(cd)), u(b, TS(cd)), u(e, TS(cd)), u(f, TS(cd)), u(cd, TS(cd))/|cd|} = max{14, 30, 15, 10, (51 + 20)/2} = 35.5 and vaub(cd) = max{u(e, TS(cd)), u(f, TS(cd)), u(cd, TS(cd))/|cd|} = max{15, 10, 35.5} = 35.5. It can thus be seen that vaub_1(X) ≤ aub_1(X) and vaub(X) ≤ aub(X).
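Formulas 1 to 4 can be evaluated side by side for X = cd. The sketch below takes the per-item utilities within TS(cd) quoted in the example above as given inputs and assumes an alphabetical total order; the function name `bounds_on_ts` and the dictionary keys are hypothetical.

```python
ORDER = "abcdef"  # total order, assumed alphabetical
# Utilities of each item within TS(cd), as quoted in the example.
u_ts = {"a": 14, "b": 30, "c": 51, "d": 20, "e": 15, "f": 10}

def bounds_on_ts(x, u_ts):
    """Evaluate formulas 1-4 from the per-item utilities in TS(X)."""
    others = [i for i in u_ts if i not in x]                      # i_c1 .. i_ch
    first = min(ORDER.index(i) for i in x)                        # rank of i_a1
    after_first = [i for i in others if ORDER.index(i) > first]   # i_cf .. i_ch
    avg_x = sum(u_ts[i] for i in x) / len(x)                      # u(X, TS(X))/|X|
    return {
        "aub1": max(u_ts[i] for i in list(x) + others),
        "aub": max(u_ts[i] for i in list(x) + after_first),
        "vaub1": max([u_ts[i] for i in others] + [avg_x]),
        "vaub": max([u_ts[i] for i in after_first] + [avg_x]),
    }

result = bounds_on_ts("cd", u_ts)
```

The only difference between each existing bound and its counterpart is that the individual utilities of c and d are replaced by their average, which is what lowers 51 to 35.5 here.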
Likewise, let X = ac; in Table 2, TS(ac) = {T_1, T_3, T_5}. Therefore, iaub(ac) = max{u(d, TS(ac)), u(e, TS(ac)), u(f, TS(ac)), u(a, TS(ac)), u(c, TS(ac))} = max{8, 15, 10, 21, 48} = 48 and laub(ac) = au(ac) + max{u(d, TS(ac)), u(e, TS(ac)), u(f, TS(ac))} = (21 + 48)/2 + max{8, 15, 10} = 34.5 + 15 = 49.5. Meanwhile, ivaub(ac) = max{u(d, TS(ac)), u(e, TS(ac)), u(f, TS(ac)), u(ac, TS(ac))/|ac|} = max{8, 15, 10, 34.5} = 34.5. Since N(PEIs(ac), T_1) = 1, N(PEIs(ac), T_3) = 2 and N(PEIs(ac), T_5) = 1, MN = 2, and lvaub(ac) = au(ac, TS(ac)) + max{u(d, TS(ac)), u(e, TS(ac)), u(f, TS(ac))} × MN / (|ac| + MN) = 34.5 + 15 × 2/4 = 42. Therefore, ivaub(X) ≤ iaub(X) and lvaub(X) ≤ laub(X).
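Formulas 5 to 8 can be checked the same way for X = ac. The sketch below takes the utilities and the per-transaction counts of potential extension items quoted in the example as inputs; the function name `projected_bounds` and its parameter names are hypothetical.

```python
def projected_bounds(u_x_items, u_ext_items, pei_counts):
    """Evaluate formulas 5-8.

    u_x_items:   utilities of the items of X
    u_ext_items: utilities of the potential extension items of X
    pei_counts:  N(PEIs(X), T_q|X) for each projected transaction
    """
    k = len(u_x_items)
    au_x = sum(u_x_items.values()) / k
    best_ext = max(u_ext_items.values())
    mn = max(pei_counts)
    return {
        "iaub": max(best_ext, max(u_x_items.values())),
        "laub": au_x + best_ext,
        "ivaub": max(best_ext, au_x),
        "lvaub": au_x + best_ext * mn / (k + mn),
    }

# Inputs quoted in the X = ac example.
r = projected_bounds({"a": 21, "c": 48}, {"d": 8, "e": 15, "f": 10}, [1, 2, 1])
```

The factor MN / (|X| + MN) is what separates lvaub from laub: the best extension utility is scaled down to 15 × 2/4 = 7.5 instead of being added in full.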
Furthermore, in previous studies, the lower bound of the minimum utility threshold of an item set was generally calculated from the minimum utility thresholds of the individual items in X and PEIs(X), without considering X as a whole.
Definition 15 (item set minimum utility threshold lower bounds matlb_1, matlb and imatlb): For a non-empty item set X, let the items appearing in TS(X) be denoted i_{a1} i_{a2} ... i_{al} i_{c1} i_{c2} ... i_{ch}. The first minimum utility threshold lower bound matlb_1 is defined as:
matlb_1(X) = min{mau(X), mau(i_{c1}), mau(i_{c2}), ..., mau(i_{ch})}
Suppose that, in the item set i_{c1} i_{c2} ... i_{ch}, the first items ranked after i_{a1} and after i_{al} in the total order are i_{cf} and i_{cl}, respectively. The second minimum utility threshold lower bound matlb and the third minimum utility threshold lower bound imatlb are then defined as:
matlb(X) = min{mau(X), mau(i_{cf}), ..., mau(i_{ch})}
imatlb(X) = min{mau(X), mau(i_{cl}), ..., mau(i_{ch})}
For example, let X = bd; then the items appearing in TS(bd) are a, c, e, f, b and d. According to Table 4, mau(bd) = [mau(b) + mau(d)]/2 = (15 + 22)/2 = 18.5, so matlb_1(bd) = min{mau(bd), mau(a), mau(c), mau(e), mau(f)} = min{18.5, 20, 18, 14, 12} = 12. In addition, matlb(bd) = min{mau(bd), mau(c), mau(e), mau(f)} = min{18.5, 18, 14, 12} = 12 and imatlb(bd) = min{mau(bd), mau(e), mau(f)} = min{18.5, 14, 12} = 12.
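The three lower bounds of Definition 15 differ only in which co-occurring items are allowed into the minimum. A minimal sketch follows, assuming an alphabetical total order and the item thresholds used in the worked examples of this text; the function name `threshold_lower_bounds` is hypothetical.

```python
ORDER = "abcdef"  # total order, assumed alphabetical
mau_item = {"a": 20, "b": 15, "c": 18, "d": 22, "e": 14, "f": 12}

def threshold_lower_bounds(x, items_in_ts):
    """Return (matlb_1, matlb, imatlb) for item set x.

    items_in_ts lists every item that appears in TS(X), including X itself.
    """
    rank = ORDER.index
    mau_x = sum(mau_item[i] for i in x) / len(x)
    others = [i for i in items_in_ts if i not in x]
    first, last = min(map(rank, x)), max(map(rank, x))
    after_first = [i for i in others if rank(i) > first]   # i_cf .. i_ch
    after_last = [i for i in others if rank(i) > last]     # i_cl .. i_ch

    def low(items):
        return min([mau_x] + [mau_item[i] for i in items])

    return low(others), low(after_first), low(after_last)

m1, m, im = threshold_lower_bounds("bd", "abcdef")
```

For X = bd all three bounds equal 12, because item f has the smallest threshold and is ranked after d. The bounds separate when low-threshold items precede X in the total order, since fewer items then enter the later minima.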
In an alternative embodiment of the invention, the first average utility upper bound has the superset anti-monotonic property; the second average utility upper bound has the bidirectional extension anti-monotonic property; the third average utility upper bound has the item set extension anti-monotonic property; and the fourth average utility upper bound satisfies the deep pruning condition.
There are two ways of evaluating an upper bound on the average utility of item sets. The first is to compare the values of two upper bounds. For example, let ub_1 and ub_2 be two average utility upper bounds; if, for any non-empty item set X, ub_1(X) is always numerically less than ub_2(X), then ub_1 is said to be tighter than ub_2. The second is to determine which anti-monotonic property the upper bound possesses.
Definition 16 (anti-monotonic properties of an item set average utility upper bound): Let C, D, E, R and X be item sets, and suppose ub is an upper bound on the average utility of an item set X, i.e., au(X) ≤ ub(X). Then ub is said to satisfy:
(1) the superset anti-monotonic property AM(ub), if and only if ∀ C ⊇ X, ub(C) ≤ ub(X) holds;
(2) the bidirectional extension anti-monotonic property BiDEAM(ub), if and only if ∀ X = R ∪ D, C = R ∪ E, R ≠ Ø, D ⊆ E, ub(C) ≤ ub(X) holds;
(3) the item set extension anti-monotonic property IEAM(ub), if and only if ub(E) ≤ ub(X) holds for any item set extension E of X;
(4) the deep pruning condition DPC(ub), if ∀ C = R ∪ E, R ≠ Ø, au(C) ≤ ub(R) holds.
The four properties are related by AM(ub) ⇒ BiDEAM(ub) ⇒ IEAM(ub) ⇒ DPC(ub). The properties possessed by vaub_1, vaub, ivaub and lvaub adopted by the invention are as follows (the proofs are not detailed here):
(1) AM(vaub_1), i.e., vaub_1 has the superset anti-monotonic property;
(2) BiDEAM(vaub), i.e., vaub has the bidirectional extension anti-monotonic property;
(3) IEAM(ivaub), i.e., ivaub has the item set extension anti-monotonic property;
(4) DPC(lvaub), i.e., lvaub satisfies the deep pruning condition;
(5) vaub_1, vaub and ivaub are progressively closer to the average utility of the item set, i.e., for any non-empty item set X, au(X) ≤ ivaub(X) ≤ vaub(X) ≤ vaub_1(X).
In an alternative embodiment of the invention, the first minimum utility threshold lower bound has the superset anti-monotonic property; the second minimum utility threshold lower bound has the bidirectional extension anti-monotonic property; and the third minimum utility threshold lower bound has the item set extension anti-monotonic property.
Likewise, there are two ways of evaluating a lower bound on the minimum utility threshold of item sets: 1. comparing it numerically with existing item set minimum utility threshold lower bounds; 2. determining which anti-monotonic property the lower bound possesses.
Definition 17 (anti-monotonic properties of an item set minimum utility threshold lower bound): Let C, D, E, R and X be item sets, and suppose lb is a lower bound on the minimum utility threshold of an item set X, i.e., lb(X) ≤ mau(X). Then lb is said to satisfy:
(1) the superset anti-monotonic property AM(lb), if and only if ∀ C ⊇ X, lb(C) ≥ lb(X) holds;
(2) the bidirectional extension anti-monotonic property BiDEAM(lb), if and only if ∀ X = R ∪ D, C = R ∪ E, R ≠ Ø, D ⊆ E, lb(C) ≥ lb(X) holds;
(3) the item set extension anti-monotonic property IEAM(lb), if and only if lb(E) ≥ lb(X) holds for any item set extension E of X.
The anti-monotonic properties possessed by matlb_1, matlb and imatlb adopted by the invention are as follows (the proofs are not detailed here):
(1) AM(matlb_1), i.e., matlb_1 has the superset anti-monotonic property;
(2) BiDEAM(matlb), i.e., matlb has the bidirectional extension anti-monotonic property;
(3) IEAM(imatlb), i.e., imatlb has the item set extension anti-monotonic property;
(4) matlb_1, matlb and imatlb are progressively closer to the minimum utility threshold of the item set, i.e., for any non-empty item set X, mau(X) ≥ imatlb(X) ≥ matlb(X) ≥ matlb_1(X).
Illustratively, the item set average utility upper bounds and minimum utility threshold lower bounds adopted by the present invention are compared below with other existing upper and lower bounds.
Definition 18 (item set average utility upper bounds rtub and eubr): For a non-empty item set X, the maximum item utility miu(T_q|X) in a projected transaction T_q|X of X is defined as miu(T_q|X) = max{u(i_p, T_q|X) | i_p ∈ T_q|X}, and the item set average utility upper bound rtub(X) is defined by formula 9.
[formula image] (9)
Meanwhile, in a projected transaction T_q|X, the remaining maximum item utility of X, remu(X, T_q|X), is defined as remu(X, T_q|X) = max{u(i_p, T_q|X) | i_p ∈ T_q|X ∧ i_p ∈ PEIs(X)}. The utility upper bound of X in T_q|X, eubr(X, T_q|X), is defined as eubr(X, T_q|X) = u(X, T_q|X)/|X| + remu(X, T_q|X)/(|X| + 1), and the item set average utility upper bound eubr(X) is defined by formula 10.
[formula image] (10)
The relationship between vaub_1, vaub, ivaub and lvaub adopted by the invention and the existing average utility upper bounds aub_1, aub, iaub, laub, rtub and eubr is as follows:
(1) vaub_1, vaub, ivaub and lvaub are less than aub_1, aub, iaub and laub, respectively;
(2) rtub has the item set extension anti-monotonic property (IEAM(rtub)), and ivaub is tighter than rtub;
(3) eubr does not satisfy the deep pruning condition.
Definition 19 (item set minimum utility threshold lower bounds smau and LMAU): For a non-empty item set X = i_{a1} i_{a2} ... i_{ak}, let PEIs(X) = {i_{cl} i_{cl+1} ... i_{ch}}; then smau(X) is defined as smau(X) = min{mau(i_{cl}), ..., mau(i_{ch}), mau(i_{a1}), ..., mau(i_{ak})}.
For example, in Table 4, let X = ac; then smau(ac) = min{mau(a), mau(c), mau(d), mau(e), mau(f)} = min{20, 18, 22, 14, 12} = 12. The smallest of the minimum utility thresholds specified for the items of the database, LMAU, is defined as LMAU = min{mau(i_1), mau(i_2), ..., mau(i_m)}.
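The two existing lower bounds of Definition 19 can be sketched as follows, assuming an alphabetical total order and the item thresholds used in the worked examples of this text; the function name `smau` mirrors the notation above, and the dictionary `mau_item` is a hypothetical container for Table 4.

```python
ORDER = "abcdef"  # total order, assumed alphabetical
mau_item = {"a": 20, "b": 15, "c": 18, "d": 22, "e": 14, "f": 12}

def smau(x):
    """Minimum item threshold over X and its potential extension items."""
    last = max(ORDER.index(i) for i in x)
    extension_items = ORDER[last + 1:]          # PEIs(X)
    return min(mau_item[i] for i in list(x) + list(extension_items))

# LMAU: the smallest item threshold in the whole database.
LMAU = min(mau_item.values())
smau_ac = smau("ac")
```

Both bounds look only at individual item thresholds, whereas matlb_1, matlb and imatlb also take the averaged threshold mau(X) of the item set as a whole into account.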
The relationship between matlb_1, matlb and imatlb adopted by the invention and the existing item set minimum utility threshold lower bounds smau and LMAU is as follows:
(1) LMAU has the superset anti-monotonic property (AM(LMAU)), and matlb_1 is tighter than LMAU;
(2) smau has the item set extension anti-monotonic property (IEAM(smau)), and imatlb is tighter than smau.
Step 302, according to the calculation result, determining a pruning strategy for the item set search space.
In the embodiment of the present invention, the search space pruning strategies are designed on the basis of the first average utility upper bound, the second average utility upper bound, the third average utility upper bound, the fourth average utility upper bound, the first minimum utility threshold lower bound, the second minimum utility threshold lower bound and the third minimum utility threshold lower bound of the invention.
In an alternative embodiment of the invention, the pruning strategies include a first pruning strategy, a second pruning strategy and a third pruning strategy.
With respect to step 302, the following steps may be performed:
Sub-step S11: in the first pruning strategy, determine the maximum of the first average utility upper bound, the second average utility upper bound, the third average utility upper bound and the fourth average utility upper bound, and determine the minimum of the first minimum utility threshold lower bound, the second minimum utility threshold lower bound and the third minimum utility threshold lower bound.
Sub-step S12: if the maximum is less than the minimum, remove the item set and the supersets of the item set from the original database.
Pruning strategy 1 (single item pruning in the original database): According to the above analysis of the anti-monotonic properties of vaub_1, vaub, ivaub and lvaub, vaub_1 has the largest value among the four item set average utility upper bounds provided by the invention. Likewise, according to the above analysis of the anti-monotonic properties of matlb_1, matlb and imatlb, matlb_1 has the smallest value among the three minimum utility threshold lower bounds provided by the invention. Thus, if the vaub_1 value of item set X is less than its matlb_1 value, X and its supersets can be pruned from the search space, since for any superset S of X, au(S) ≤ vaub_1(S) ≤ vaub_1(X) < matlb_1(X) ≤ matlb_1(S) ≤ mau(S).
In the embodiment of the present invention, vaub_1 and matlb_1 are used to prune items in the original database, i.e., if an item i_p in DB satisfies vaub_1(i_p) < matlb_1(i_p), then i_p can be removed from DB.
For example, the items appearing in Tables 2 and 3 are a, b, c, d, e and f. Taking item a as an example, TS(a) = {T_1, T_3, T_5}, and the utilities of all items appearing in TS(a) are shown in the first row of Table 5. From Table 5 it follows that vaub_1(a) = 48, vaub_1(b) = 36, vaub_1(c) = 81, vaub_1(d) = 51, vaub_1(e) = 63 and vaub_1(f) = 10. Meanwhile, matlb_1(a) = matlb_1(b) = matlb_1(c) = matlb_1(d) = 12, matlb_1(e) = 14 and matlb_1(f) = 12. Since vaub_1(f) < matlb_1(f), item f can be removed from Table 2.
Table 5: vaub_1 value of each item in the original database
[table image]
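Pruning strategy 1 reduces to one comparison per item followed by a filter over the database. The sketch below takes the vaub_1 and matlb_1 values quoted in the example as given inputs; the function names and the tiny sample database passed to `remove_items` are hypothetical.

```python
# Bound values quoted in the example above.
vaub1 = {"a": 48, "b": 36, "c": 81, "d": 51, "e": 63, "f": 10}
matlb1 = {"a": 12, "b": 12, "c": 12, "d": 12, "e": 14, "f": 12}

def surviving_items(vaub1, matlb1):
    """Keep an item unless vaub_1(i) < matlb_1(i)."""
    return [i for i in vaub1 if vaub1[i] >= matlb1[i]]

def remove_items(db, keep):
    """Drop pruned items from every transaction of the database."""
    keep = set(keep)
    return {tid: {i: q for i, q in tx.items() if i in keep}
            for tid, tx in db.items()}

kept = surviving_items(vaub1, matlb1)
# Hypothetical single-transaction database, for illustration only.
pruned_db = remove_items({"T3": {"b": 3, "c": 2, "f": 1}}, kept)
```

Only item f fails the test, so it disappears from every transaction before the search over item sets begins.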
With respect to step 302, the following steps may be performed:
Sub-step S21: in the second pruning strategy, if the third average-utility upper bound is smaller than the third minimum-utility-threshold lower bound, or the fourth average-utility upper bound is smaller than the third minimum-utility-threshold lower bound, the subtree generated by the item set extensions of the item set is deleted from the global tree structure.
Pruning strategy 2 (subtree pruning). For a non-empty item set X, all item set extensions of X appear in the search space in the subtree rooted at the node corresponding to X. If ivaub(X) < imatlb(X) or lvaub(X) < imatlb(X), the subtree can be pruned. Note that because IEAM(ivaub) holds, ivaub(X) < imatlb(X) implies that au(X) is also less than imatlb(X). However, if only lvaub(X) < imatlb(X) holds, au(X) still has to be computed to decide whether X is a high average-utility item set.
For example, assume the total order b ≺ a ≺ d ≺ e ≺ c. Let X = bde, and let N be the node representing item set bde in the search space. Since PB(bde) consists of T_2|bde = {(b, 5) (d, 2) (e, 3) (c, 6)}, ivaub(bde) = max{au(bde, T_2|bde), u(c, T_2|bde)} = max{[5 × 2 + 2 × 4 + 3 × 3]/3, 6 × 3} = max{9, 18} = 18, and lvaub(bde) = au(bde, T_2|bde) + u(c, T_2|bde) × 1/(3 + 1) = 13.5. In addition, imatlb(bde) = min{mau(bde), mau(c)} = min{(15 + 14 + 22)/3, 18} = min{17, 18} = 17. Therefore lvaub(bde) < imatlb(bde) holds, and the descendant nodes of N, i.e. item set bdec, can be pruned. Note that because ivaub(bde) > imatlb(bde), au(bde) still has to be calculated.
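The arithmetic of this example can be checked with a short Python sketch. The helper name subtree_bounds and its parameters are hypothetical, and the lvaub expression used here, au(X) + (largest extension-item utility) × MN/(|X| + MN), is only inferred from the worked values in this description (13.5 for bde, 57 for b); it may not match the formal definition given elsewhere in the document.

```python
from fractions import Fraction

# Minimum utility thresholds of the items (values quoted in the text).
MAU = {"a": 20, "b": 15, "c": 18, "d": 22, "e": 14}


def subtree_bounds(itemset, projected, ext_items, max_ext):
    """Return (au, ivaub, lvaub, imatlb) for `itemset`.

    projected: list of {item: utility} dicts, one per projected transaction.
    ext_items: potential extension items occurring in `projected`.
    max_ext:   largest number of extension items in one projected transaction (MN).
    """
    k = len(itemset)
    au = Fraction(sum(t[i] for t in projected for i in itemset), k)
    top = max((sum(t.get(e, 0) for t in projected) for e in ext_items), default=0)
    ivaub = max(au, top)
    lvaub = au + Fraction(top * max_ext, k + max_ext)  # inferred form, see lead-in
    mau_x = Fraction(sum(MAU[i] for i in itemset), k)
    imatlb = min([mau_x] + [MAU[e] for e in ext_items])
    return au, ivaub, lvaub, imatlb


# PB(bde) = {T2|bde}: utilities b = 5*2, d = 2*4, e = 3*3, c = 6*3
au, ivaub, lvaub, imatlb = subtree_bounds(
    "bde", [{"b": 10, "d": 8, "e": 9, "c": 18}], "c", 1
)
prune_subtree = ivaub < imatlb or lvaub < imatlb   # pruning strategy 2
must_compute_au = not (ivaub < imatlb)             # au(X) needed unless ivaub prunes
print(float(au), ivaub, float(lvaub), imatlb, prune_subtree, must_compute_au)
```

Exact fractions are used so that comparisons between bounds are not affected by floating-point rounding.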
With respect to step 302, the following steps may be performed:
Sub-step S31: in the third pruning strategy, the potential extension items of the item set are determined, and the set of transactions in the projection database of the item set that contain a potential extension item is determined.
Sub-step S32: within that transaction set, if the second average-utility upper bound is less than the second minimum-utility-threshold lower bound, the potential extension item is removed from the set of potential extension items of the item set.
Pruning strategy 3 (potential-extension-item pruning). For a non-empty item set X and a potential extension item i_p of X, the set of all transactions in PB(X) that contain i_p is denoted TS(i_p)|X. Within TS(i_p)|X, if vaub(Xi_p) < matlb(Xi_p), then i_p can be removed from PEIs(X); that is, the forward extension, the backward extension and the bidirectional extension of X by i_p can all be eliminated from the search space.
For example, for item set ba, PB(ba) consists of T_3|ba = {(b, 3) (a, 1) (d, 1) (c, 2)} and T_5|ba = {(b, 7) (a, 1) (d, 1) (c, 4)}, where item f has been removed from T_3|ba according to pruning strategy 1. For the potential extension item c of ba, TS(c)|ba = {T_3|ba, T_5|ba}. Therefore vaub(bac) = max{au(bac, TS(c)|ba), u(d, TS(c)|ba)} = max{(20 + 14 + 18)/3, 8} = 17.33. Meanwhile, matlb(bac) = min{mau(bac), mau(d)} = min{(15 + 20 + 18)/3, 22} = 17.67. Therefore vaub(bac) < matlb(bac), and item c can be removed from the potential extension items of item set ba.
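The check performed by pruning strategy 3 on this example can be sketched in Python as follows. The function name extension_bounds is hypothetical, and the definitions of vaub and matlb used here (the larger of the extended item set's average utility and the utility total of any other item in TS(i_p)|X, and the smaller of the extended item set's threshold and the threshold of any other item in TS(i_p)|X) are read off the worked numbers rather than taken from a formal definition.

```python
from fractions import Fraction

MAU = {"a": 20, "b": 15, "c": 18, "d": 22, "e": 14}


def extension_bounds(x, ip, projected):
    """Return (vaub, matlb) of the item set x + ip over TS(ip)|x."""
    new = list(x) + [ip]
    ts = [t for t in projected if ip in t]              # TS(ip)|x
    au = Fraction(sum(t[i] for t in ts for i in new), len(new))
    others = sorted({j for t in ts for j in t} - set(new))
    vaub = max([au] + [sum(t.get(j, 0) for t in ts) for j in others])
    mau_new = Fraction(sum(MAU[i] for i in new), len(new))
    matlb = min([mau_new] + [MAU[j] for j in others])
    return vaub, matlb


# PB(ba) as item utilities (quantity x unit utility); f already removed.
PB_ba = [
    {"b": 6, "a": 7, "d": 4, "c": 6},      # T3|ba
    {"b": 14, "a": 7, "d": 4, "c": 12},    # T5|ba
]

kept = []
for item in "dc":
    vaub, matlb = extension_bounds("ba", item, PB_ba)
    if vaub >= matlb:                       # otherwise pruning strategy 3 removes it
        kept.append(item)
print(kept)
```

With these inputs c is removed (52/3 < 53/3) while d is kept, because its two bounds are equal and the pruning condition is a strict inequality.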
Step 303, constructing a global tree structure corresponding to the original database by using the first pruning strategy.
The global tree structure is composed of a prefix tree, a head table and a utility array.
In the embodiment of the invention, an AUP-tree constructed from the original database is provided.
Definition 20 (AUP-tree). An AUP-tree consists of three parts: a prefix tree, a head table and a utility array. Each non-root node in the prefix tree includes the following fields: an item tag, a parent node pointer, child node pointers, a link to another node with the same item tag, and a link to a record of the utility array.
The head table consists of a set of entries, each containing five fields: the item tag, ivaub, lvaub, pei_lb, and a link to the first node in the prefix tree that carries the item tag. Suppose the AUP-tree is constructed from PB(X); if X is empty, the AUP-tree is constructed from the original database. For each item tag i_p in the head table, the ivaub and lvaub fields store the values of ivaub(Xi_p) and lvaub(Xi_p) respectively, and pei_lb records the minimum of the minimum utility thresholds of the potential extension items of item set Xi_p.
The utility array is a set of records, each of which stores the utilities of all items on one path of the prefix tree. Because of the identical-transaction merging strategy, one path in the prefix tree may correspond to several projected transactions in the projection database.
The AUP-tree is constructed with two database scans. The AUP-tree constructed from the original database is called the global tree, and the algorithm flow for constructing the global tree structure is as follows:
Input: transaction database DB, external utility table, item minimum utility threshold table
Output: a global AUP-tree
01: scan DB once; for each item i_p
02:   compute vaub_1(i_p), storing the item utilities in a two-dimensional array
03:   over the items occurring in TS(i_p), compute matlb_1(i_p)
04: remove from the item list every item whose vaub_1 value is less than its matlb_1 value (pruning strategy 1)
05: sort the remaining items in ascending order of vaub_1 and record the result as the current total order O
06: insert the items of the current total order, in reverse order, into the item tag fields of the AUP-tree head table Header
07: create the root node R of the prefix tree Prefix-tree and the utility array Array of the AUP-tree
08: for each item i_p in Header, scan the two-dimensional array built in line 02
09:   compute ivaub(i_p) according to the current total order O
10:   store the minimum of the minimum utility thresholds of the potential extension items of i_p in the pei_lb field
11: scan DB a second time; for each transaction T_q
12:   remove the items that are not in the current total order O
13:   arrange the remaining items in the reverse order of O and denote the new transaction T'_q
14:   call Insert_trans(T'_q, R, Array, Header, Prefix-tree) to insert T'_q into the AUP-tree
15:   record the number of potential extension items of each item in T'_q
16: for each item i_p in Header, compute MN and lvaub(i_p)
Insert_trans(Trans, R, Array, Header, Prefix-tree)
01: represent Trans as [p | P], where p is the first item and P is the list of the remaining items
02: if the root node R of Prefix-tree has no child node N whose item tag equals p, then
03:   create a new node N as a child of R, with p as the item tag of the new node
04:   initialize the utility array pointer of N to null
05:   find the entry E with item tag p in Header and link the new node into the node queue starting from E
06: if P is empty
07:   compute the utility of each item of Trans in Trans
08:   if the link from N to the utility array is null
09:     allocate a record Record in Array and store the item utilities of Trans in it
10:     establish the link from N to the record Record
11:   else
12:     apply the identical-transaction merging strategy: accumulate the item utilities of Trans into the record Record linked from N
13: else
14:   call Insert_trans(P, N, Array, Header, Prefix-tree)
First, the database is scanned once. For each item i_p in the database, the utilities of the items occurring in TS(i_p) are computed and stored in a two-dimensional array, from which the vaub_1(i_p) value is obtained. At the same time, matlb_1(i_p) is computed over the items in TS(i_p) (lines 1-3 of the global tree construction flow). If the vaub_1 value of a single item is less than its matlb_1 value, the item can be safely removed from the database according to pruning strategy 1 (line 4). The set of all items whose vaub_1 value is less than their matlb_1 value is recorded as SET. The remaining items are sorted in ascending order of vaub_1, and the resulting order is recorded as the current total order O (line 5). The head table Header of the AUP-tree is constructed by inserting the items of the current total order, in reverse order, into the item tag field of each entry of Header (line 6). The root node R of the prefix tree Prefix-tree and the utility array Array of the AUP-tree are created (line 7). For the item i_p in each entry of Header, the two-dimensional array built in line 2 is scanned according to the current total order, and the ivaub(i_p) value is computed and saved in the ivaub field of the entry. At the same time, the item minimum utility threshold table is scanned to compute, for each item in Header, the minimum of the minimum utility thresholds of its potential extension items, which is stored in the pei_lb field of the corresponding entry (lines 8-10).
In the second database scan, the items in SET are removed from each transaction, the remaining items are arranged in the reverse order of the current total order, and the resulting revised transaction is inserted into the global AUP-tree by calling the Insert_trans() procedure (lines 11-14). Meanwhile, the number of potential extension items of each item in the revised transaction is recorded (line 15). After the second database scan, the maximum number of potential extension items and the lvaub(i_p) value are computed for each item in Header, and the lvaub(i_p) value is stored in the lvaub field of the corresponding entry (line 16).
In the Insert_trans() procedure, the revised transaction Trans is inserted into the AUP-tree by prefix sharing. First, Trans is divided into its first item p and the collection P of the remaining items (line 1 of Insert_trans()). Then, for the first item p, the root node R of Prefix-tree is checked for a child node with item tag p. If there is none, a node N with item tag p is added to Prefix-tree as a child of R, and the link from N to the utility array Array is initialized to null (lines 2-4). The entry E with item tag p is found in the head table Header, and node N is added to the node queue starting from E (line 5). If p is the last item of Trans, the utility of each item of Trans is computed, ready to be stored in Array according to the link from N to Array (lines 6-7). If the link from N to Array is null, a record Record is allocated in Array to hold the item utilities of Trans, and the link from N to Record is established (lines 8-10). If N already links to a record in Array, Trans contains the same items as an earlier revised transaction, so the item utilities of Trans are accumulated into the corresponding item utilities of the record linked from N (lines 11-12). If p is not the last item of Trans, Insert_trans() is called again to insert the remaining items of Trans (lines 13-14).
For example, constructing the global AUP-tree from Tables 2 and 3 first requires a scan of Tables 2 and 3. For the items a, b, c, d, e and f in the database, the utilities of the items in TS(a), TS(b), TS(c), TS(d), TS(e) and TS(f) are computed respectively, as shown in Table 5. The vaub_1 values of items a, b, c, d, e and f are 48, 36, 81, 51, 63 and 10. Meanwhile, matlb_1(a) = min{mau(a), mau(b), mau(c), mau(d), mau(e), mau(f)} = min{20, 15, 18, 22, 14, 12} = 12, matlb_1(b) = matlb_1(c) = matlb_1(d) = 12, matlb_1(e) = 14 and matlb_1(f) = 12. Since vaub_1(f) < matlb_1(f), item f can be safely removed from the database according to pruning strategy 1. Sorting the remaining items in ascending order of vaub_1 gives the current total order b ≺ a ≺ d ≺ e ≺ c. The items of the current total order are added to the head table of the global AUP-tree in reverse order, i.e. c ≻ e ≻ d ≻ a ≻ b. The ivaub and pei_lb values are then computed for each item in the head table. For example, for item d the extension items according to the current total order are e and c, so in the transaction set TS(d), ivaub(d) = max{51, 15, 20} = 51. Meanwhile, pei_lb(d) = min{mau(c), mau(e)} = min{18, 14} = 14. The second database scan yields the revised transactions shown in Table 6. The Insert_trans() procedure is called to insert the revised transactions into the global AUP-tree. For the first revised transaction {(c, 10) (e, 5) (a, 1)}, a node with item tag c is first added to the prefix tree Prefix-tree as a child of the root node and appended to the node queue starting from the head table entry with item tag c; at the same time, the link from this node to the utility array is initialized to null. Nodes M and N with item tags e and a are then created in the same way.
Since item a is the last item of the first revised transaction and the link from N to the utility array is null, the utilities {30, 15, 7} of items c, e and a in T'_1 are computed and stored in the first record of Array, and a link from node N to the first record is established. The second revised transaction {(c, 6) (e, 3) (d, 2) (b, 5)} shares the nodes with item tags c and e on the current path of Prefix-tree; a node P with item tag d has to be added as a child of M, and a node Q with item tag b as a child of P. Since item b is the last item of T'_2 and the link from Q to the utility array is null, the utilities {18, 9, 8, 10} of items c, e, d and b in T'_2 are computed and stored in the second record of Array, and a link from node Q to that record is established. T'_3, T'_4 and T'_5 are processed in a similar way, and the resulting global AUP-tree is shown in Fig. 4. Note that because T'_3 and T'_5 contain the same items, the two revised transactions are merged into one transaction, and the size of the original database is reduced from 5 to 4. In addition, in the specific implementation of the invention, to save storage space, the items in the utility array only indicate the correspondence and need not be stored, so they are treated as transparent.
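The insertion and merging behaviour described above can be sketched in Python as follows. This is an illustrative sketch under assumptions, not the patented implementation: the class Node and the function insert_trans are hypothetical names, the head table is reduced to a dictionary from item tag to node queue, the descent is written iteratively rather than recursively, and the revised transactions are given directly as item utilities reconstructed from the worked numbers in this description.

```python
class Node:
    def __init__(self, item=None, parent=None):
        self.item = item          # item tag (None for the root)
        self.parent = parent      # parent node pointer
        self.children = {}        # item tag -> child Node
        self.record = None        # index of the linked utility-array record


def insert_trans(trans, root, array, header):
    """Insert a revised transaction given as [(item, utility), ...]."""
    node = root
    for item, _ in trans:
        child = node.children.get(item)
        if child is None:                       # no shared prefix: add a node
            child = Node(item, node)
            node.children[item] = child
            header.setdefault(item, []).append(child)   # node queue of the entry
        node = child
    utils = [u for _, u in trans]
    if node.record is None:                     # first transaction ending here
        array.append(utils)
        node.record = len(array) - 1
    else:                                       # identical-transaction merging
        merged = array[node.record]
        for k, u in enumerate(utils):
            merged[k] += u


# Revised transactions T'1..T'5 as utilities, items in the order c, e, d, a, b.
REVISED = [
    [("c", 30), ("e", 15), ("a", 7)],
    [("c", 18), ("e", 9), ("d", 8), ("b", 10)],
    [("c", 6), ("d", 4), ("a", 7), ("b", 6)],
    [("c", 15), ("e", 6), ("d", 4)],
    [("c", 12), ("d", 4), ("a", 7), ("b", 14)],
]

root, array, header = Node(), [], {}
for t in REVISED:
    insert_trans(t, root, array, header)
print(len(array), array)
```

Because T'_3 and T'_5 end at the same node, the second one is accumulated into the existing record, so five transactions yield four utility-array records.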
Table 6: Revised transaction set
[Table 6 appears as an image in the original publication]
Table 7: Two-dimensional array for condition item set b
[Table 7 appears as an image in the original publication]
In summary, a first pruning strategy is applied in the process of constructing the global tree structure corresponding to the original database.
Step 304, determining, by using the second pruning strategy and the third pruning strategy, an item set mining algorithm for calculating average utility.
In the embodiment of the invention, a new item set mining algorithm, namely AUPGrowth algorithm, is provided, and a second pruning strategy and a third pruning strategy are applied to the AUPGrowth algorithm.
The AUPGrowth algorithm traverses the head table of the global AUP-tree in a bottom-up sequence, and the specific algorithm flow is as follows:
Input: a global AUP-tree
Output: the set of high average-utility item sets HAUIS
01: let HAUIS = ∅
02: for each entry E (with item tag i_p) in the AUP-tree head table
03:   first compute the lower bound imatlb(i_p) of mau(i_p) from pei_lb in E;
04:   if ivaub(i_p) < imatlb(i_p) in E
05:     call Change_node_util_pointer(i_p);
06:     continue;
07:   else
08:     traverse the node queue starting from E and compute the average utility of item set {i_p}
09:     if au({i_p}) ≥ mau({i_p})
10:       HAUIS ← HAUIS ∪ {i_p}
11:   if lvaub(i_p) < imatlb(i_p) in E
12:     call Change_node_util_pointer(i_p);
13:     continue;
14:   traverse the node queue starting from E; for each node P
15:     traverse the path from P to the root node
16:     denote the set of transactions in PB({i_p}) that contain i_q as TS(i_q)|i_p
17:     compute the utilities of the items occurring in TS(i_q)|i_p and store them in a two-dimensional array
18:   scan the two-dimensional array and compute the vaub(i_p i_q) and matlb(i_p i_q) values
19:   if vaub(i_p i_q) < matlb(i_p i_q)
20:     add i_q to the set SET'
21:   sort the remaining items in ascending order of vaub as the current total order O'
22:   if the number of items in O' is not 0
23:     insert the items of the current total order, in reverse order, into the item tag fields of the head table Header' of the i_p-conditional AUP-tree
24:     create the root node R' of the prefix tree and the utility array Array' of the conditional AUP-tree
25:     for each item i_q in Header', scan the two-dimensional array created in line 17
26:       compute ivaub(i_p i_q) according to the current total order and store it in the ivaub field of the entry
27:       record in the pei_lb field the minimum of the minimum utility thresholds of the potential extension items of i_q
28:     scan PB({i_p}) a second time; for each projected transaction T_q|i_p
29:       remove the items in SET'
30:       arrange the remaining items according to the current total order O'
31:       call Insert_trans(T'_q|i_p, R', Array', Header', Prefix-tree')
32:       record the number of potential extension items of each item in T'_q|i_p
33:     for each item i_q in Header', compute MN and lvaub(i_p i_q)
34:     call Algorithm 2 AUPGrowth(conditional AUP-tree);
35:   Change_node_util_pointer(i_p);
Change_node_util_pointer(i_p)
01: find the entry E with item tag i_p in the head table
02: traverse the node queue starting from E; for each node N
03:   if the link from its parent node M to the utility array is null
04:     set the link from M to the utility array to the utility array record linked from N
05:   else
06:     accumulate the item utilities of the utility array record linked from N into the corresponding items of the utility array record linked from M
In the AUPGrowth mining algorithm, the set of high average-utility item sets HAUIS is first set to empty (line 1 of the AUPGrowth flow). Then, for the item i_p in each entry E of the head table, the lower bound imatlb(i_p) = min{pei_lb, mau(i_p)} of mau(i_p) is computed from the value in the pei_lb field of E. If the ivaub value in E is less than imatlb(i_p), i.e. ivaub(i_p) < imatlb(i_p), then according to pruning strategy 2 neither item set {i_p} nor any of its item set extensions can be a high average-utility item set. The Change_node_util_pointer() procedure is therefore called, so that the average utility of item sets containing the next entry of the head table can be computed (lines 2-6). If the condition of line 4 does not hold, the node queue starting from E is traversed and the average utility of item set {i_p} is computed from the link of each node to the utility array. If the average utility of {i_p} is not less than the minimum utility threshold of {i_p}, then {i_p} is a high average-utility item set (lines 7-10). If the lvaub value in E is less than imatlb(i_p), i.e. lvaub(i_p) < imatlb(i_p), then according to pruning strategy 2 none of the item set extensions of {i_p} can be a high average-utility item set, and the Change_node_util_pointer() procedure is called (lines 11-13). If neither the condition of line 4 nor that of line 11 holds, the projection database of i_p is traversed; for each item i_q occurring in PB(i_p), the set of transactions containing i_q is denoted TS(i_q)|i_p. The utilities of the items occurring in TS(i_q)|i_p are computed and stored in a two-dimensional array. After the traversal of the projection database of i_p is complete, the vaub and matlb values of each item i_q are computed (lines 14-18). If vaub(i_p i_q) is less than matlb(i_p i_q), then according to pruning strategy 3, i_q can be removed from PEIs(i_p) (lines 19-20).
The remaining items of PEIs(i_p) are sorted in ascending order of vaub, and the resulting order is recorded as the current total order O'. If the number of items in O' is not 0, the i_p-conditional AUP-tree has to be constructed (lines 21-22). First, the items of the current total order are inserted in reverse order into the item tag fields of the head table of the i_p-conditional AUP-tree, and the root node R' of the prefix tree and the utility array Array' of the conditional AUP-tree are created (lines 23-24). Then, for each item in the head table, the two-dimensional array built in line 17 is scanned, and the ivaub value and the minimum of the minimum utility thresholds of all potential extension items are computed according to the current total order O' (lines 25-27). Finally, PB(i_p) is scanned once more; for each revised transaction, the Insert_trans() procedure is called to insert it into the conditional AUP-tree, and the number of potential extension items of each item in the revised transaction is recorded (lines 28-32). Note that during insertion the utility of the condition item set in the revised transaction must be kept in the utility array (see the example below). After all revised transactions have been inserted into the conditional AUP-tree, lvaub(i_p i_q) is computed for each item i_q in the head table and stored in the lvaub field of the entry for i_q (line 33). The algorithm is then called recursively on the conditional AUP-tree (line 34). When the current entry of the head table has been processed, the Change_node_util_pointer() procedure is called (line 35).
The Change_node_util_pointer() procedure passes the utility array link of each node in the node queue starting from the current head table entry up to that node's parent. For each node N in the node queue, if the parent node's link to the utility array is null, the parent's link is set to the current node's link to the utility array (lines 1-4 of Change_node_util_pointer()); otherwise, the item utilities on the path from the parent node to the root node are already stored in a record Record of the utility array, so the item utilities on the path from N to the root node are accumulated into the corresponding items of Record (lines 5-6 of Change_node_util_pointer()).
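The effect of handing utility records up the tree can be illustrated with a small Python sketch. It is a simplification under stated assumptions: records are dictionaries keyed by item rather than positional array entries, a node hands its parent a copy of its record without its own item instead of re-linking the same record, and the names Node, insert and change_node_util_pointer as written here are hypothetical. The revised transactions are utilities reconstructed from the worked numbers in this description.

```python
class Node:
    def __init__(self, item=None, parent=None):
        self.item, self.parent = item, parent
        self.children = {}
        self.record = None        # {item: utility} for the path root..node


def insert(trans, root, header):
    """Insert one revised transaction given as an ordered {item: utility} dict."""
    node = root
    for item in trans:
        if item not in node.children:
            node.children[item] = Node(item, node)
            header.setdefault(item, []).append(node.children[item])
        node = node.children[item]
    if node.record is None:
        node.record = dict(trans)
    else:                                   # identical-transaction merging
        for item, u in trans.items():
            node.record[item] += u


def change_node_util_pointer(item, header):
    """Pass the record of every node tagged `item` up to its parent."""
    for n in header[item]:
        m = n.parent
        if n.record is None or m.item is None:
            continue                        # nothing to pass, or parent is the root
        up = {j: u for j, u in n.record.items() if j != n.item}
        if m.record is None:
            m.record = up                   # parent had no record yet
        else:
            for j, u in up.items():         # accumulate along the shared path
                m.record[j] += u


REVISED = [
    {"c": 30, "e": 15, "a": 7},
    {"c": 18, "e": 9, "d": 8, "b": 10},
    {"c": 6, "d": 4, "a": 7, "b": 6},
    {"c": 15, "e": 6, "d": 4},
    {"c": 12, "d": 4, "a": 7, "b": 14},
]
root, header = Node(), {}
for t in REVISED:
    insert(t, root, header)

totals = {}
for item in ["b", "a", "d", "e", "c"]:      # bottom-up head-table traversal
    totals[item] = sum(n.record[item] for n in header[item])
    change_node_util_pointer(item, header)
print(totals)
```

After each item has been processed, every node of the next head table entry carries a record covering all transactions that pass through it, which is why the utility of each single item can be read directly from its node queue.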
For example, in Fig. 4 the head table is traversed from the entry with item tag b to the entry with item tag c. For the entry with item tag b, imatlb(b) = min{mau(b), pei_lb(b)} = 14 is computed first. Since ivaub(b) = 36 > imatlb(b), au(b) = 30 > mau(b) = 15 is computed. Thus item set b is a high average-utility item set and is saved to HAUIS. Meanwhile, lvaub(b) = 57 > imatlb(b) indicates that the item set extensions of b may contain high average-utility item sets. PB(b) therefore has to be scanned to compute the vaub values of the potential extension items and to obtain the number of potential extension items. The items occurring in PB(b) are a, c, d and e. Taking item a as an example, TS(a)|b consists of T'_3 = {(c, 6) (d, 2) (a, 2) (b, 10)}, so the items occurring in TS(a)|b are c, d and a. Thus, for items c and d in TS(a)|b, u(c, TS(a)|b) = 18 and u(d, TS(a)|b) = 8. For item a, the average utility of the union of the condition item set b and a has to be computed, i.e. au(ba, TS(a)|b) = (14 + 20)/2 = 17. Items c, d and e are handled in the same way, and the results are shown in Table 7. It follows that vaub(ba) = 18, vaub(bc) = 33, vaub(bd) = 36 and vaub(be) = 18. Meanwhile, matlb(ba) = min{mau(ba), mau(c), mau(d)} = min{17.5, 18, 22} = 17.5, matlb(bd) = min{mau(bd), mau(a), mau(c), mau(e)} = min{18.5, 20, 18, 14} = 14, matlb(be) = min{mau(be), mau(c), mau(d)} = min{14.5, 18, 22} = 14.5 and matlb(bc) = min{mau(bc), mau(a), mau(d), mau(e)} = min{16.5, 20, 22, 14} = 14. Therefore no item can be pruned from PEIs(b) (18 > 17.5, 33 > 14, 36 > 14, 18 > 14.5), and the current total order is a ≺ e ≺ c ≺ d. In the second scan of PB(b), each revised projected transaction is inserted into the b-conditional AUP-tree; the ivaub, lvaub and pei_lb fields of each head table entry are computed in the same way as for the global AUP-tree, and the resulting conditional AUP-tree is shown in Fig. 5.
Note that although item e is an extension item of item a according to the current total order, pei_lb(ba) = 18 (mau(c)) rather than 14 (mau(e)), because item e does not occur in TS(a)|b. In addition, the utility array of the b-conditional AUP-tree has to store the utility of the condition item set in each revised transaction; for example, the utility of item set b in T'_2 is 10.
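The two-dimensional array for condition item set b and the bounds derived from it can be reproduced with the following Python sketch. The function name condition_table, the "_set" bookkeeping key and the alphabetical tie-break between a and e are assumptions made for illustration; the projected transactions are utilities reconstructed from the worked numbers in this description.

```python
from fractions import Fraction

MAU = {"a": 20, "b": 15, "c": 18, "d": 22, "e": 14}

# PB(b) as item utilities (item f already removed by pruning strategy 1).
PB_b = [
    {"c": 18, "e": 9, "d": 8, "b": 10},    # T2
    {"b": 6, "a": 7, "d": 4, "c": 6},      # T3
    {"b": 14, "a": 7, "d": 4, "c": 12},    # T5
]


def condition_table(cond, projected):
    """rows[iq][j]: utility of j over TS(iq)|cond; rows[iq][iq] is au(cond + iq)."""
    rows = {}
    for t in projected:
        for iq in t:
            if iq in cond:
                continue
            row = rows.setdefault(iq, {"_set": 0})
            for j, u in t.items():
                if j in cond or j == iq:
                    row["_set"] += u        # utility of the extended item set
                else:
                    row[j] = row.get(j, 0) + u
    for iq, row in rows.items():
        row[iq] = Fraction(row.pop("_set"), len(cond) + 1)
    return rows


rows = condition_table("b", PB_b)
vaub = {iq: max(row.values()) for iq, row in rows.items()}
matlb = {
    iq: min([Fraction(MAU["b"] + MAU[iq], 2)] + [MAU[j] for j in row if j != iq])
    for iq, row in rows.items()
}
kept = [iq for iq in rows if vaub[iq] >= matlb[iq]]        # pruning strategy 3
order = sorted(kept, key=lambda iq: (vaub[iq], iq))        # tie-break is assumed
print(vaub, matlb, order)
```

None of the four extension items is pruned here, and sorting by vaub gives the order a, e, c, d used for the b-conditional AUP-tree.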
For the entry a in the head table of the b-conditional AUP-tree, imatlb(ba) = min{mau(ba), pei_lb(ba)} = min{17.5, 18} = 17.5 is computed first. Since ivaub(ba) > imatlb(ba) and lvaub(ba) > imatlb(ba), the node queue starting from the entry is traversed and au(ba) = 17 is computed, which shows that ba is not a high average-utility item set. The projection database of ba is then traversed, and the vaub values of the items d and c occurring in PB(ba) are computed: vaub(bac) = max{au(bac, TS(c)|ba), u(d, TS(c)|ba)} = max{17.33, 8} = 17.33 and vaub(bad) = max{au(bad, TS(d)|ba), u(c, TS(d)|ba)} = max{14, 18} = 18. Meanwhile, matlb(bac) = min{mau(bac), mau(d)} = min{17.67, 22} = 17.67 and matlb(bad) = min{mau(bad), mau(c)} = min{19, 18} = 18. Thus, according to pruning strategy 3, item c can be removed from PEIs(ba) (17.33 < 17.67), and the conditional AUP-tree constructed from PB(ba) is shown in Fig. 6 (it is constructed in the same way as the {b}-conditional AUP-tree, and each record in the utility array has to retain the utility of the condition item set ba). For the entry d in the head table of the ba-conditional AUP-tree, imatlb(bad) = min{mau(bad), pei_lb(bad)} = 19 is computed. Since ivaub(bad) < imatlb(bad) and lvaub(bad) < imatlb(bad), neither bad nor its supersets can be high average-utility item sets (pruning strategy 2). This ends the traversal of the head table of the ba-conditional AUP-tree. Returning to the b-conditional AUP-tree, Change_node_util_pointer(a) is called for each node in the node queue starting from head table entry a. After it has executed, the b-conditional AUP-tree is as shown in Fig. 7. The entries e, c and d in the head table of the b-conditional AUP-tree are processed in the same way as entry a. Note that after Change_node_util_pointer(e) has been called for item e, the b-conditional AUP-tree is as shown in Fig. 8.
Since the tree node with item tag c already has a non-null link to the utility array, the utilities of items d and c in T'_2 are accumulated into the corresponding items of T'_3, which reduces the size of the projection database of bc. Based on the above description, the high average-utility item sets are finally output in the order b:30, bc:33, bcd:27.33, bd:23, a:21, aec:17.33, ac:34.5, dec:20, dc:35.5, e:30, ec:46.5 and c:81, where the value associated with each item set is the average utility of the item set in the database.
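The listed output can be cross-checked by exhaustive enumeration, which is practical only for a toy database of this size. The sketch below is not AUPGrowth; it simply tests au(X) ≥ mau(X) for every item set, taking au(X) as the utility of X summed over the transactions containing X and divided by |X|, and mau(X) as the mean of the item thresholds, as in the worked examples. The database is reconstructed from the worked numbers in this description, the occurrence of f (utility 10, only in T_3) is an assumption, and brute_force_hauis is a hypothetical name.

```python
from fractions import Fraction
from itertools import combinations

# Item utilities per transaction; the entry for f is assumed.
DB = [
    {"c": 30, "e": 15, "a": 7},
    {"c": 18, "e": 9, "d": 8, "b": 10},
    {"b": 6, "a": 7, "d": 4, "c": 6, "f": 10},
    {"c": 15, "e": 6, "d": 4},
    {"b": 14, "a": 7, "d": 4, "c": 12},
]
MAU = {"a": 20, "b": 15, "c": 18, "d": 22, "e": 14, "f": 12}


def brute_force_hauis(db, mau):
    """Return {itemset: average utility} for every item set with au >= mau."""
    items = sorted(mau)
    found = {}
    for k in range(1, len(items) + 1):
        for x in combinations(items, k):
            support = [t for t in db if all(i in t for i in x)]
            if not support:
                continue
            au = Fraction(sum(t[i] for t in support for i in x), k)
            if au >= Fraction(sum(mau[i] for i in x), k):
                found["".join(x)] = au
    return found


hauis = brute_force_hauis(DB, MAU)
for name, au in hauis.items():
    print(name, round(float(au), 2))
```

Items inside each key are written alphabetically, so for example "ace" corresponds to aec and "cde" to dec in the output order above. Exact fractions matter here because au(aec) and mau(aec) are both exactly 52/3.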
Step 305, performing item set mining based on the global tree structure and the item set mining algorithm to determine the high average-utility item sets among the item sets.
In the embodiment of the invention, data mining can be carried out based on the global tree structure and the item set mining algorithm, and the high average-utility item sets are selected from the item sets.
In summary, in the embodiment of the present invention, a preset calculation method may be used to compute four average-utility upper bounds and three minimum-utility-threshold lower bounds of an item set, pruning strategies for the search space are determined from the results, and the high average-utility item sets are mined from the item sets according to the pruning strategies. This approach provides tighter upper bounds on the average utility of item sets and lower bounds on the minimum utility threshold of item sets, and establishes new pruning strategies on them, so that more unpromising item sets in the search space can be pruned and the high average-utility item sets can be mined quickly.
The invention proposes four upper bounds on item set average utility, vaub_1, vaub, ivaub and lvaub, analyzes their anti-monotonicity, and compares them numerically with the existing item set average-utility upper bounds aub_1, aub, iaub, laub and rtub. It also proposes three lower bounds on the item set minimum utility threshold, matlb_1, matlb and imatlb, discusses their anti-monotonicity, and compares them numerically with the existing lower bounds smau and LMAU. Because tighter upper bounds on item set average utility and lower bounds on the item set minimum utility threshold are used, more item sets that cannot be high average-utility item sets can be pruned from the search space. Based on these upper and lower bounds, the invention proposes three search space pruning strategies.
The new data structure AUP-tree provided by the invention stores the information needed to mine high average-utility item sets with multiple utility thresholds from the original database and the projection databases. The AUP-tree structure implements the identical-transaction merging strategy efficiently, reduces the sizes of the original database and the projection databases, and reduces the space complexity of the algorithm.
Based on the constructed global AUP-tree, the AUPGrowth algorithm is provided, which computes the average utility of candidate item sets by recursively creating conditional AUP-trees and obtains all high average-utility item sets.
Referring to Fig. 9, a flowchart of the steps of a global tree structure construction method according to an embodiment of the present invention is shown. The method is applied to the high-utility item set mining method of the above embodiment and may specifically include the following steps:
Step 901, pruning the item set in the original database according to the first pruning strategy, and sorting the pruned item set to obtain corresponding total order information.
In the embodiment of the invention, the item set in the original database can be pruned according to the first pruning strategy, and the pruned item set is sorted to obtain the corresponding total order information. In one example, in the first scan of the database, for each item i_p the utilities of the items occurring in TS(i_p) are computed and stored in a two-dimensional array, from which the vaub_1(i_p) value is obtained. At the same time, matlb_1(i_p) is computed over the items in TS(i_p). If the vaub_1 value of a single item is less than its matlb_1 value, the item can be safely removed from the database according to pruning strategy 1. The set of all items whose vaub_1 value is less than their matlb_1 value is recorded as SET, the remaining items are sorted in ascending order of vaub_1, and the resulting order is recorded as the current total order O.
Step 902, constructing a head table of the global tree structure by using the total order information.
After the total order information is obtained, a head table of the global tree structure can be constructed from it. In one example, the head table Header of the AUP-tree is constructed by inserting the items of the total order, in reverse order, into the item tag field of each entry of Header.
Step 903, constructing a prefix tree and a utility array of the global tree structure.
Step 904, constructing the global tree structure from the prefix tree, the head table and the utility array.
The global tree structure is composed of a prefix tree, a head table and a utility array, and after the head table is constructed, the corresponding prefix tree and the corresponding utility array need to be constructed, so that the global tree structure is created.
The above embodiment of the high utility item set mining method describes in detail the construction process of the global tree structure, and is not described here again to avoid repetition.
In summary, the embodiment of the present invention provides a new data structure, the AUP-tree, for storing the information needed to mine high-utility item sets under multiple utility thresholds from the original database and the projection databases. The AUP-tree structure can effectively implement an identical-transaction merging strategy, which reduces the size of the original database and the projection databases and lowers the space complexity of the algorithm.
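As a rough illustration of how a prefix tree, a head table and a utility array can work together to merge identical transactions, consider the sketch below. The node layout, the `slot` field and the class names are assumptions made for this example only; the section does not specify the actual fields of the AUP-tree.

```python
class Node:
    """Prefix-tree node; `slot` indexes the utility array (-1: no transaction ends here)."""

    def __init__(self, item):
        self.item = item
        self.children = {}
        self.slot = -1


class GlobalTree:
    def __init__(self, order):
        self.rank = {item: r for r, item in enumerate(order)}
        self.root = Node(None)
        # Head table: items inserted in reverse total order, each mapped
        # to the tree nodes that carry the item.
        self.header = {item: [] for item in reversed(order)}
        # Utility array: one row of per-item utilities per distinct transaction.
        self.utilities = []

    def insert(self, tx):
        # Keep only unpruned items, sorted by the total order.
        items = sorted((i for i in tx if i in self.rank), key=self.rank.get)
        if not items:
            return
        node = self.root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = Node(item)
                node.children[item] = child
                self.header[item].append(child)
            node = child
        row = [tx[i] for i in items]
        if node.slot == -1:
            # First transaction with this exact item sequence.
            node.slot = len(self.utilities)
            self.utilities.append(row)
        else:
            # Identical transaction: merge by adding utilities item by item.
            merged = self.utilities[node.slot]
            for k, util in enumerate(row):
                merged[k] += util
```

In this sketch two transactions with the same items share one path and one utility row, so the stored size grows with the number of distinct transactions rather than the number of transactions.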
Referring to fig. 10, a flowchart of the steps of an item set mining method according to an embodiment of the present invention is shown. The method is applied to the high-utility item set mining method of the above embodiment and may specifically include the following steps:
Step 1001, traversing the global tree structure, and performing item set pruning on the original database by using the second pruning strategy under the condition that the second pruning strategy is satisfied.
In the embodiment of the present invention, it may be determined whether a second pruning policy is satisfied for the item set in the global tree structure, and if so, pruning may be performed on the item set according to the second pruning policy.
Step 1002, traversing a projection database corresponding to the global tree structure, and performing item set pruning on the projection database by using the third pruning strategy under the condition that the third pruning strategy is satisfied.
In the embodiment of the present invention, it may be determined whether a third pruning policy is satisfied for the item set in the projection database, and if so, pruning may be performed on the item set according to the third pruning policy.
Step 1003, constructing a conditional global tree structure based on the pruned original database and the pruned projection database, and mining the high-utility item set based on the conditional global tree structure.
After item set pruning is carried out on the original database and the projection database, a conditional global tree structure can be constructed in the same way as the global tree structure, and the high-utility item set is mined from the conditional global tree structure.
The item set mining method (the AUPGrowth mining algorithm) is described in detail in the above embodiment of the high-utility item set mining method, and is not described here again to avoid repetition.
In summary, in the embodiment of the present invention, an AUPGrowth algorithm is proposed on the basis of the constructed global AUP-tree. The algorithm calculates the average utility of the candidate item sets by recursively creating conditional AUP-trees and obtains all high-utility item sets. Its running time is expected to improve by an order of magnitude compared with existing algorithms.
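The recursive shape of such a pattern-growth search can be sketched as follows. This is not the AUPGrowth algorithm itself: plain filtered lists stand in for conditional AUP-trees, no pruning strategy is applied, and two definitions are assumptions made for this example, namely that the threshold $\mathit{mau}(X)$ of an item set is the mean of its items' thresholds and that an item set qualifies when $\mathit{au}(X) \ge \mathit{mau}(X)$.

```python
def mine(db, order, mau):
    """Depth-first enumeration over projected databases, following the total order.

    db    : list of transactions, each a dict item -> utility (assumed layout)
    order : total order of the unpruned items
    mau   : dict item -> minimum average-utility threshold
    """
    rank = {item: r for r, item in enumerate(order)}
    results = {}

    def grow(prefix, projected):
        # `projected` holds the transactions containing every item of `prefix`.
        last = rank[prefix[-1]] if prefix else -1
        for item in order[last + 1:]:
            sub = [tx for tx in projected if item in tx]
            if not sub:
                continue
            itemset = prefix + [item]
            total = sum(tx[i] for tx in sub for i in itemset)
            au = total / len(itemset)
            # Assumed definition: item set threshold = mean of item thresholds.
            threshold = sum(mau[i] for i in itemset) / len(itemset)
            if au >= threshold:
                results[tuple(itemset)] = au
            grow(itemset, sub)  # recurse on the projected database

    grow([], db)
    return results
```

A real implementation would replace the list filtering with conditional tree construction and would cut branches using the second and third pruning strategies before recursing.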
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combinations of acts, but those skilled in the art will recognize that the present invention is not limited by the described order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are preferred embodiments and that the acts involved are not necessarily required by the embodiments of the present invention.
Referring to fig. 11, a block diagram of a structure of a high-utility item set mining apparatus according to an embodiment of the present invention is shown, which may specifically include the following modules:
a calculating module 1101, configured to obtain an item set to be mined in an original database, and calculate average utility upper bounds and minimum utility threshold lower bounds of the item set by using preset formulas to obtain a corresponding calculation result; the calculation result comprises a first average utility upper bound, a second average utility upper bound, a third average utility upper bound, a fourth average utility upper bound, a first minimum utility threshold lower bound, a second minimum utility threshold lower bound and a third minimum utility threshold lower bound; the first average utility upper bound and the second average utility upper bound are determined based on an average utility of the item set in the original database;
a first determining module 1102, configured to determine a pruning strategy for the item set search space according to the calculation result;
a second determining module 1103, configured to determine a high-utility item set from the item set according to the pruning strategy.
In an embodiment of the present invention, the pruning strategy includes a first pruning strategy, a second pruning strategy and a third pruning strategy, and the second determining module includes:
a construction submodule, configured to construct a global tree structure corresponding to the original database by adopting the first pruning strategy;
a first determining submodule, configured to determine an item set mining algorithm for calculating an average utility by using the second pruning strategy and the third pruning strategy;
a mining submodule, configured to perform item set mining based on the global tree structure and the item set mining algorithm to determine the high-utility item set from the item set.
In an embodiment of the present invention,
the calculation formula of the first average utility upper bound is as follows:
$\mathit{vaub}_1(X) = \max\{u(i_{c_1}, \mathit{TS}(X)), \ldots, u(i_{c_h}, \mathit{TS}(X)),\ u(X, \mathit{TS}(X))/|X|\}$;
the calculation formula of the second average utility upper bound is as follows:
$\mathit{vaub}(X) = \max\{u(i_{c_f}, \mathit{TS}(X)), \ldots, u(i_{c_h}, \mathit{TS}(X)),\ u(X, \mathit{TS}(X))/|X|\}$;
the calculation formula of the third average utility upper bound is as follows:
$\mathit{ivaub}(X) = \max\{u(i_{c_l}, \mathit{PB}(X)), \ldots, u(i_{c_h}, \mathit{PB}(X)),\ u(X, \mathit{PB}(X))/|X|\}$;
the calculation formula of the fourth average utility upper bound is as follows:
$\mathit{lvaub}(X) = \mathit{au}(X) + \max\{u(i_{c_l}, \mathit{PB}(X)), \ldots, u(i_{c_h}, \mathit{PB}(X))\} \times \mathit{MN} / (|X| + \mathit{MN})$.
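To make the first and fourth upper bounds concrete, the sketch below evaluates them on transactions stored as dicts from item to utility. The function names and data layout are hypothetical. The fourth bound is read with ordinary operator precedence, that is, as $\mathit{au}(X)$ plus the quotient term, which is an interpretation of the formula as printed.

```python
def vaub1(itemset, db):
    """First upper bound: max of co-occurring item utilities and u(X, TS(X)) / |X|."""
    # TS(X): transactions containing every item of X.
    ts = [tx for tx in db if all(i in tx for i in itemset)]
    if not ts:
        return 0.0
    x_avg = sum(tx[i] for tx in ts for i in itemset) / len(itemset)
    # u(i_c, TS(X)) for every item i_c of TS(X) that is not in X.
    others = {}
    for tx in ts:
        for item, util in tx.items():
            if item not in itemset:
                others[item] = others.get(item, 0) + util
    return max([x_avg, *others.values()])


def lvaub(au_x, ext_utils, size_x, mn):
    """Fourth upper bound, read as au(X) + max{...} * MN / (|X| + MN).

    ext_utils : utilities u(i_c, PB(X)) of the potential extension items
    size_x    : |X|
    mn        : MN, the maximum number of potential extension items
    """
    return au_x + max(ext_utils) * mn / (size_x + mn)
```

For example, with transactions `{a:5, b:2, c:1}` and `{a:10, c:6}`, the item set `{a, c}` has $u(X, \mathit{TS}(X)) = 22$, so `vaub1` returns 11.0, and with one extension item of utility 2 and `MN = 1` the fourth bound is 11 + 2/3.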
In an embodiment of the present invention,
the calculation formula of the first minimum utility threshold lower bound is as follows:
$\mathit{matlb}_1(X) = \min\{\mathit{mau}(X), \mathit{mau}(i_{c_1}), \mathit{mau}(i_{c_2}), \ldots, \mathit{mau}(i_{c_h})\}$;
the calculation formula of the second minimum utility threshold lower bound is as follows:
$\mathit{matlb}(X) = \min\{\mathit{mau}(X), \mathit{mau}(i_{c_f}), \mathit{mau}(i_{c_2}), \ldots, \mathit{mau}(i_{c_h})\}$;
the calculation formula of the third minimum utility threshold lower bound is as follows:
$\mathit{imatlb}(X) = \min\{\mathit{mau}(X), \mathit{mau}(i_{c_l}), \mathit{mau}(i_{c_2}), \ldots, \mathit{mau}(i_{c_h})\}$.
In an embodiment of the present invention, the first average utility upper bound has a superset anti-monotonic property; the second average utility upper bound has a bidirectional-extension anti-monotonic property; the third average utility upper bound has an item set extension anti-monotonic property; the fourth average utility upper bound provides a depth pruning condition.
In an embodiment of the present invention, the first minimum utility threshold lower bound has a superset anti-monotonic property; the second minimum utility threshold lower bound has a bidirectional-extension anti-monotonic property; the third minimum utility threshold lower bound has an item set extension anti-monotonic property.
In an embodiment of the present invention, the first determining module includes:
a second determining submodule, configured to determine, in the first pruning strategy, the maximum of the first, second, third and fourth average utility upper bounds, and the minimum of the first, second and third minimum utility threshold lower bounds;
a first removal submodule, configured to remove the item set and the supersets of the item set from the original database if the maximum value is smaller than the minimum value.
In an embodiment of the present invention, the first determining module includes:
a deleting submodule, configured to, in the second pruning strategy, delete from the global tree structure the subtree structure generated by item set extension of the item set if the third average utility upper bound is smaller than the third minimum utility threshold lower bound, or if the fourth average utility upper bound is smaller than the third minimum utility threshold lower bound.
In an embodiment of the present invention, the first determining module includes:
a third determining submodule, configured to determine, in the third pruning strategy, potential extension items of the item set, and determine, in a projection database of the item set, a transaction set including the potential extension items of the item set;
a second removing submodule, configured to remove, within the transaction set, a potential extension item of the item set from the set of potential extension items of the item set if the second average utility upper bound is smaller than the second minimum utility threshold lower bound.
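The three pruning conditions described by the submodules above reduce to simple comparisons once the bounds are known. The sketch below states them as predicates; the function names are hypothetical and the bound values are assumed to have been computed elsewhere.

```python
def prune_by_strategy_1(upper_bounds, lower_bounds):
    """Strategy 1: remove X and its supersets when the largest of the four
    upper bounds is below the smallest of the three lower bounds."""
    return max(upper_bounds) < min(lower_bounds)


def prune_by_strategy_2(ivaub, lvaub, imatlb):
    """Strategy 2: delete the subtree produced by extending X when either the
    third or the fourth upper bound is below the third lower bound."""
    return ivaub < imatlb or lvaub < imatlb


def prune_by_strategy_3(vaub, matlb):
    """Strategy 3: drop a potential extension item when the second upper bound
    is below the second lower bound within the transaction set."""
    return vaub < matlb
```

All three tests use a strict comparison, as in the text: an item set whose upper bound equals the lower bound is kept.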
In summary, in the embodiment of the present invention, preset formulas may be used to calculate four average utility upper bounds and three minimum utility threshold lower bounds of the item set. A pruning strategy for the search space is determined from the calculation result, and the high-utility item set is mined from the item set according to the pruning strategy. This approach provides tighter average utility upper bounds and minimum utility threshold lower bounds for item sets, and establishes new pruning strategies on top of them, so that more unpromising item sets in the search space can be pruned effectively and the high-utility item sets can be mined quickly.
Since the device embodiment is basically similar to the method embodiment, its description is relatively brief; for relevant details, refer to the description of the method embodiment.
An embodiment of the present invention further provides an electronic device, including a processor, a memory, and a computer program stored in the memory and executable on the processor. When executed by the processor, the computer program implements each process of the above-mentioned embodiment of the high-utility item set mining method, or each process of the above-mentioned embodiment of the global tree structure construction method, or each process of the above-mentioned embodiment of the item set mining method, and can achieve the same technical effect, which is not described here again to avoid repetition.
The embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when executed by a processor, the computer program implements each process of the above-mentioned high-utility item set mining method embodiment, or implements each process of the above-mentioned global tree structure construction method embodiment, or implements each process of the above-mentioned item set mining method embodiment, and can achieve the same technical effect, and is not described here again to avoid repetition.
The embodiments in the present specification are all described in a progressive manner, and each embodiment focuses on differences from other embodiments, and portions that are the same and similar between the embodiments may be referred to each other.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.
The high-utility item set mining method, the global tree structure construction method, the item set mining method, the high-utility item set mining device, the electronic equipment and the computer-readable storage medium provided by the invention are introduced in detail, and specific examples are applied in the text to explain the principle and the implementation mode of the invention, and the description of the above embodiments is only used for helping to understand the method and the core idea of the invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims (11)

1. A high-utility item set mining method is characterized by comprising the following steps:
acquiring an item set to be mined in an original database, and calculating average utility upper bounds and minimum utility threshold lower bounds of the item set by adopting preset formulas to obtain a corresponding calculation result; the calculation result comprises a first average utility upper bound, a second average utility upper bound, a third average utility upper bound, a fourth average utility upper bound, a first minimum utility threshold lower bound, a second minimum utility threshold lower bound and a third minimum utility threshold lower bound; the first average utility upper bound and the second average utility upper bound are determined based on an average utility of the item set in the original database;
determining a pruning strategy for the item set search space according to the calculation result; the pruning strategies comprise a first pruning strategy, a second pruning strategy and a third pruning strategy;
constructing a global tree structure corresponding to the original database by adopting the first pruning strategy;
traversing the global tree structure, and carrying out item set pruning on the original database by adopting the second pruning strategy under the condition that the second pruning strategy is met;
traversing a projection database corresponding to the global tree structure, and performing item set pruning on the projection database by adopting a third pruning strategy under the condition that the third pruning strategy is met;
constructing a conditional global tree structure based on the pruned original database and the projection database, and mining a high-utility item set from the item set based on the conditional global tree structure;
the calculation formula of the first average utility upper bound is as follows:
$\mathit{vaub}_1(X) = \max\{u(i_{c_1}, \mathit{TS}(X)), \ldots, u(i_{c_h}, \mathit{TS}(X)),\ u(X, \mathit{TS}(X))/|X|\}$;
wherein $X$ is an item set, $X = i_{a_1} i_{a_2} \ldots i_{a_k}$, $1 \le k$, $1 \le a_k$; $\mathit{vaub}_1(X)$ is the first average utility upper bound of $X$; $\mathit{TS}(X)$ is the set of transactions in the database that contain $X$; the set of items appearing in $\mathit{TS}(X)$ is denoted $i_{c_1} i_{c_2} \ldots i_{c_h} i_{a_1} i_{a_2} \ldots i_{a_k}$, $1 \le h$, $1 \le c_h$; $|X|$ is the length of item set $X$; $u(i_{c_1}, \mathit{TS}(X))$ is the utility of item $i_{c_1}$ in $\mathit{TS}(X)$, and so on; $u(X, \mathit{TS}(X))$ is the utility of item set $X$ in $\mathit{TS}(X)$;
the calculation formula of the second average utility upper bound is as follows:
$\mathit{vaub}(X) = \max\{u(i_{c_f}, \mathit{TS}(X)), \ldots, u(i_{c_h}, \mathit{TS}(X)),\ u(X, \mathit{TS}(X))/|X|\}$;
wherein $X$ is an item set, $X = i_{a_1} i_{a_2} \ldots i_{a_k}$, $1 \le k$, $1 \le a_k$; $\mathit{vaub}(X)$ is the second average utility upper bound of $X$; $\mathit{TS}(X)$ is the set of transactions in the database that contain $X$; the set of items appearing in $\mathit{TS}(X)$ is denoted $i_{c_1} i_{c_2} \ldots i_{c_h} i_{a_1} i_{a_2} \ldots i_{a_k}$, $1 \le h$, $1 \le c_h$; item $i_{c_f}$ is the first item of the item set $i_{c_1} i_{c_2} \ldots i_{c_h}$ that is ranked after $i_{a_1}$ according to the total order; $|X|$ is the length of item set $X$; $u(i_{c_f}, \mathit{TS}(X))$ is the utility of item $i_{c_f}$ in $\mathit{TS}(X)$, and so on; $u(X, \mathit{TS}(X))$ is the utility of item set $X$ in $\mathit{TS}(X)$;
the calculation formula of the third average utility upper bound is as follows:
$\mathit{ivaub}(X) = \max\{u(i_{c_l}, \mathit{PB}(X)), \ldots, u(i_{c_h}, \mathit{PB}(X)),\ u(X, \mathit{PB}(X))/|X|\}$;
wherein $X$ is an item set, $X = i_{a_1} i_{a_2} \ldots i_{a_k}$, $1 \le k$, $1 \le a_k$; $\mathit{ivaub}(X)$ is the third average utility upper bound of $X$; $\mathit{PB}(X)$ is the set of all projected transactions of $X$ in the database; the set of items appearing in $\mathit{PB}(X)$ is denoted $i_{a_1} i_{a_2} \ldots i_{a_k} i_{c_l} i_{c_2} \ldots i_{c_h}$, $1 \le h$, $1 \le c_h$; the items $i_{c_j}$ ($l \le j \le h$, $1 \le l$) are all the items of the item set $i_{c_l} i_{c_2} \ldots i_{c_h}$ that are ranked after $i_{a_k}$ according to the total order; $|X|$ is the length of item set $X$; $u(i_{c_l}, \mathit{PB}(X))$ is the utility of item $i_{c_l}$ in $\mathit{PB}(X)$, and so on; $u(X, \mathit{PB}(X))$ is the utility of item set $X$ in $\mathit{PB}(X)$;
the calculation formula of the fourth average utility upper bound is as follows:
$\mathit{lvaub}(X) = \mathit{au}(X) + \max\{u(i_{c_l}, \mathit{PB}(X)), \ldots, u(i_{c_h}, \mathit{PB}(X))\} \times \mathit{MN} / (|X| + \mathit{MN})$;
wherein $X$ is an item set, $X = i_{a_1} i_{a_2} \ldots i_{a_k}$, $1 \le k$, $1 \le a_k$; $\mathit{lvaub}(X)$ is the fourth average utility upper bound of $X$; $\mathit{au}(X)$ is the average utility of $X$ in the transaction database $\mathit{DB}$; $\mathit{PB}(X)$ is the set of all projected transactions of $X$ in the database; the set of items appearing in $\mathit{PB}(X)$ is denoted $i_{a_1} i_{a_2} \ldots i_{a_k} i_{c_l} i_{c_2} \ldots i_{c_h}$, $1 \le h$, $1 \le c_h$; the items $i_{c_j}$ ($l \le j \le h$, $1 \le l$) are all the items of the item set $i_{c_l} i_{c_2} \ldots i_{c_h}$ that are ranked after $i_{a_k}$ according to the total order; $|X|$ is the length of item set $X$; $u(i_{c_l}, \mathit{PB}(X))$ is the utility of item $i_{c_l}$ in $\mathit{PB}(X)$, and so on; $\mathit{MN}$ is the maximum number of potential extension items of $X$ in the projection database $\mathit{PB}(X)$.
2. The method of claim 1,
the calculation formula of the first minimum utility threshold lower bound is as follows:
$\mathit{matlb}_1(X) = \min\{\mathit{mau}(X), \mathit{mau}(i_{c_1}), \mathit{mau}(i_{c_2}), \ldots, \mathit{mau}(i_{c_h})\}$;
wherein $X$ is an item set, $X = i_{a_1} i_{a_2} \ldots i_{a_l}$, $1 \le l$, $1 \le a_l$; $\mathit{matlb}_1(X)$ is the first minimum utility threshold lower bound of $X$; $\mathit{TS}(X)$ is the set of transactions in the database that contain $X$; the set of items appearing in $\mathit{TS}(X)$ is denoted $i_{a_1} i_{a_2} \ldots i_{a_l} i_{c_1} i_{c_2} \ldots i_{c_h}$, $1 \le h$, $1 \le c_h$; $\mathit{mau}(i_{c_1})$ is the minimum utility threshold of item $i_{c_1}$, and so on; $\mathit{mau}(X)$ is the minimum utility threshold of $X$;
the calculation formula of the second minimum utility threshold lower bound is as follows:
$\mathit{matlb}(X) = \min\{\mathit{mau}(X), \mathit{mau}(i_{c_f}), \mathit{mau}(i_{c_2}), \ldots, \mathit{mau}(i_{c_h})\}$;
wherein $X$ is an item set, $X = i_{a_1} i_{a_2} \ldots i_{a_l}$, $1 \le l$, $1 \le a_l$; $\mathit{matlb}(X)$ is the second minimum utility threshold lower bound of $X$; $\mathit{TS}(X)$ is the set of transactions in the database that contain $X$; the set of items appearing in $\mathit{TS}(X)$ is denoted $i_{a_1} i_{a_2} \ldots i_{a_l} i_{c_1} i_{c_2} \ldots i_{c_h}$, $1 \le h$, $1 \le c_h$; item $i_{c_f}$ is the first item of the item set $i_{c_1} i_{c_2} \ldots i_{c_h}$ that is ranked after $i_{a_1}$ according to the total order; $\mathit{mau}(i_{c_f})$ is the minimum utility threshold of item $i_{c_f}$, and so on; $\mathit{mau}(X)$ is the minimum utility threshold of $X$;
the calculation formula of the third minimum utility threshold lower bound is as follows:
$\mathit{imatlb}(X) = \min\{\mathit{mau}(X), \mathit{mau}(i_{c_l}), \mathit{mau}(i_{c_2}), \ldots, \mathit{mau}(i_{c_h})\}$;
wherein $X$ is an item set, $X = i_{a_1} i_{a_2} \ldots i_{a_l}$, $1 \le l$, $1 \le a_l$; $\mathit{imatlb}(X)$ is the third minimum utility threshold lower bound of $X$; $\mathit{TS}(X)$ is the set of transactions in the database that contain $X$; the set of items appearing in $\mathit{TS}(X)$ is denoted $i_{a_1} i_{a_2} \ldots i_{a_l} i_{c_1} i_{c_2} \ldots i_{c_h}$, $1 \le h$, $1 \le c_h$; item $i_{c_l}$ is the first item of the item set $i_{c_1} i_{c_2} \ldots i_{c_h}$ that is ranked after $i_{a_l}$ according to the total order; $\mathit{mau}(i_{c_l})$ is the minimum utility threshold of item $i_{c_l}$, and so on; $\mathit{mau}(X)$ is the minimum utility threshold of $X$.
3. The method of claim 1, wherein the first average utility upper bound has a superset anti-monotonic property; the second average utility upper bound has a bidirectional-extension anti-monotonic property; the third average utility upper bound has an item set extension anti-monotonic property; the fourth average utility upper bound provides a depth pruning condition.
4. The method of claim 2, wherein the first minimum utility threshold lower bound has a superset anti-monotonic property; the second minimum utility threshold lower bound has a bidirectional-extension anti-monotonic property; the third minimum utility threshold lower bound has an item set extension anti-monotonic property.
5. The method of claim 1, wherein determining a pruning strategy for the item set search space according to the calculation result comprises:
in the first pruning strategy, determining the maximum of the first, second, third and fourth average utility upper bounds, and determining the minimum of the first, second and third minimum utility threshold lower bounds;
if the maximum value is less than the minimum value, removing the item set and the supersets of the item set from the original database.
6. The method of claim 1, wherein determining a pruning strategy for the item set search space according to the calculation result comprises:
in the second pruning strategy, if the third average utility upper bound is smaller than the third minimum utility threshold lower bound, or the fourth average utility upper bound is smaller than the third minimum utility threshold lower bound, deleting from the global tree structure the subtree structure generated by item set extension of the item set.
7. The method of claim 1, wherein determining a pruning strategy for the item set search space according to the calculation result comprises:
in the third pruning strategy, determining potential extension items of the item set, and determining, in a projection database of the item set, a transaction set comprising the potential extension items of the item set;
in the transaction set, if the second average utility upper bound is less than the second minimum utility threshold lower bound, removing the potential extension item of the item set from the set of potential extension items of the item set.
8. A global tree structure construction method, applied to the high-utility item set mining method of claim 1, the method comprising:
pruning the item set in the original database according to the first pruning strategy, and sorting the pruned item set to obtain corresponding total order information;
constructing a head table of the global tree structure by adopting the total order information;
constructing a prefix tree and a utility array of the global tree structure;
and constructing the global tree structure according to the prefix tree, the head table and the utility array.
9. A high-utility item set mining apparatus, the apparatus comprising:
a calculation module, configured to acquire an item set to be mined in an original database, and calculate average utility upper bounds and minimum utility threshold lower bounds of the item set by adopting preset formulas to obtain a corresponding calculation result; the calculation result comprises a first average utility upper bound, a second average utility upper bound, a third average utility upper bound, a fourth average utility upper bound, a first minimum utility threshold lower bound, a second minimum utility threshold lower bound and a third minimum utility threshold lower bound; the first average utility upper bound and the second average utility upper bound are determined based on an average utility of the item set in the original database;
a first determining module, configured to determine a pruning strategy for the item set search space according to the calculation result; the pruning strategies comprise a first pruning strategy, a second pruning strategy and a third pruning strategy;
the apparatus is further configured to construct a global tree structure corresponding to the original database by adopting the first pruning strategy; traverse the global tree structure, and perform item set pruning on the original database by adopting the second pruning strategy under the condition that the second pruning strategy is met; traverse a projection database corresponding to the global tree structure, and perform item set pruning on the projection database by adopting the third pruning strategy under the condition that the third pruning strategy is met; construct a conditional global tree structure based on the pruned original database and the projection database, and mine a high-utility item set from the item set based on the conditional global tree structure;
the calculation formula of the first average utility upper bound is as follows:
$\mathit{vaub}_1(X) = \max\{u(i_{c_1}, \mathit{TS}(X)), \ldots, u(i_{c_h}, \mathit{TS}(X)),\ u(X, \mathit{TS}(X))/|X|\}$;
wherein $X$ is an item set, $X = i_{a_1} i_{a_2} \ldots i_{a_k}$, $1 \le k$, $1 \le a_k$; $\mathit{vaub}_1(X)$ is the first average utility upper bound of $X$; $\mathit{TS}(X)$ is the set of transactions in the database that contain $X$; the set of items appearing in $\mathit{TS}(X)$ is denoted $i_{c_1} i_{c_2} \ldots i_{c_h} i_{a_1} i_{a_2} \ldots i_{a_k}$, $1 \le h$, $1 \le c_h$; $|X|$ is the length of item set $X$; $u(i_{c_1}, \mathit{TS}(X))$ is the utility of item $i_{c_1}$ in $\mathit{TS}(X)$, and so on; $u(X, \mathit{TS}(X))$ is the utility of item set $X$ in $\mathit{TS}(X)$;
the calculation formula of the second average utility upper bound is as follows:
$\mathit{vaub}(X) = \max\{u(i_{c_f}, \mathit{TS}(X)), \ldots, u(i_{c_h}, \mathit{TS}(X)),\ u(X, \mathit{TS}(X))/|X|\}$;
wherein $X$ is an item set, $X = i_{a_1} i_{a_2} \ldots i_{a_k}$, $1 \le k$, $1 \le a_k$; $\mathit{vaub}(X)$ is the second average utility upper bound of $X$; $\mathit{TS}(X)$ is the set of transactions in the database that contain $X$; the set of items appearing in $\mathit{TS}(X)$ is denoted $i_{c_1} i_{c_2} \ldots i_{c_h} i_{a_1} i_{a_2} \ldots i_{a_k}$, $1 \le h$, $1 \le c_h$; item $i_{c_f}$ is the first item of the item set $i_{c_1} i_{c_2} \ldots i_{c_h}$ that is ranked after $i_{a_1}$ according to the total order; $|X|$ is the length of item set $X$; $u(i_{c_f}, \mathit{TS}(X))$ is the utility of item $i_{c_f}$ in $\mathit{TS}(X)$, and so on; $u(X, \mathit{TS}(X))$ is the utility of item set $X$ in $\mathit{TS}(X)$;
the calculation formula of the third average utility upper bound is as follows:
$\mathit{ivaub}(X) = \max\{u(i_{c_l}, \mathit{PB}(X)), \ldots, u(i_{c_h}, \mathit{PB}(X)),\ u(X, \mathit{PB}(X))/|X|\}$;
wherein $X$ is an item set, $X = i_{a_1} i_{a_2} \ldots i_{a_k}$, $1 \le k$, $1 \le a_k$; $\mathit{ivaub}(X)$ is the third average utility upper bound of $X$; $\mathit{PB}(X)$ is the set of all projected transactions of $X$ in the database; the set of items appearing in $\mathit{PB}(X)$ is denoted $i_{a_1} i_{a_2} \ldots i_{a_k} i_{c_l} i_{c_2} \ldots i_{c_h}$, $1 \le h$, $1 \le c_h$; the items $i_{c_j}$ ($l \le j \le h$, $1 \le l$) are all the items of the item set $i_{c_l} i_{c_2} \ldots i_{c_h}$ that are ranked after $i_{a_k}$ according to the total order; $|X|$ is the length of item set $X$; $u(i_{c_l}, \mathit{PB}(X))$ is the utility of item $i_{c_l}$ in $\mathit{PB}(X)$, and so on; $u(X, \mathit{PB}(X))$ is the utility of item set $X$ in $\mathit{PB}(X)$;
the calculation formula of the fourth average utility upper bound is as follows:
$\mathit{lvaub}(X) = \mathit{au}(X) + \max\{u(i_{c_l}, \mathit{PB}(X)), \ldots, u(i_{c_h}, \mathit{PB}(X))\} \times \mathit{MN} / (|X| + \mathit{MN})$;
wherein $X$ is an item set, $X = i_{a_1} i_{a_2} \ldots i_{a_k}$, $1 \le k$, $1 \le a_k$; $\mathit{lvaub}(X)$ is the fourth average utility upper bound of $X$; $\mathit{au}(X)$ is the average utility of $X$ in the transaction database $\mathit{DB}$; $\mathit{PB}(X)$ is the set of all projected transactions of $X$ in the database; the set of items appearing in $\mathit{PB}(X)$ is denoted $i_{a_1} i_{a_2} \ldots i_{a_k} i_{c_l} i_{c_2} \ldots i_{c_h}$, $1 \le h$, $1 \le c_h$; the items $i_{c_j}$ ($l \le j \le h$, $1 \le l$) are all the items of the item set $i_{c_l} i_{c_2} \ldots i_{c_h}$ that are ranked after $i_{a_k}$ according to the total order; $|X|$ is the length of item set $X$; $u(i_{c_l}, \mathit{PB}(X))$ is the utility of item $i_{c_l}$ in $\mathit{PB}(X)$, and so on; $\mathit{MN}$ is the maximum number of potential extension items of $X$ in the projection database $\mathit{PB}(X)$.
10. An electronic device, comprising: a processor, a memory, and a computer program stored in the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the high-utility item set mining method according to any one of claims 1 to 7, or the steps of the global tree structure building method according to claim 8.
11. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the high-utility item set mining method according to any one of claims 1 to 7, or the steps of the global tree structure building method according to claim 8.
CN202210389910.8A 2022-04-14 2022-04-14 High-utility item set mining method and device, electronic equipment and medium Active CN114490835B (en)

Publications (2)

Publication Number Publication Date
CN114490835A (en) 2022-05-13
CN114490835B (en) 2022-09-06

Family

ID=81488772

Country Status (1)

Country Link
CN (1) CN114490835B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117407442B (en) * 2023-12-11 2024-03-19 Zhuhai Dahengqin Technology Development Co Ltd Mining method and device for judging high utility mode, electronic equipment and medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108733705A (en) * 2017-04-20 2018-11-02 Harbin Institute of Technology Shenzhen Graduate School A kind of effective sequential mode mining method and device
CN109460424A (en) * 2018-10-18 2019-03-12 Harbin Institute of Technology (Shenzhen) Effective sequence pattern processing method, device and computer equipment
CN110188131A (en) * 2019-06-03 2019-08-30 Northwestern Polytechnical University A kind of Frequent Pattern Mining method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
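Vertical mining algorithm for high average-utility itemsets based on optimized upper bounds; Pu Rong et al.; Computer Engineering & Science; 2020-05-15; Vol. 42, No. 05; pp. 931-937 *
Improved algorithm for mining frequent and high-utility itemsets; Zhang Jian et al.; Journal of Huaqiao University (Natural Science); 2017-11-20; Vol. 38, No. 06; pp. 880-885 *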

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant