WO2018059298A1 - Pattern mining method, efficient item set mining method, and related devices - Google Patents


Publication number
WO2018059298A1
WO2018059298A1 PCT/CN2017/102663 CN2017102663W WO2018059298A1 WO 2018059298 A1 WO2018059298 A1 WO 2018059298A1 CN 2017102663 W CN2017102663 W CN 2017102663W WO 2018059298 A1 WO2018059298 A1 WO 2018059298A1
Authority
WO
WIPO (PCT)
Prior art keywords
utility
transaction
item
item set
value
Prior art date
Application number
PCT/CN2017/102663
Other languages
English (en)
French (fr)
Inventor
林浚玮
肖磊
陈伟
张杰雄
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN201610856770.5A (CN107870939B)
Priority claimed from CN201610866557.2A (CN107870956B)
Application filed by 腾讯科技(深圳)有限公司
Publication of WO2018059298A1
Priority to US16/022,891 (US10776347B2)


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23 Updating
    • G06F16/2379 Updates performed during online database operations; commit processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2246 Trees, e.g. B+trees
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23 Updating
    • G06F16/2365 Ensuring data consistency and integrity
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/06 Buying, selling or leasing transactions

Definitions

  • The present application relates to the field of data mining, and in particular to a pattern mining method and apparatus, and to an efficient item set mining method, apparatus, and data processing device.
  • A transaction database is a database that records transactions, such as trades or news events.
  • A transaction database usually records at least one transaction, and each transaction includes at least one data item, that is, an item. For example, a trade-type transaction database may record at least one transaction about trade records, and a transaction may include at least one data item (a data item may correspond to a product name) together with the trade quantity of each data item.
  • To represent the association rules between data items in the transaction database, at least one data item is aggregated to form an item set.
  • Because a transaction database, such as a trade-type one, often reflects users' preferences, the item sets recommended to a user are often mined from the item sets formed from the transaction database. During mining, it is often necessary to consider item sets with higher utility values, referred to as high-utility (efficient) item sets.
  • An efficient item set is an item set with a higher utility value, and it often contains one or more data items. It is therefore particularly necessary to take the utility value of each data item in an item set into account so as to improve the accuracy of the mined efficient item sets.
  • The present application provides a pattern mining method and device, and an efficient item set mining method, device, and data processing device, for improving the accuracy of the mined efficient item sets.
  • A pattern mining method, including:
  • wherein each transaction includes at least one item; each candidate pattern in the candidate pattern set includes items from at least one item set; and the item set is a set generated from the items in each transaction;
  • A pattern mining device, including:
  • a candidate pattern set obtaining unit, configured to acquire, according to each transaction included in the database, a candidate pattern set that satisfies a set condition, wherein each transaction includes at least one item, each candidate pattern in the candidate pattern set includes items from at least one item set, and the item set is a set generated from the items in each transaction;
  • a utility value calculating unit, configured to calculate, for each candidate pattern in the candidate pattern set, the utility value of the candidate pattern in each transaction;
  • a target transaction determining unit, configured to determine the target transactions in which the utility value reaches a set utility threshold;
  • a candidate pattern period value determining unit, configured to determine the period value of the candidate pattern according to the time attribute of each target transaction; and
  • a mining result determining unit, configured to determine the candidate pattern as a mining result if the period value of the candidate pattern is less than or equal to the set period threshold.
  • A pattern mining device, comprising a memory and a processor, wherein:
  • the memory is configured to store a computer program; and
  • the processor is configured to read the computer program and execute the pattern mining method according to executable instructions in the computer program.
  • A storage medium is provided for storing program code for executing the above pattern mining method.
  • A computer program product is provided, comprising instructions which, when run on a computer, cause the computer to perform the pattern mining method described above.
  • The foregoing pattern mining method and related devices calculate, for each candidate pattern in the acquired candidate pattern set, its utility value in each transaction, and delete the transactions whose utility value is less than the set utility threshold. Because the pattern's utility value in these transactions is too small, deleting them reduces the mining computation time. The period value of the candidate pattern is then determined according to the time attributes of the target transactions remaining after the deletion, and the candidate pattern is determined as a mining result when its period value is less than or equal to the set period threshold. This ensures that the utility values of the mined patterns are evenly distributed over time, which makes accurate decisions easier and makes the mining results more accurate.
  • An efficient item set mining method, including:
  • wherein the item set utility value of an item set is the sum of the utility values of the item set in each of its target transactions; a target transaction of an item set is a transaction that contains all the data items of the item set; and the utility value of an item set in a target transaction is the sum of the utility values of the item set's data items in that transaction;
  • wherein a predefined minimum utility threshold table records the minimum utility threshold of each data item, and the item set minimum utility threshold of an item set is the smallest of the minimum utility thresholds of the data items contained in the item set.
  • An efficient item set mining device, including:
  • an item set utility value determining module, configured to determine the item set utility value of each item set in the transaction database, wherein the item set utility value of an item set is the sum of the utility values of the item set in each of its target transactions, a target transaction of an item set is a transaction containing all data items of the item set, and the utility value of an item set in a target transaction is the sum of the utility values of the item set's data items in that transaction; and
  • an efficient item set determining module, configured to compare the item set utility value of each item set with the corresponding item set minimum utility threshold and determine the efficient item sets according to the comparison result, wherein the item set utility value of an efficient item set is not less than the corresponding item set minimum utility threshold.
  • a data processing apparatus comprising the efficient item set mining apparatus of the seventh aspect described above.
  • An efficient item set mining device, comprising a memory and a processor, wherein:
  • the memory is configured to store a computer program; and
  • the processor is configured to read the computer program and execute the efficient item set mining method according to executable instructions in the computer program.
  • A storage medium is provided for storing program code for executing the above efficient item set mining method.
  • Also provided in an eleventh aspect of the present application is a computer program product comprising instructions which, when run on a computer, cause the computer to perform the efficient item set mining method described above.
  • In the embodiments of the present application, a minimum utility threshold table is defined that records the minimum utility threshold of each data item. When the item set minimum utility threshold of an item set is determined, the minimum utility thresholds of the data items in the item set are compared, and the smallest of them is used as the item set minimum utility threshold, so that the threshold determined for the item set is closer to the minimum utility of the item set. The item set utility value of each item set is then compared with the corresponding item set minimum utility threshold to determine the efficient item sets, whose item set utility value is not less than the corresponding item set minimum utility threshold.
  • The efficient item set mining method and related devices provided by the present application therefore do not use a single fixed minimum utility threshold as the mining standard for efficient item sets. Instead, the smallest of the minimum utility thresholds of the data items in each item set is used as that item set's minimum utility threshold, so that the threshold is closer to the minimum utility of the item set. The item set utility value of each item set is then compared with its item set minimum utility threshold to mine the efficient item sets, which improves the accuracy of efficient item set mining.
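  • As a minimal sketch of the comparison described above, the following Python snippet computes an item set's utility value over its target transactions and compares it with the smallest per-item threshold. The per-transaction utilities, the threshold table, and all function names are hypothetical values chosen for illustration; they are not taken from the application.

```python
# Hypothetical data: utility of each data item in each transaction
# (transaction id -> data item -> utility value of that item in the transaction).
transactions = {
    1: {"a": 4, "b": 3, "c": 1},
    2: {"a": 2, "b": 6, "d": 9},
    3: {"b": 3, "c": 3, "d": 12},
}

# Hypothetical minimum utility threshold table: one threshold per data item.
min_utility_table = {"a": 5, "b": 10, "c": 8, "d": 20}


def itemset_utility(itemset, transactions):
    """Sum, over all target transactions, of the utilities of the item set's items."""
    total = 0
    for items in transactions.values():
        # A target transaction contains every data item of the item set.
        if all(item in items for item in itemset):
            total += sum(items[item] for item in itemset)
    return total


def itemset_min_threshold(itemset, table):
    """Smallest of the minimum utility thresholds of the item set's data items."""
    return min(table[item] for item in itemset)


def is_efficient(itemset, transactions, table):
    """An item set is efficient if its utility is not less than its threshold."""
    return itemset_utility(itemset, transactions) >= itemset_min_threshold(itemset, table)


print(itemset_utility(("a", "b"), transactions))                   # 15
print(itemset_min_threshold(("a", "b"), min_utility_table))        # 5
print(is_efficient(("c",), transactions, min_utility_table))       # False
```

  • With these made-up numbers, the item set (a, b) is compared against the threshold of a rather than a single global threshold, which is the point of the per-item table.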
  • FIG. 1 is a schematic diagram of the hardware structure of a server according to an embodiment of the present application.
  • FIG. 2 is a flowchart of a pattern mining method according to an embodiment of the present application.
  • FIG. 3 is a flowchart of a method for determining the period value of a candidate pattern according to an embodiment of the present application.
  • FIG. 4 is a flowchart of a method for acquiring a candidate pattern set according to an embodiment of the present application.
  • FIG. 5 is a flowchart of a method for generating a k-th layer candidate pattern set according to an embodiment of the present application.
  • FIG. 6 is a flowchart of another method for generating a k-th layer candidate pattern set according to an embodiment of the present application.
  • FIG. 7 is a flowchart of another method for acquiring a candidate pattern set according to an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of a pattern mining device according to an embodiment of the present application.
  • FIG. 9 is a flowchart of an efficient item set mining method according to an embodiment of the present application.
  • FIG. 10 is a flowchart of a method for determining an item set utility value corresponding to an item set according to an embodiment of the present application
  • FIG. 11 is a flowchart of a method for constructing an MIU tree according to an embodiment of the present application.
  • FIG. 12 is a schematic structural diagram of an MIU tree.
  • FIG. 13 is a schematic diagram of the utility list corresponding to each item set of the first level in the MIU tree.
  • FIG. 14 is a schematic diagram of one combination of utility lists.
  • FIG. 15 is a schematic diagram of another combination of utility lists.
  • FIG. 16 is a schematic diagram of a further combination of utility lists.
  • FIG. 17 is a structural block diagram of an efficient item set mining device according to an embodiment of the present application.
  • FIG. 18 is a structural block diagram of an item set utility value determining module according to an embodiment of the present application.
  • FIG. 19 is a structural block diagram of a utility list construction unit according to an embodiment of the present application.
  • FIG. 20 is a block diagram showing the hardware structure of a data processing device according to an embodiment of the present application.
  • A merchandise sales record records the contents of customers' purchase lists, where a purchase list includes information about the purchased merchandise, such as product names and quantities. By finding combinations of products with high sales or profits in the purchase lists, a seller can use the identified combinations to adjust its sales strategy and increase profits.
  • Abstracted into a pattern mining model, a purchased product corresponds to an item and a purchase list corresponds to a transaction. All purchase lists are stored in a transaction database, which includes one or more transactions, each including at least one item.
  • An item set is generated from the items included in the transactions, and pattern mining extracts the qualifying patterns from the item sets.
  • This application combines period and utility values and proposes a period-based high-utility pattern mining method. The method obtains a set of candidate patterns and calculates the utility value of each pattern in each transaction.
  • A transaction in which the utility value does not reach the set utility threshold contributes little to the total utility value, so such a transaction can be deleted to avoid wasting mining computation time. The time attributes of the remaining transactions are then used to calculate the period value of the candidate pattern.
  • A candidate pattern whose period value is less than or equal to the set period threshold is retained as a mining result. Such a pattern has a relatively high utility value in every period, which facilitates quick decision making.
  • The period value of a pattern is determined according to the time attributes of the specified transactions that contain the pattern. Specifically, among those transactions, the maximum of the time differences between adjacent transactions is determined as the period value of the pattern.
  • The specified transactions may be any designated transactions that contain the pattern, or some of the transactions containing the pattern, selected according to certain conditions.
  • The pattern mining method provided by the embodiments of the present application is implemented by a server, so the server is introduced first.
  • The server may be a processing device such as a desktop or notebook computer.
  • FIG. 1 is a schematic diagram of the hardware structure of the server according to an embodiment of the present application. As shown in FIG. 1, the server may include:
  • a processor 1, a communication interface 2, a memory 3, a communication bus 4, and a display screen 5, where the processor 1, the communication interface 2, the memory 3, and the display screen 5 communicate with one another via the communication bus 4.
  • Next, the pattern mining method of the present application is introduced in combination with the server hardware structure.
  • FIG. 2 is a flowchart of a pattern mining method according to an embodiment of the present application. As shown in FIG. 2, the method is applied to a server and includes:
  • Step S200: Acquire, according to each transaction included in the transaction database, a candidate pattern set that satisfies the set condition.
  • Each transaction includes at least one item; each candidate pattern in the candidate pattern set includes items from at least one item set; the item set is a set generated from the items in each transaction.
  • The transaction database is scanned to acquire the set of candidate patterns that satisfy the set condition.
  • The set condition may include a condition that limits the magnitude of the utility value of a candidate pattern.
  • The specific magnitude of the utility value is not limited in the embodiments of the present application.
  • The transaction database may be stored in the memory 3 in advance through the communication interface 2.
  • The set condition is input through the communication interface 2, and the processor 1 queries, via the communication bus 4, the database stored in the memory 3 for the candidate pattern set that satisfies the set condition.
  • Optionally, the communication interface 2 may be an interface of a communication module, such as an interface of a GSM module. Optionally, the processor 1 may be a central processing unit (CPU), an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application.
  • Step S210: For each candidate pattern in the candidate pattern set, calculate the utility value of the candidate pattern in each transaction.
  • For example, the transaction database contains three transactions, (2a, 3b, c), (a, 2b, 3d), and (b, 3c, 4d), where a, b, c, and d are four items.
  • The number before an item in a transaction indicates how many of that item the transaction contains.
  • For example, the transaction (2a, 3b, c) contains 2 of item a, 3 of item b, and 1 of item c.
  • Scanning the transaction database determines the transactions that contain the candidate pattern, here the two transactions whose items are (a, b, c) and (a, b, d), and the utility value of the candidate pattern is calculated in each of these two transactions.
  • For a transaction in the transaction database that does not contain the candidate pattern, the utility value of the candidate pattern in that transaction is 0.
  • The processor 1 may calculate the utility value of the candidate pattern in each transaction.
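  • The calculation in step S210 can be sketched as follows for the three example transactions. This is an illustrative sketch under stated assumptions: the utility of an item in a transaction is taken to be its quantity multiplied by a per-item unit utility, which this passage does not specify, and the unit utilities, the pattern (a, b), and the function name are hypothetical.

```python
# The three example transactions, written as item -> quantity.
db = [
    {"a": 2, "b": 3, "c": 1},  # (2a, 3b, c)
    {"a": 1, "b": 2, "d": 3},  # (a, 2b, 3d)
    {"b": 1, "c": 3, "d": 4},  # (b, 3c, 4d)
]

# Hypothetical unit utilities (for example, profit per unit); an assumption.
unit_utility = {"a": 5, "b": 2, "c": 1, "d": 3}


def pattern_utility(pattern, transaction, unit_utility):
    """Utility of a pattern in one transaction; 0 if the pattern is not contained."""
    if not all(item in transaction for item in pattern):
        return 0
    return sum(transaction[item] * unit_utility[item] for item in pattern)


# Utility of the hypothetical candidate pattern (a, b) in each transaction.
utilities = [pattern_utility(("a", "b"), t, unit_utility) for t in db]
print(utilities)  # [16, 9, 0]
```

  • The third transaction does not contain item a, so the pattern's utility there is 0, as the text describes.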
  • Step S220: Determine the target transactions in which the utility value reaches the set utility threshold.
  • The user can preset, as needed, the utility threshold of a pattern in each transaction and the period threshold of a pattern.
  • The processor 1 may compare the utility value in each transaction with the set utility threshold to determine the target transactions in which the utility value reaches the set utility threshold.
  • Step S230: Determine the period value of the candidate pattern according to the time attribute of each target transaction.
  • Each transaction in the transaction database has a time attribute.
  • The length of the transaction database can be defined as the number of transactions it contains, and the time difference between any two adjacent transactions is the same; for example, the time difference between two adjacent transactions is 1. If the database contains five transactions A, B, C, D, and E, the length of the database is 5, the time difference between transaction A and transaction B is 1, and the time difference between transaction A and transaction D is 3.
  • The period value of the candidate pattern is determined according to the time attributes of the target transactions. Continuing the above example, suppose that the target transactions of candidate pattern 1 are A, C, and E. The period value of candidate pattern 1 is the maximum of the time differences between adjacent target transactions among these three. The time difference between A and C is 2 and that between C and E is 2, so the period value of candidate pattern 1 is 2.
  • The processor 1 may determine the period value of the candidate pattern based on the time attribute of each target transaction.
  • Step S240: If the period value of the candidate pattern is less than or equal to the set period threshold, determine the candidate pattern as a mining result.
  • The processor 1 may compare the period value of each candidate pattern with the set period threshold, determine the candidate patterns whose period value is less than or equal to the set period threshold as mining results, and output them for display on the display screen 5.
  • The pattern mining method provided by the embodiments of the present application calculates, for each candidate pattern in the acquired candidate pattern set, its utility value in each transaction, and deletes the transactions whose utility value is less than the set utility threshold. Because the pattern's utility value in these transactions is too small, deleting them reduces the mining computation time. The period value of the candidate pattern is then determined according to the time attributes of the target transactions remaining after the deletion, and the candidate pattern is determined as a mining result when its period value is less than or equal to the set period threshold. This ensures that the utility values of the mined patterns are evenly distributed over time, making it easier to make accurate decisions.
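  • Steps S220 to S240 can be sketched end to end as below. This is a simplified illustration, not the application's implementation: transaction times are assumed to be 1, 2, 3, ... (adjacent transactions differ by 1), the period is taken as the largest difference between adjacent target transactions as in the A, C, E example, and a pattern with fewer than two target transactions is given period 0, which is an assumption of this sketch. The function name and the sample utilities are hypothetical.

```python
def mine_patterns(pattern_utilities, utility_threshold, period_threshold):
    """pattern_utilities: pattern -> list of utilities, one per transaction in time order."""
    results = []
    for pattern, utilities in pattern_utilities.items():
        # S220: keep the target transactions whose utility reaches the threshold.
        targets = [time for time, u in enumerate(utilities, start=1)
                   if u >= utility_threshold]
        # S230: period = largest time difference between adjacent target transactions.
        gaps = [later - earlier for earlier, later in zip(targets, targets[1:])]
        period = max(gaps, default=0)  # assumption: no gap means period 0
        # S240: report the pattern if its period does not exceed the threshold.
        if targets and period <= period_threshold:
            results.append(pattern)
    return results


# Hypothetical per-transaction utilities for two candidate patterns over A..E.
pattern_utilities = {
    "p1": [10, 1, 12, 2, 9],  # target transactions A, C, E -> period 2
    "p2": [10, 0, 0, 0, 8],   # target transactions A, E    -> period 4
}
print(mine_patterns(pattern_utilities, utility_threshold=5, period_threshold=2))  # ['p1']
```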
  • FIG. 3 is a flowchart of a method for determining the period value of a candidate pattern according to an embodiment of the present application. As shown in FIG. 3, the method includes:
  • Step S300: Calculate the time differences between adjacent target transactions according to the time attribute of each target transaction.
  • Each target transaction has a time attribute, and the time difference between two adjacent target transactions is calculated from these time attributes.
  • The specific calculation is as follows. The transactions in the database are sorted in chronological order, and the target transactions are examined in that order. If there is no other target transaction before a target transaction, the time difference between the target transaction and the first transaction in the database is calculated; if there is no other target transaction after it, the time difference between the last transaction in the database and the target transaction is calculated; if there is another target transaction before it, the time difference between the target transaction and the preceding adjacent target transaction is calculated.
  • For example, the transaction database contains five transactions A, B, C, D, and E, of which the target transactions are B and C.
  • For target transaction B, there is no other target transaction before it, so the time difference between B and the first transaction A in the database is calculated as 1. For target transaction C, there is no other target transaction after it, so the time difference between C and the last transaction E in the database is calculated as 2; there is also a target transaction B before C, and the time difference between these two target transactions is calculated as 1.
  • Step S310: Determine the maximum of the time differences as the period value of the candidate pattern.
  • In the above example, the time differences are 1, 2, and 1, of which the maximum is 2, so the period value of the candidate pattern is determined to be 2.
  • In other words, for the transactions that contain the candidate pattern, after the transactions in which the pattern's utility value is less than the set utility threshold are deleted, the maximum of the time differences of the remaining transactions is used as the period value of the candidate pattern.
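  • The calculation of steps S300 and S310, including the differences to the first and last transactions in the database, can be sketched as follows. Times are assumed to be integers with A = 1 through E = 5, and the function name is hypothetical.

```python
def period_value(target_times, first_time, last_time):
    """Return (period, differences) for the target transactions of one pattern.

    The differences are: first target minus the first transaction in the database,
    each target minus the preceding target, and the last transaction in the
    database minus the last target.
    """
    diffs = []
    previous = first_time
    for time in sorted(target_times):
        diffs.append(time - previous)
        previous = time
    diffs.append(last_time - previous)
    return max(diffs), diffs


# Database A..E at times 1..5; target transactions B and C are at times 2 and 3.
period, diffs = period_value([2, 3], first_time=1, last_time=5)
print(diffs)   # [1, 1, 2]
print(period)  # 2
```

  • The same three differences as in the worked example are produced (in chronological order), and the largest, 2, is the period value.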
  • Referring to FIG. 4, the process of step S200 described above is introduced. The process includes:
  • Step S400: Scan each transaction in the database, acquire each item for which the sum of the utility values over the transactions reaches the set extended utility threshold, and compose the first-layer candidate pattern set HTWUSPI_1 from the acquired items.
  • Step S410: When scanning the transaction database, record the transactions in which each item in the item set is located and the utility value of each transaction.
  • While step S400 is performed, the transactions in which each item in the item set is located and the utility value of each transaction can be recorded at the same time.
  • Specifically, the transaction numbers of the transactions in which an item is located may be recorded, together with the utility value of the transaction corresponding to each transaction number, where the utility value of a transaction is the sum of the utility values of the items that the transaction contains.
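  • A single scan covering steps S400 and S410 might look like the sketch below. It rests on several assumptions that the text does not confirm: an item's utility in a transaction is quantity times a hypothetical unit utility, and the sum compared with the extended utility threshold is the total utility of the transactions that contain the item. The function name and all numeric values are illustrative only.

```python
from collections import defaultdict


def first_layer(db, unit_utility, extended_threshold):
    """Scan the database once; return (layer-1 items, item -> tids, tid -> utility)."""
    item_tids = defaultdict(set)  # item -> numbers of the transactions containing it
    tx_utility = {}               # transaction number -> utility of the transaction
    for tid, transaction in db.items():
        # Utility of a transaction: sum of the utilities of the items it contains.
        tx_utility[tid] = sum(qty * unit_utility[item]
                              for item, qty in transaction.items())
        for item in transaction:
            item_tids[item].add(tid)
    # Keep items whose containing transactions sum to at least the threshold.
    layer1 = sorted(item for item, tids in item_tids.items()
                    if sum(tx_utility[t] for t in tids) >= extended_threshold)
    return layer1, dict(item_tids), tx_utility


db = {
    1: {"a": 2, "b": 3, "c": 1},
    2: {"a": 1, "b": 2, "d": 3},
    3: {"b": 1, "c": 3, "d": 4},
}
unit_utility = {"a": 5, "b": 2, "c": 1, "d": 3}  # hypothetical
layer1, item_tids, tx_utility = first_layer(db, unit_utility, extended_threshold=35)
print(tx_utility)  # {1: 17, 2: 18, 3: 17}
print(layer1)      # ['a', 'b', 'd']
```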
  • Step S420: Using the Apriori_gen function and HTWUSPI_1, generate the k-th layer candidate pattern set HTWUSPI_k layer by layer until HTWUSPI_{k+1} is empty; the final candidate pattern set is composed of HTWUSPI_1 through HTWUSPI_k.
  • The Apriori_gen function is the candidate generation function of the Apriori algorithm, with which candidate pattern sets can be generated layer by layer.
  • When the k-th layer candidate pattern set HTWUSPI_k is generated, pairs of candidate patterns in the (k-1)-th layer candidate pattern set HTWUSPI_{k-1} that meet the conditions are combined to generate it.
  • Referring to FIG. 5, the process includes:
  • Step S500: Combine the candidate patterns in HTWUSPI_{k-1} in pairs to obtain a plurality of candidate pattern pairs.
  • Step S510: From the candidate pattern pairs, select the pairs whose two candidate patterns contain k-2 identical items.
  • That is, if the two candidate patterns in a pair contain k-2 identical items, the pair is selected.
  • Step S520: Combine the two candidate patterns of each selected pair to obtain a preliminary candidate pattern.
  • Step S530: For each preliminary candidate pattern, determine the transactions in which each item included in the preliminary candidate pattern is located, determine the intersection of these transactions, and determine the transactions in the intersection as the transactions in which the preliminary candidate pattern is located.
  • Specifically, the transactions in which each item included in the preliminary candidate pattern is located may be determined, and the intersection of the transactions of the items may be taken.
  • The transactions in the intersection are the transactions in which the preliminary candidate pattern is located.
  • Step S540: When at least the sum of the utility values of the transactions in which the preliminary candidate pattern is located reaches the extended utility threshold, add the preliminary candidate pattern to HTWUSPI_k.
  • Specifically, the sum of the utility values of the transactions in which the preliminary candidate pattern is located may be determined, and the preliminary candidate patterns for which at least this sum reaches the extended utility threshold may then be identified.
  • The identified preliminary candidate patterns are added to HTWUSPI_k.
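  • Steps S500 to S540 can be sketched as one function that builds layer k from layer k-1. This is a simplified illustration rather than the Apriori_gen function itself: it applies only the join and the transaction-utility check described in the steps above. Patterns are represented as frozensets, and the per-item transaction numbers, transaction utilities, threshold, and function name are hypothetical.

```python
from itertools import combinations


def next_layer(prev_layer, item_tids, tx_utility, extended_threshold):
    """Generate the layer-k candidate patterns from the layer-(k-1) patterns."""
    if not prev_layer:
        return set()
    k = len(prev_layer[0]) + 1
    layer = set()
    # S500: form all pairs of (k-1)-item candidate patterns.
    for first, second in combinations(prev_layer, 2):
        # S510: keep pairs that share exactly k-2 items.
        if len(first & second) != k - 2:
            continue
        # S520: combine the pair into a preliminary k-item candidate pattern.
        candidate = first | second
        # S530: transactions of the candidate = intersection of its items' transactions.
        tids = set.intersection(*(item_tids[item] for item in candidate))
        # S540: keep the candidate if its transactions' utilities reach the threshold.
        if sum(tx_utility[t] for t in tids) >= extended_threshold:
            layer.add(candidate)
    return layer


# Hypothetical records from the database scan.
item_tids = {"a": {1, 2}, "b": {1, 2, 3}, "d": {2, 3}}
tx_utility = {1: 17, 2: 18, 3: 17}
layer1 = [frozenset("a"), frozenset("b"), frozenset("d")]

layer2 = next_layer(layer1, item_tids, tx_utility, extended_threshold=35)
print(sorted(sorted(p) for p in layer2))  # [['a', 'b'], ['b', 'd']]
layer3 = next_layer(sorted(layer2, key=sorted), item_tids, tx_utility, 35)
print(layer3)  # set()
```

  • Generation stops here because the third layer is empty, matching the stopping rule of generating layers until the next layer is empty.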
  • the embodiment of the present application proposes a pruning strategy for the generation process of the TWUSPI k , which can reduce the generation of the candidate mode whose period value does not satisfy the set period threshold.
  • the process includes:
  • Step S600: combine the candidate modes in HTWUSPI k-1 in pairs to obtain a plurality of candidate mode pairs.
  • Step S610: among the plurality of candidate mode pairs, select the candidate mode pairs that include k-2 identical items.
  • That is, if the two candidate modes of a pair contain k-2 identical items, the candidate mode pair is selected.
  • Step S620: merge each selected candidate mode pair to obtain a preliminary candidate mode.
  • Step S630: for each preliminary candidate mode, determine the transactions in which each item included in the preliminary candidate mode is located, determine the intersection of the transactions of the items, and take the transactions in the intersection as the transactions in which the preliminary candidate mode is located.
  • Step S640: calculate the sum of the utility values of the transactions in which the preliminary candidate mode is located.
  • The sum of the utility values of the transactions in which the preliminary candidate mode is located can then be determined.
  • Step S650: determine the period value of the preliminary candidate mode according to the time attribute of each transaction in which the preliminary candidate mode is located.
  • Specifically, the time difference between each two adjacent transactions is calculated according to the time attribute of each transaction, and the maximum of the calculated time differences is determined as the period value of the preliminary candidate mode.
  • Step S660: when the sum of the utility values of the transactions in which the preliminary candidate mode is located reaches the extended utility threshold, and the period value of the preliminary candidate mode is less than or equal to the set period threshold, add the preliminary candidate mode to HTWUSPI k.
  • This embodiment further adds a period threshold check when generating HTWUSPI k and filters out the preliminary candidate modes whose period value does not satisfy the period threshold, thereby reducing the number of subsequent database scans and the pattern mining time.
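  • Steps S600 to S660 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the application: the function names, the dictionary-based data layout and the toy utility values and time attributes are all hypothetical, and the period value uses only the gaps between adjacent transactions.

```python
from itertools import combinations


def period_value(times):
    """Largest gap between adjacent transaction times (0 if fewer than two)."""
    ordered = sorted(times)
    if len(ordered) < 2:
        return 0
    return max(later - earlier for earlier, later in zip(ordered, ordered[1:]))


def generate_level_k(prev_level, item_tids, tx_utility, tx_time,
                     ext_utility_threshold, period_threshold):
    """Build HTWUSPI_k from HTWUSPI_{k-1} (a set of frozensets of items)."""
    if not prev_level:
        return set()
    k = len(next(iter(prev_level))) + 1
    result = set()
    for p, q in combinations(prev_level, 2):          # S600: pair up candidates
        if len(p & q) != k - 2:                       # S610: need k-2 shared items
            continue
        merged = p | q                                # S620: preliminary candidate
        # S630: transactions containing every item of the preliminary candidate
        tids = set.intersection(*(set(item_tids[i]) for i in merged))
        total = sum(tx_utility[t] for t in tids)      # S640: summed utility
        period = period_value(tx_time[t] for t in tids)  # S650: period value
        # S660: keep the candidate only if both conditions hold
        if total >= ext_utility_threshold and period <= period_threshold:
            result.add(frozenset(merged))
    return result


# Hypothetical toy data: item -> transactions, transaction -> utility / time.
ITEM_TIDS = {"a": {1, 2, 3}, "b": {1, 3}, "c": {2, 3, 4}}
TX_UTILITY = {1: 10, 2: 8, 3: 12, 4: 5}
TX_TIME = {1: 1, 2: 2, 3: 3, 4: 4}
LEVEL_1 = {frozenset("a"), frozenset("b"), frozenset("c")}

LEVEL_2 = generate_level_k(LEVEL_1, ITEM_TIDS, TX_UTILITY, TX_TIME,
                           ext_utility_threshold=15, period_threshold=1)
```

  • With these toy values, [a,b] is pruned because its period value (2) exceeds the period threshold, [b,c] is pruned because its summed utility (12) is below the extended utility threshold, and only [a,c] remains in the second layer.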
  • Another pruning strategy is proposed, which reduces the generation of candidate modes whose utility value does not reach the set utility threshold.
  • The process of acquiring a candidate mode set that satisfies the setting condition under this pruning strategy is described below. Referring to Figure 7, the process includes:
  • Step S700: scan each transaction in the database, acquire the items whose utility value in each transaction reaches a set extended utility threshold, and form the first-layer candidate mode set HTWUSPI 1 from the acquired items.
  • The extended utility threshold is greater than or equal to the utility threshold.
  • Step S710: when scanning the database, record the transactions in which each item in the item set is located, and the utility value of each transaction.
  • Step S720: determine the inefficient transactions, whose utility value is less than the utility threshold, and delete the inefficient transactions from the recorded transactions of each item.
  • Step S730: use the Apriori_gen function and HTWUSPI 1 to generate the k-th layer candidate mode set HTWUSPI k layer by layer until HTWUSPI k+1 is empty; the final candidate mode set is composed of HTWUSPI 1 to HTWUSPI k.
  • This embodiment adds the deletion of inefficient transactions, so that the recorded transactions of each item in the item set do not include inefficient transactions.
  • The generation of candidate modes whose utility value does not reach the set utility threshold is thereby avoided, which reduces the number of subsequent database scans and the pattern mining time.
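  • The recording and deletion steps S710 and S720 can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the item lists reuse the four transactions of the example in this description, and the transaction utility values and the utility threshold are invented for the illustration.

```python
def record_item_transactions(transactions):
    """S710: map each item to the set of transactions in which it is located."""
    item_tids = {}
    for tid, items in transactions.items():
        for item in items:
            item_tids.setdefault(item, set()).add(tid)
    return item_tids


def drop_inefficient(item_tids, tx_utility, utility_threshold):
    """S720: remove transactions whose utility is below the utility threshold."""
    inefficient = {tid for tid, u in tx_utility.items() if u < utility_threshold}
    return {item: tids - inefficient for item, tids in item_tids.items()}


# Items per transaction (quantities omitted for brevity).
TRANSACTIONS = {
    1: ["a", "b", "c", "d", "f"],
    2: ["a", "c", "d", "e"],
    3: ["a", "d", "f", "h"],
    4: ["c", "e", "g", "h"],
}
# Hypothetical transaction utility values and utility threshold Y.
TX_UTILITY = {1: 30, 2: 25, 3: 8, 4: 20}
Y = 10

ITEM_TIDS = record_item_transactions(TRANSACTIONS)
PRUNED = drop_inefficient(ITEM_TIDS, TX_UTILITY, Y)
```

  • Under these assumed values, transaction 3 is inefficient, so item a is recorded in transactions 1 and 2 only, and any candidate mode generated later can no longer be counted in transaction 3.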
  • The database contains the following transactions: transaction 1 (2a, b, c, d, 2f), transaction 2 (a, c, d, 3e), transaction 3 (a, d, f, h), transaction 4 (c, e, g, h); the user sets the utility threshold Y, the extended utility threshold M and the period threshold T.
  • the pattern mining process is as follows:
  • the HTWUSPI 1 satisfying the condition includes [a, b, c, d].
  • For each preliminary candidate mode, determine the transactions in which each item included in the preliminary candidate mode is located, determine the intersection of the transactions of the items, and take the transactions in the intersection as the transactions in which the preliminary candidate mode is located.
  • For example, the transactions in which [c,d] is located include: transaction 1, transaction 2.
  • the preliminary candidate mode is added to HTWUSPI 2;
  • the specific generation process can refer to the generation process of the HTWUSPI 2 , and details are not described herein again.
  • the generated HTWUSPI 1 -HTWUSPI 3 is used as a candidate mode set.
  • HTWUSPI 1 includes: {[a], [b], [c], [d]};
  • HTWUSPI 2 includes: {[a,b], [a,c], [a,d]};
  • HTWUSPI 3 includes: {[a, c, d]}.
  • [a, c, d] has a utility value of X11 in transaction 1, and a utility value of X21 in transaction 2. If it is determined that both X11 and X21 are greater than or equal to Y, transaction 1 and transaction 2 are determined as target transactions. According to the time attribute of the target transaction, the process of determining the period value of [a, c, d] can refer to the related introduction above, and the period value is 2.
  • the candidate mode is determined as a mining result.
  • FIG. 8 is a schematic structural diagram of a mode mining device according to an embodiment of the present application. As shown in FIG. 8, the device includes:
  • a candidate mode set obtaining unit 810, configured to acquire, according to each transaction included in the database, a candidate mode set that satisfies a setting condition, where each transaction includes at least one item; each candidate mode in the candidate mode set includes at least one item in an item set; the item set is a set generated from the items in each transaction;
  • a utility value calculation unit 820, configured to calculate, for each candidate mode in the candidate mode set, the utility value of the candidate mode in each transaction.
  • the target transaction determining unit 830 is configured to determine a target transaction in which the utility value reaches a set utility threshold.
  • the candidate mode period value determining unit 840 is configured to determine a period value of the candidate mode according to a time attribute of each of the target transactions.
  • the mining result determining unit 850 is configured to determine the candidate mode as the mining result if the period value of the candidate mode is less than or equal to the set period threshold.
  • the candidate mode period value determining unit 840 may include:
  • the time difference calculation unit is configured to calculate a time difference value of the adjacent two target transactions according to the time attribute of each target transaction.
  • a maximum time difference selecting unit configured to determine a maximum time difference value among the time difference values as a period value of the candidate mode.
  • the time difference calculation unit may include:
  • a first time difference calculation subunit, configured to, for each target transaction sorted sequentially in the database, calculate the time difference between the target transaction and the first transaction in the database if there is no other target transaction before the target transaction.
  • a second time difference calculation subunit configured to calculate a time difference between the last transaction in the database and the target transaction if there is no other target transaction after the target transaction.
  • a third time difference calculation subunit configured to calculate a time difference between the target transaction and the previous adjacent target transaction if there are other target transactions before the target transaction.
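  • The three time difference cases and the selection of the maximum can be sketched as follows. The function name and signature are hypothetical; treating an empty list of target transactions as one gap spanning the whole database is an assumption, since the description does not cover that case. The usage lines assume that the time attribute of transaction n in the four-transaction example is simply n.

```python
def period_value(target_times, first_time, last_time):
    """Period value of a candidate mode from the times of its target transactions.

    first_time and last_time are the time attributes of the first and last
    transactions in the database.
    """
    times = sorted(target_times)
    if not times:
        # Assumption: no target transaction means a single gap over the database.
        return last_time - first_time
    # First case: gap between the first database transaction and the first target.
    gaps = [times[0] - first_time]
    # Third case: gaps between each target and the previous adjacent target.
    gaps += [later - earlier for earlier, later in zip(times, times[1:])]
    # Second case: gap between the last target and the last database transaction.
    gaps.append(last_time - times[-1])
    return max(gaps)


# Candidate mode [a, c, d] with target transactions 1 and 2 in a database of
# four transactions whose time attributes are assumed to be 1, 2, 3 and 4.
PERIOD_ACD = period_value([1, 2], first_time=1, last_time=4)
```

  • Under that assumption the gaps are 0, 1 and 2, so the period value is 2, which agrees with the value stated for [a, c, d] in the example above.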
  • the foregoing candidate mode set obtaining unit 810 may include:
  • a first-layer candidate mode set obtaining unit, configured to scan each transaction in the database, acquire the items whose utility value in each transaction reaches a set extended utility threshold, and form the first-layer candidate mode set HTWUSPI 1 from the acquired items, wherein the extended utility threshold is greater than or equal to the utility threshold.
  • a transaction recording unit, configured to record, when scanning the database, the transactions in which each item in the item set is located, and the utility value of each transaction.
  • a k-th layer candidate mode set generating unit, configured to generate the k-th layer candidate mode set HTWUSPI k layer by layer until HTWUSPI k+1 is empty, the final candidate mode set being composed of HTWUSPI 1 to HTWUSPI k.
  • the foregoing k-th layer candidate mode set generating unit may include:
  • a candidate mode pairwise combination unit, configured to combine the candidate modes in HTWUSPI k-1 in pairs to obtain a plurality of candidate mode pairs.
  • a candidate mode pair selection unit is configured to select candidate mode pairs including k-2 identical items among the plurality of candidate mode pairs.
  • the candidate mode pair merging unit is used to combine the selected candidate mode pairs to obtain a preliminary candidate mode.
  • a preliminary candidate mode transaction determining unit, configured to determine, for each preliminary candidate mode, the transactions in which each item included in the preliminary candidate mode is located, determine the intersection of the transactions of the items, and take the transactions in the intersection as the transactions in which the preliminary candidate mode is located.
  • The foregoing unit for adding a preliminary candidate mode to the set may include:
  • a first subunit for adding a preliminary candidate mode to the set, configured to calculate the sum of the utility values of the transactions in which the preliminary candidate mode is located.
  • a second subunit for adding a preliminary candidate mode to the set, configured to determine the period value of the preliminary candidate mode according to the time attribute of each transaction in which the preliminary candidate mode is located.
  • a third subunit for adding a preliminary candidate mode to the set, configured to add the preliminary candidate mode to HTWUSPI k when the sum of the utility values of the transactions in which the preliminary candidate mode is located reaches the extended utility threshold and the period value of the preliminary candidate mode is less than or equal to the set period threshold.
  • the candidate mode set obtaining unit 810 may further include:
  • an inefficient transaction deletion unit, configured to, after the transaction recording unit has recorded the transactions, determine the inefficient transactions whose utility value is less than the utility threshold, and delete the inefficient transactions from the transactions of each item recorded by the transaction recording unit.
  • the embodiment of the present application further provides a mode mining device, where the device includes:
  • a memory for storing a computer program
  • the embodiment of the present application further provides a storage medium for storing program code, and the program code is used to execute the foregoing mode mining method.
  • Embodiments of the present application also provide a computer program product comprising instructions that, when run on a computer, cause the computer to perform the pattern mining method described above.
  • High Utility Itemset Mining (HUIM) uses the external utility values of items (such as profit values) and their internal utility values (such as the number of occurrences in a transaction; in a trading scenario this can be the purchase quantity) to calculate the item set utility value of an item set in the database. When the item set utility value of an item set is greater than or equal to the user-defined minimum utility threshold, the item set is considered to be an efficient item set.
  • An existing efficient item set mining method sets a unique fixed minimum utility threshold as the measure of an efficient item set; that is, after the item set utility value of each item set is calculated, it is compared with the unique fixed minimum utility threshold, and the item sets whose item set utility value is greater than or equal to that threshold are taken as the efficient item sets.
  • the data items included in one item set are often one or more, and the minimum utility thresholds corresponding to different data items are often different, which may result in different minimum utility thresholds for different item sets.
  • The embodiment of the present application provides an efficient item set mining method, which addresses the shortcomings of existing efficient item set mining and improves the accuracy of the mined efficient item sets.
  • Transaction: a record in the transaction database; for example, if a transaction database of the trading type records the purchase records of commodities, each transaction in the transaction database may correspond to one purchase record.
  • Transaction number (TID): the number that distinguishes the different transactions in the transaction database; in general, transactions are numbered in chronological order.
  • Data item: an information item recorded in a transaction, where a transaction contains at least one data item; for example, in a transaction database of the trading type, each transaction contains the data items of the commodities purchased and the internal utility value of each commodity (for example, the purchase quantity; the purchase quantity is one form of the internal utility value in the trading scenario, and in transaction databases of other scenarios the form of the internal utility value can be adjusted accordingly).
  • For example, a transaction database of the trading type contains 10 transactions; each transaction represents one purchase record and contains the data items of the commodity names purchased and the purchase quantity of each commodity in the transaction (a form of internal utility value).
  • T4: A:1, C:3, D:1, E:2
  • T5: B:1, D:3, E:2
  • T6: B:2, D:2
  • T7: B:3, C:2, D:1, E:1
  • T8: A:2, C:3
  • T9: C:2, D:2, E:1
  • T10: A:2, C:2, D:1
  • The data items in a transaction can be commodity names, and the internal utility value can be the purchase quantity of each commodity in the transaction.
  • The transaction database contains five data items: A, B, C, D and E.
  • The actual meaning of transaction T1 can be: a purchase record indicating the purchase of 1 A product, 2 C products and 3 D products; the actual meaning of transaction T7 can be: a purchase record indicating the purchase of 3 B products, 2 C products, 1 D product and 1 E product.
  • In other fields, such as news, each transaction in Table 4 can contain at least one news item and can record the interest value, sensitivity, freshness, etc. of each news item; in fields such as stocks, each transaction in Table 4 can contain at least one stock and can record the risk, the income, etc. of each stock.
  • Item set: a set of at least one data item, used to represent an association rule inherent in the transaction database; the difference between a transaction and an item set is that a transaction is usually generated by an actual event and recorded in the transaction database.
  • k-item set: an item set containing k data items; for example, a 1-item set is an item set containing one data item, such as item set A containing only data item A; a 2-item set is an item set containing two data items, such as item set AB containing only data items A and B, and so on.
  • External utility value table (such as a Profit Table): a table for recording the unit external utility value corresponding to each data item in the transaction database; in a transaction database of the trading type, the profit table is one form of the external utility value table, that is, the external utility value table can record the unit profit value of each data item in the transaction database; see the profit table shown in Table 5:
  • The profit table represents the unit profit obtained by selling one commodity. For example, selling one commodity A yields a profit of 6 yuan, and selling one commodity B yields a profit of 12 yuan; it can be seen that the external utility value table represents the unit external utility value corresponding to each data item.
  • Utility of an item in a transaction: the utility value of a data item in a transaction, which is the internal utility value of the data item in the transaction multiplied by the unit external utility value of the data item.
  • Itemset utility in database: the utility value of an item set in the transaction database, that is, the sum of the utility values of the item set in each transaction that contains all data items of the item set.
  • Minimum utility threshold table (MMU table): a table, defined in the embodiment of the present application, that indicates the minimum utility threshold corresponding to each data item; Table 6 shows a minimum utility threshold table in the form of an MMU table. The minimum utility threshold of each data item defined in the table is not fixed but can be set by the user according to the actual situation of each data item; for example, the minimum utility threshold of each data item can be updated according to the price fluctuation of the commodity.
  • The minimum utility thresholds corresponding to different data items may differ (as shown in Table 6), so the item set minimum utility thresholds corresponding to different item sets may also differ; therefore, to address the low accuracy that results in the prior art from applying a fixed unique minimum utility threshold to different item sets, the embodiment of the present application may determine, based on the data items contained in an item set, an item set minimum utility threshold that matches the item set;
  • Specifically, the embodiment of the present application may determine the data item with the smallest minimum utility threshold in the item set and use the minimum utility threshold of that data item as the item set minimum utility threshold, thereby obtaining the item set minimum utility threshold corresponding to each item set and providing a basis for mining efficient item sets with higher accuracy.
  • For example, item set AB includes data item A and data item B, of which data item A has the smaller minimum utility threshold, so the minimum utility threshold of data item A is the item set minimum utility threshold of item set AB, that is, the item set minimum utility threshold of item set AB is 56; similarly, the item set minimum utility threshold of item set BC is the minimum utility threshold of data item C, namely 53.
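  • The rule for the item set minimum utility threshold can be written as a one-line function. In this sketch the function name and the dictionary layout are hypothetical; the thresholds for A, B, C and D are the values implied by the examples in this description (56, 65, 53 and 50), and data item E is left out because its threshold is not quoted here.

```python
# Minimum utility thresholds per data item, as implied by the examples in the text.
MMU = {"A": 56, "B": 65, "C": 53, "D": 50}


def itemset_min_threshold(itemset, mmu):
    """Item set minimum utility threshold: the smallest threshold of its data items."""
    return min(mmu[item] for item in itemset)


THRESHOLD_AB = itemset_min_threshold({"A", "B"}, MMU)
THRESHOLD_BC = itemset_min_threshold({"B", "C"}, MMU)
```

  • This reproduces the values given above: 56 for item set AB and 53 for item set BC.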
  • LMU Least Minimum Utility Value
  • High Utility Itemset: when the item set utility value of an item set ≥ the item set minimum utility threshold of the item set, the item set is an efficient item set; for example, the item set utility value of item set A is 48, which is less than the item set minimum utility threshold of item set A, so item set A is not an efficient item set; the item set utility value of item set AD is 90 and the item set minimum utility threshold of item set AD is 50, so item set AD is an efficient item set.
  • TWU Transaction Weighted Utility
  • High Transaction Weighted Utilization Itemset: when the TWU of an item set ≥ the item set minimum utility threshold of the item set, the item set is a high transaction weighted utility item set; for example, the transaction weighted utility of item set B is 178 and the item set minimum utility threshold of item set B is 65; since the transaction weighted utility of item set B is greater than its item set minimum utility threshold, item set B is determined to be a high transaction weighted utility item set.
  • The efficient item set mining method provided by the embodiment of the present application is described below. The method can be applied to a device with data processing capability, such as a data processing server on the network side.
  • The mining of efficient item sets may also be performed on a device on the user side, such as a computer. As shown in FIG. 9, the method includes:
  • Step S900: determine the item set utility value corresponding to each item set in the transaction database.
  • The item set utility value corresponding to an item set represents the sum of the utility values of the item set in each target transaction corresponding to the item set, where the target transactions of an item set are the transactions that include all data items of the item set. The utility value of an item set in a target transaction represents the sum of the utility values, in that target transaction, of the data items of the item set.
  • The transaction database may include at least one transaction; one transaction may record at least one data item and the internal utility value corresponding to each data item, and an item set may include at least one data item;
  • The utility value of a data item in a transaction represents the product of the internal utility value of the data item in the transaction and the unit external utility value corresponding to the data item; the unit external utility value corresponding to each data item is determined by the predefined external utility value table, which records the unit external utility value corresponding to each data item.
  • For example, in a trading scenario, the embodiment of the present application may preset a profit value table (a form of the external utility value table) and record the unit profit value of each commodity in it (the commodity is a form of the data item, and the unit profit value is a form of the unit external utility value); that is, the utility value of a commodity in a purchase transaction is the product of the purchase quantity of the commodity in the transaction (the purchase quantity is a form of the internal utility value) and the unit profit value of the commodity.
  • Step S910: determine the item set minimum utility threshold corresponding to each item set according to the predefined minimum utility threshold table.
  • The predefined minimum utility threshold table records the minimum utility threshold corresponding to each data item, and the item set minimum utility threshold corresponding to an item set represents the smallest of the minimum utility thresholds corresponding to the data items included in the item set.
  • Step S920: compare the item set utility value of each item set with the corresponding item set minimum utility threshold, and determine the efficient item sets according to the comparison result, where the item set utility value of an efficient item set is not less than the corresponding item set minimum utility threshold.
  • The embodiment of the present application defines a minimum utility threshold table in which the minimum utility threshold corresponding to each data item is recorded. When the item set minimum utility threshold corresponding to each item set is determined, the smallest of the minimum utility thresholds corresponding to the data items included in the item set is used, so that the determined item set minimum utility threshold is closer to the actual minimum utility of the item set. Based on the determined item set minimum utility threshold of each item set, the item set utility value of each item set is compared with the corresponding item set minimum utility threshold, thereby determining the efficient item sets whose item set utility value is not less than the corresponding item set minimum utility threshold and realizing the mining of efficient item sets.
  • The efficient item set mining method provided by the embodiment of the present application does not use a unique fixed minimum utility threshold as the mining standard for efficient item sets; instead, it uses the smallest of the minimum utility thresholds corresponding to the data items included in each item set as the item set minimum utility threshold of that item set, so that the determined threshold is closer to the minimum utility of the item set. Comparing the item set utility value of each item set with its corresponding item set minimum utility threshold makes the mining result more accurate; the embodiment of the present application thus improves the accuracy of efficient item set mining.
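  • The comparison of steps S910 and S920 can be sketched as follows, assuming the item set utility values of step S900 have already been computed. The function name and data layout are hypothetical; the utility values (48, 90, 108 and 79) and the thresholds for A, B, C and D are the figures quoted in the examples of this description.

```python
# Minimum utility thresholds per data item, as implied by the examples in the text.
MMU = {"A": 56, "B": 65, "C": 53, "D": 50}

# Item set utility values quoted in the text (result of step S900).
ITEMSET_UTILITY = {
    ("A",): 48,
    ("A", "D"): 90,
    ("B",): 108,
    ("B", "C"): 79,
}


def mine_efficient_itemsets(itemset_utility, mmu):
    """Return the item sets whose utility is not less than their own threshold."""
    efficient = []
    for itemset, utility in itemset_utility.items():
        threshold = min(mmu[item] for item in itemset)   # S910
        if utility >= threshold:                         # S920
            efficient.append(itemset)
    return efficient


EFFICIENT = mine_efficient_itemsets(ITEMSET_UTILITY, MMU)
```

  • With these figures, item set A (48 against a threshold of 56) is rejected, while item sets AD, B and BC are kept. Each item set is judged against its own threshold rather than against a single global value.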
  • Table 7 below shows the efficient item sets whose item set utility value is not less than the item set minimum utility threshold:
  • Table 7 Efficient item set with item set utility value not less than the item set minimum utility threshold
  • The item set utility value corresponding to each item set in the transaction database may be determined as follows: for each item set, first determine the target transactions in the transaction database, that is, the transactions that include all data items of the item set; then determine the utility values of all data items of the item set in the determined target transactions, and add the determined utility values to obtain the item set utility value of the item set.
  • For example, for item set B, the transactions T3, T5, T6 and T7 containing data item B can be determined; the utility value of item set B is 3×12 in transaction T3, 1×12 in transaction T5, 2×12 in transaction T6 and 3×12 in transaction T7, and summing the determined utility values gives an item set utility value of 108.
  • For item set BC, the transactions T3 and T7 containing data items B and C can be determined; the utility value of item set BC is 3×12+5×1 in transaction T3 and 3×12+2×1 in transaction T7, and summing the determined utility values gives an item set utility value of 79.
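  • The calculation above can be reproduced with a short function. This is a sketch with a hypothetical function name and data layout. Only the four transactions that contain data item B are included, transaction T3 is given with just the B and C quantities quoted above, and the unit profit of 1 for C is the value implied by the terms 5×1 and 2×1.

```python
# Unit profit (external utility) of the data items needed for the example.
PROFIT = {"B": 12, "C": 1}

# Purchase quantities (internal utility) of the transactions that contain B.
# T3 lists only the B and C quantities quoted in the text.
TRANSACTIONS = {
    "T3": {"B": 3, "C": 5},
    "T5": {"B": 1, "D": 3, "E": 2},
    "T6": {"B": 2, "D": 2},
    "T7": {"B": 3, "C": 2, "D": 1, "E": 1},
}


def itemset_utility(itemset, transactions, profit):
    """Sum, over the target transactions, of quantity x unit profit per item."""
    total = 0
    for quantities in transactions.values():
        # A target transaction contains every data item of the item set.
        if all(item in quantities for item in itemset):
            total += sum(quantities[item] * profit[item] for item in itemset)
    return total


UTILITY_B = itemset_utility(("B",), TRANSACTIONS, PROFIT)
UTILITY_BC = itemset_utility(("B", "C"), TRANSACTIONS, PROFIT)
```

  • The two results match the values derived in the text: 108 for item set B and 79 for item set BC.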
  • the process for determining the utility value of the item set corresponding to each set in the transaction database may include:
  • Step S1000: recursively construct the utility list corresponding to each item set according to the internal utility value of each data item recorded in each transaction and the unit external utility value of each data item recorded in the predefined external utility value table.
  • The utility list corresponding to an item set represents a series of tuples describing the item set in the transactions of the database in which it is located (that is, the target transactions of the item set).
  • The utility list corresponding to an item set may record the transaction number of each target transaction corresponding to the item set, the utility value of the item set in each target transaction, and the remaining utility value of the item set in each target transaction.
  • The remaining utility value of an item set in a transaction is defined as follows: the data items in the transaction are sorted in ascending order of minimum utility threshold, and the remaining utility value is the sum of the utility values of the data items sorted after (to the right of) the data items of the item set in the transaction.
  • Step S1010: calculate the item set utility value of each item set according to the utility list corresponding to the item set.
  • Since the utility list corresponding to each item set records the utility value of the item set in each target transaction, the sum of the utility values recorded for the target transactions can be used as the item set utility value of the item set.
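  • A utility list for each 1-item set, holding a transaction number, a utility value and a remaining utility value per entry, can be sketched as follows. The function names and the data layout are hypothetical, and the items x, y, z with their profits, thresholds and quantities are invented toy data, not values from the tables of this description.

```python
def build_utility_lists(transactions, profit, mmu):
    """Return {item: [(tid, iu, ru), ...]} for every 1-item set."""
    lists = {}
    for tid, quantities in transactions.items():
        # Sort the data items of the transaction by ascending minimum utility threshold.
        ordered = sorted(quantities, key=lambda item: mmu[item])
        utilities = [quantities[item] * profit[item] for item in ordered]
        for pos, item in enumerate(ordered):
            iu = utilities[pos]               # utility of the item in this transaction
            ru = sum(utilities[pos + 1:])     # utility of the items sorted after it
            lists.setdefault(item, []).append((tid, iu, ru))
    return lists


def utility_from_list(utility_list):
    """Step S1010: the item set utility is the sum of the iu fields."""
    return sum(iu for _, iu, _ in utility_list)


# Hypothetical toy data.
PROFIT = {"x": 2, "y": 5, "z": 1}
MMU = {"x": 10, "y": 30, "z": 20}          # ascending order: x, z, y
TRANSACTIONS = {
    1: {"x": 3, "y": 1},
    2: {"y": 2, "z": 4},
    3: {"x": 1, "y": 1, "z": 2},
}

UTILITY_LISTS = build_utility_lists(TRANSACTIONS, PROFIT, MMU)
```

  • In transaction 3 of this toy data, the sorted order is x, z, y with utilities 2, 2 and 5, so x gets the entry (3, 2, 7), z gets (3, 2, 5) and y gets (3, 5, 0).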
  • The utility list corresponding to each item set may be constructed recursively, level by level; the level number of an item set corresponds to the number of data items included in the item set, that is, an item set of the first level contains only one data item, an item set of the second level contains only two data items, and so on; the utility list corresponding to an item set of the next level can be constructed from the utility lists of at least two item sets of the upper level that can be combined into the item set.
  • To this end, a minimum utility threshold enumeration tree (MIU tree) may be constructed first. The MIU tree can be regarded as an extended version of the regular enumeration tree: it contains item sets arranged by level, the level number of an item set in the MIU tree corresponds to the number of data items the item set contains, and the item sets of each level are sorted in ascending order of minimum utility threshold.
  • The utility list corresponding to each item set of the MIU tree may then be constructed based on the internal utility value of each data item recorded in each transaction and the unit external utility value of each data item, and the utility list corresponding to an item set of the next level can be constructed from the utility lists of at least two upper-level item sets that can be combined into the item set.
  • When constructing the MIU tree, the item sets of the individual data items in the transaction database are determined first and arranged in the first level of the MIU tree, constructing the item sets located at the first level; then, in a depth-first search manner, starting from each item set of the first level, the item sets of the following levels are constructed, such that the level number of an item set in the MIU tree corresponds to the number of data items the item set contains, forming the MIU tree.
  • the item sets may be randomly ordered or sorted in order of lowest utility threshold from small to large.
  • A method for constructing the MIU tree is illustrated below; the method may include:
  • Step S1100: determine the item sets of the individual data items contained in the transaction database, and arrange the determined item sets in the first level of the MIU tree in ascending order of minimum utility threshold, constructing the item sets of the first level of the MIU tree.
  • When constructing the MIU tree, the item sets of the individual data items in the transaction database, that is, the 1-item sets, can be determined first; the determined item sets are then arranged in ascending order of minimum utility threshold at the first level of the MIU tree, constructing the item sets of the first level.
  • Step S1110: in a depth-first search manner, starting from each item set of the first level of the MIU tree, construct the item sets of the following levels, such that the level number of an item set in the MIU tree corresponds to the number of data items the item set contains and the item sets of each level are sorted in ascending order of minimum utility threshold, forming the MIU tree.
  • a hierarchical item set of the MIU tree can be constructed in a depth-first search manner.
  • Figure 12 shows the corresponding MIU tree structure.
  • The transaction database contains the item sets of the individual data items A, B, C, D and E; as shown in Table 6, sorting item sets A, B, C, D and E in ascending order of minimum utility threshold gives D, C, A, B, E, so D, C, A, B and E are arranged in that order in the first level of the MIU tree. After the item sets of the first level are constructed, starting from item set D, the item sets DC, DA, DB and DE corresponding to item set D in the second level are constructed and sorted in ascending order of minimum utility threshold; the third-level item sets DCA, DCB, DCE, DAB and DAE are then constructed and sorted, and the item sets corresponding to DCA in the following levels are constructed, down to item set DCABE; the process then returns to item set DA to construct its corresponding next-level item sets, and so on, proceeding from the first level of the MIU tree to build the item sets level by level.
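  • The depth-first construction can be sketched as a generator that visits the item sets of the tree in depth-first order. This is an illustration under assumptions: the function name is hypothetical, the tree is only enumerated rather than stored, and the first level is limited to the data items A, B, C and D whose thresholds are implied by the examples in this description (the threshold of E is not quoted here).

```python
# Minimum utility thresholds per data item, as implied by the examples in the text.
MMU = {"A": 56, "B": 65, "C": 53, "D": 50}


def enumerate_miu_tree(ordered_items, prefix=()):
    """Yield the item sets of the tree depth-first.

    A child extends its parent with an item that comes later in the ordering,
    so the level of a node equals the number of data items it contains.
    """
    for idx, item in enumerate(ordered_items):
        node = prefix + (item,)
        yield node
        yield from enumerate_miu_tree(ordered_items[idx + 1:], node)


# S1100: first level, sorted by ascending minimum utility threshold.
FIRST_LEVEL = sorted(MMU, key=MMU.get)

# S1110: depth-first construction of the following levels.
NODES = ["".join(node) for node in enumerate_miu_tree(FIRST_LEVEL)]
```

  • For the four items the visit order starts with D, DC, DCA, DCAB, DCB, DA, which mirrors the description: the subtree under DC is completed before the construction returns to DA.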
  • The utility value of each item set in its corresponding target transactions may be calculated in turn, and the remaining utility value of each item set in its corresponding target transactions may be determined.
  • For the item sets of the first level, the transaction number of each target transaction corresponding to the first item set, the utility value of the item set in each corresponding target transaction, and the remaining utility value of the item set in each corresponding target transaction may be determined first and recorded in a table; the other item sets of the first level are processed in the same way, giving the utility list corresponding to each item set of the first level.
  • Figure 13 shows the utility lists corresponding to the item sets of the first level in the MIU tree, where tid represents the transaction number, iu represents the utility value and ru represents the remaining utility value.
  • The utility list of an item set at the next level of the MIU tree can be constructed from the utility lists of at least two upper-level item sets that can be combined into that item set.
  • These may be at least two upper-level item sets that combine directly into the item set, or at least two upper-level item sets that combine into the item set once duplicate data items are removed.
  • For example, the utility list of the second-level item set DC can be formed by combining the utility lists of the first-level item sets D and C, as shown in the corresponding figure.
  • The target transactions of the 2-item set DC are the transactions in which the 1-item sets D and C both occur, that is, {T1, T4, T7, T9, T10}.
  • The remaining utility value of DC in each target transaction is taken directly from the remaining utility value, in that transaction, of C, which is the later-sorted of the item sets D and C.
  • In general, to construct the utility list of a second-level item set: determine the two first-level item sets that can be combined into it; take the target transactions common to the two item sets as its target transactions; take the sum of the utility values of the two item sets in a common target transaction as its utility value in that transaction; and take the remaining utility value of the later-sorted of the two item sets in that transaction as its remaining utility value.
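The second-level combination rule can be illustrated with a short Python sketch. The function name `join_first_level` and the sample utility lists are hypothetical; the second argument is assumed to be the later-sorted of the two 1-item sets.

```python
def join_first_level(ul_first, ul_second):
    """Combine two 1-item-set utility lists into a 2-item-set utility list.

    Each list holds (tid, iu, ru) tuples; ul_second belongs to the item set
    sorted later, so its ru is carried over.
    """
    second_by_tid = {tid: (iu, ru) for tid, iu, ru in ul_second}
    combined = []
    for tid, iu, _ in ul_first:
        if tid in second_by_tid:  # keep only the common target transactions
            iu_second, ru_second = second_by_tid[tid]
            combined.append((tid, iu + iu_second, ru_second))
    return combined


# Hypothetical utility lists for D and C.
ul_d = [("T1", 6, 7), ("T2", 2, 6)]
ul_c = [("T1", 3, 4)]
ul_dc = join_first_level(ul_d, ul_c)
print(ul_dc)  # [('T1', 9, 4)]
```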
  • The utility list of the third-level item set DCA can be formed by combining the utility lists of the second-level item sets DC and DA, as shown in the corresponding figure.
  • For an item set whose level is not lower than three, the construction of the utility list differs from that of a second-level item set.
  • The utility value of the item set in each target transaction is the sum of the utility values, in that transaction, of the two upper-level item sets that can be combined into it, minus the utility value of the item set's prefix data item in that transaction.
  • For example, the value 15 is the utility value 11 of DC in transaction T10, plus the utility value 21 of DA in transaction T10, minus the utility value in T10 of D, the prefix data item of item set DCA.
  • The remaining utility value of the item set in each target transaction is that of the later-sorted of the two item sets.
  • In general, to construct the utility list of an item set whose level is not lower than three: determine the two item sets in the level above that can be combined into it; take the target transactions common to the two item sets as its target transactions; take the sum of the utility values of the two item sets in a common target transaction, minus the utility value of the item set's prefix data item in that transaction, as its utility value in that transaction; and take the remaining utility value of the later-sorted of the two item sets in that transaction as its remaining utility value. This yields the utility list of the item set.
  • The pseudo code for constructing the utility list of the item sets at each level may be as follows.
  • Input X, an itemset; X.UL is the utility-list of X; Xab.UL, Xa.UL, Xb.UL, And Xa ⁇ Xb.//Input: item set X; X corresponding utility list; Xab corresponding utility list; Xa corresponding utility list; Xb corresponding utility list, Xa, Xb are both subsets of X, and Xa ⁇ Xb
  • the TID in the 2-utility list is the TID corresponding to the data item Ea
  • iu is the sum of the utility values of the data item Ea and the data item Eb
  • ru is the remaining utility value corresponding to the data item Eb.
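The construction rule can be sketched in Python as below. This is an illustrative reading of the description, not the patent's pseudo code: the function name `construct`, the optional `ul_x` argument for the shared prefix, and the sample numbers are all assumptions.

```python
def construct(ul_xa, ul_xb, ul_x=None):
    """Build the utility list of Xab from those of Xa and Xb.

    Each list holds (tid, iu, ru) tuples. ul_x is the utility list of the
    shared prefix X; pass None when Xa and Xb are 1-item sets (no prefix).
    Xb is assumed to be the later-sorted item set, so its ru is kept.
    """
    xb_by_tid = {tid: (iu, ru) for tid, iu, ru in ul_xb}
    prefix_iu = {tid: iu for tid, iu, _ in ul_x} if ul_x is not None else {}
    result = []
    for tid, iu_a, _ in ul_xa:
        if tid not in xb_by_tid:
            continue  # not a common target transaction
        iu_b, ru_b = xb_by_tid[tid]
        # The prefix is counted in both Xa and Xb, so subtract it once.
        result.append((tid, iu_a + iu_b - prefix_iu.get(tid, 0), ru_b))
    return result


# Hypothetical lists: in T1 the utilities are D=6, C=3, A=4.
ul_d = [("T1", 6, 7)]
ul_dc = [("T1", 9, 4)]
ul_da = [("T1", 10, 0)]
ul_dca = construct(ul_dc, ul_da, ul_d)
print(ul_dca)  # [('T1', 13, 0)]
```

With `ul_x=None` the same function covers the second-level case, where nothing is subtracted.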
  • Alternatively, for an item set whose level is not lower than three: determine at least two upper-level item sets that can be combined directly into it; take the target transactions common to these item sets as its target transactions; take the sum of the utility values of these item sets in a common target transaction as its utility value in that transaction; and take the remaining utility value, in that transaction, of the item set sorted last among them as its remaining utility value. This yields the utility list of the item set.
  • For example, the item set DCA can be composed of the item set DC and the item set A.
  • The utility value of DCA in each target transaction is then the sum of the utility values of DC and A in that transaction.
  • The remaining utility value of DCA in each target transaction is the remaining utility value, in that transaction, of A, which is sorted last among DC and A, as shown in FIG. 16.
  • The high-utility item sets are then mined according to the comparison result.
  • For the pseudo code used to mine high-utility item sets in the present application, refer to Algorithm 1 and Algorithm 2.
  • Line 1 initializes several variables and Line 2 calculates the LMU from the MMU table. Line 3 scans the original database to calculate the TWU value of each 1-item set. Line 4 finds the set HTWUI1 of high transaction-weighted utility 1-item sets according to the minimum utility threshold of each 1-item set in the MMU table; this step relies on the global downward closure (GDC) property. Line 5 sorts the item sets in HTWUI1 by minimum utility threshold in ascending order.
  • Line 6 generates the 1-utility lists from the high transaction-weighted utility 1-item sets. Line 7 then calls the mining function HUI-Search, which recursively generates a series of subsequent utility lists from the 1-utility lists and mines high-utility item sets from the generated utility lists.
  • In the embodiment of the present application, determining the item sets that contain one data item in the transaction database, sorting them in the first level of the MIU tree by minimum utility threshold, and constructing the first-level item sets of the MIU tree can be realized as follows: calculate the transaction-weighted utility (TWU) value of each item set containing one data item; determine, according to the minimum utility threshold of each such item set, the high transaction-weighted utility item sets among them; and sort the high transaction-weighted utility item sets by minimum utility threshold in ascending order.
  • Then, based on the utility lists of the item sets containing one data item, a series of subsequent utility lists is generated recursively, forming the utility list of each item set.
  • Pruning strategy 1: when traversing the MIU tree by depth-first search, if, according to the utility list, the TWU value of an item set X is less than the LMU value, then no superset of X is a high-utility item set.
  • Here, a superset of an item set is any item set that contains all of its data items. For item set A, for example, the supersets are all tree nodes of the MIU tree that contain A, not just the child nodes of A.
  • Pruning strategy 2: when traversing the MIU tree by depth-first search, if, according to the utility list, the sum of the utility values and remaining utility values of an item set X is less than the item-set minimum utility threshold of X, then none of the extension nodes of X (that is, its descendant nodes) is a high-utility item set, because their actual utility values will be less than MIU(X).
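The two pruning checks can be written as small predicates. The sketch below is illustrative only; the function names and the sample utility list are hypothetical, and a utility list is assumed to be a list of `(tid, iu, ru)` tuples.

```python
def prune_supersets(twu, lmu):
    """Pruning strategy 1: True when no superset of X can be high-utility."""
    return twu < lmu


def prune_extensions(utility_list, miu):
    """Pruning strategy 2: True when no extension of X can be high-utility.

    Sums iu + ru over every entry of the utility list of X and compares
    the total with MIU(X).
    """
    upper_bound = sum(iu + ru for _, iu, ru in utility_list)
    return upper_bound < miu


# Hypothetical utility list of some item set X.
ul_x = [("T1", 9, 4), ("T4", 5, 0)]
print(prune_extensions(ul_x, miu=20))  # True: 18 < 20, skip the descendants of X
print(prune_extensions(ul_x, miu=18))  # False: extensions must still be explored
```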
  • The utility lists of unpromising item sets are filtered out of the generated utility lists, and the utility lists of extensions are generated only from the utility lists of the remaining promising item sets.
  • As a result, the database needs to be scanned only once during mining, to generate the utility lists of the first-level item sets; the utility lists of other item sets are generated from the first-level utility lists when needed.
  • This not only reduces the number of database scans but also speeds up mining by narrowing the range of data to be mined, saving computing resources.
  • The embodiment of the present application also proposes two properties: the global downward closure (GDC) property and the conditional downward closure (CDC) property.
  • The embodiment may further use the EUCP (Estimated Utility Co-occurrence Pruning) strategy, in which a constructed estimated utility co-occurrence structure (EUCS) table improves processing efficiency.
  • The EUCS table records each k-item set with k ≥ 2 together with its transaction-weighted utility upper limit; that is, the EUCS table may include the item sets of each level not lower than the second level and their transaction-weighted utility upper limits.
  • The transaction-weighted utility upper limit of a k-item set is the sum of the transaction utility upper limits of the transactions that contain the k-item set, and the transaction utility upper limit of a transaction is the sum of the utilities of the data items in that transaction.
  • The EUCS constructed from the example database is shown in Table 8 below. For instance, the TWU (transaction-weighted utility) value of item set BE is the sum of the transaction utilities of transactions T5 and T7, which is 95.
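A minimal sketch of building such a table is shown below, restricted to 2-item sets for brevity (the text allows any k ≥ 2). The function name `build_eucs` and the sample transactions are hypothetical and are not the data of Table 8.

```python
from itertools import combinations


def build_eucs(transactions):
    """Map each co-occurring pair of data items to its TWU upper limit.

    transactions: {tid: {item: utility of the item in that transaction}}
    The TWU of a pair is the sum of the transaction utilities of every
    transaction that contains both items.
    """
    eucs = {}
    for utilities in transactions.values():
        transaction_utility = sum(utilities.values())
        for pair in combinations(sorted(utilities), 2):
            eucs[pair] = eucs.get(pair, 0) + transaction_utility
    return eucs


# Hypothetical data, for illustration only.
transactions = {
    "T5": {"A": 5, "B": 10, "E": 5},
    "T7": {"B": 3, "E": 2},
}
eucs = build_eucs(transactions)
print(eucs[("B", "E")])  # 25: transaction utility 20 (T5) plus 5 (T7)
```

A pair whose TWU is below the applicable threshold can then be skipped without joining its utility lists.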
  • A superset of an item set is any item set that contains all of its data items. For item set A, for example, the supersets are all tree nodes of the MIU tree that contain A, not just the child nodes of A.
  • The embodiment of the present application can also infer that if an item set is an HTWUI (high transaction-weighted utility item set), then any subset of it (the item set contains all data items of the subset) is also an HTWUI; and if an item set is not an HTWUI, then no superset of it is an HTWUI.
  • The embodiment of the present application can sort the 1-item sets by minimum utility threshold in ascending order. For example, given the minimum utility thresholds in Table 4 for the 1-item candidate sets A, B, C, D, and E, ascending order yields the sorted 1-item sets D, C, A, B, E.
  • The 2-item sets are generated from the sorted 1-item sets by self-join, and the data items within each 2-item set are sorted in ascending order of the minimum utility thresholds of the 1-item sets in the MMU table.
  • In the self-join, each specified data item is combined with each data item arranged to its right. For example, with the sorted 1-item sets D, C, A, B, E, the 2-item sets generated for item set D are DC, DA, DB, and DE.
  • In this document, an extension of an item set is an item set generated by joining it with item sets to its right in the sorted order, whereas a superset has the conventional meaning: any item set that contains all data items of the item set.
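The self-join that produces the 2-item sets can be sketched in a few lines of Python. The function name `self_join` is a hypothetical choice; the input is the list of 1-item sets already sorted by minimum utility threshold.

```python
def self_join(sorted_items):
    """Join every data item with each data item sorted to its right."""
    return [
        left + right
        for index, left in enumerate(sorted_items)
        for right in sorted_items[index + 1:]
    ]


two_item_sets = self_join(["D", "C", "A", "B", "E"])
print(two_item_sets[:4])  # ['DC', 'DA', 'DB', 'DE']
```

Because each item is only joined with items to its right, the data items inside every generated 2-item set stay in threshold order.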
  • After the high-utility item sets are determined, they can be used to make recommendations when content is recommended to a user.
  • The technical solution provided by the embodiment of the present application processes the trade-type transaction databases that are common in everyday applications. By introducing an MMU table, it determines the MIU of each item set from the MMU table and compares the item-set utility value of each item set with the corresponding MIU to determine whether the item set is an HUI.
  • This addresses a problem of existing HUIM algorithms, which judge every item set by whether its item-set utility value exceeds a single minimum utility threshold and therefore mine HUIs inaccurately. Applying different HUI criteria to different item sets makes the mined HUIs more accurate, more credible, and more meaningful.
  • The embodiment of the present application further provides a high-utility item set mining apparatus. The apparatus described below and the high-utility item set mining method described above may be cross-referenced.
  • FIG. 17 shows a schematic structural diagram of the apparatus, which includes:
  • The item set utility value determining module 1700, configured to determine the item-set utility value of each item set in the transaction database. The item-set utility value of an item set is the sum of the utility values of the item set in each of its target transactions; a target transaction of an item set is a transaction that contains all data items of the item set; and the utility value of an item set in a target transaction is the sum of the utility values of its data items in that transaction.
  • The item set minimum utility threshold determining module 1710, configured to determine the item-set minimum utility threshold of each item set according to a predefined minimum utility threshold table. The table records the minimum utility threshold of each data item, and the item-set minimum utility threshold of an item set is the smallest of the minimum utility thresholds of the data items it contains.
  • The high-utility item set determining module 1720, configured to compare the item-set utility value of each item set with the corresponding item-set minimum utility threshold and determine the high-utility item sets according to the comparison result, where the item-set utility value of a high-utility item set is not less than the corresponding item-set minimum utility threshold.
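The three determinations made by modules 1700, 1710, and 1720 can be illustrated with a direct, unoptimized Python sketch that scans the transactions instead of using utility lists. All function names, the sample transactions, and the threshold values are hypothetical.

```python
def item_set_utility(item_set, transactions):
    """Sum the utility of the item set over its target transactions."""
    return sum(
        sum(utilities[item] for item in item_set)
        for utilities in transactions.values()
        if all(item in utilities for item in item_set)
    )


def item_set_miu(item_set, mmu_table):
    """Smallest minimum utility threshold among the contained data items."""
    return min(mmu_table[item] for item in item_set)


def is_high_utility(item_set, transactions, mmu_table):
    """High-utility means the utility is not less than the item-set MIU."""
    return item_set_utility(item_set, transactions) >= item_set_miu(item_set, mmu_table)


# Hypothetical data, for illustration only.
transactions = {
    "T1": {"A": 4, "C": 3, "D": 6},
    "T2": {"B": 5, "D": 2, "E": 1},
}
mmu_table = {"A": 15, "B": 20, "C": 12, "D": 8, "E": 25}

print(is_high_utility(("D", "C"), transactions, mmu_table))  # True: 9 >= min(8, 12)
print(is_high_utility(("C", "A"), transactions, mmu_table))  # False: 7 < min(12, 15)
```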
  • FIG. 18 shows an optional structure of the item set utility value determining module 1700.
  • The item set utility value determining module 1700 may include:
  • The utility list construction unit 1800, configured to recursively construct the utility list of each item set according to the external utility value of each data item in each transaction and the internal utility value of each data item recorded in the predefined minimum utility threshold table. The utility list of an item set records the transaction number of each target transaction of the item set, the utility value of the item set in each target transaction, and the remaining utility value of the item set in each target transaction.
  • The remaining utility value of an item set in a transaction is defined as follows: with the data items of the transaction sorted by minimum utility threshold in ascending order, and the data items of the item set removed from the transaction, it is the sum of the utility values of the data items sorted to the right of the item set in the transaction.
  • The item set utility value calculation unit 1810, configured to calculate the item-set utility value of each item set according to the utility list of that item set.
  • The utility list construction unit 1800 may be specifically configured to recursively construct the utility list of each item set level by level, where the level ordinal of an item set equals the number of data items it contains, and the utility list of a next-level item set is constructed from the utility lists of at least two upper-level item sets that can be combined into it.
  • FIG. 19 shows an optional structure of the utility list construction unit 1800.
  • The utility list construction unit 1800 may include:
  • The MIU tree construction subunit 1900, configured to construct a set-enumeration minimum utility threshold (MIU) tree. The MIU tree includes hierarchical item sets; the level ordinal of an item set in the MIU tree equals the number of data items it contains, and the item sets of each level are sorted by minimum utility threshold in ascending order.
  • The utility list construction execution subunit 1910, configured to construct the utility list of each item set of the MIU tree based on the external utility value of each data item in each transaction and the internal utility value of each data item, where the utility list of a next-level item set is constructed from the utility lists of at least two upper-level item sets that can be combined into it.
  • The MIU tree construction subunit 1900 may be specifically configured to: determine the item sets that contain one data item in the transaction database and sort them in the first level of the MIU tree by minimum utility threshold, constructing the first-level item sets of the MIU tree; and, starting from the first-level item sets and proceeding in a depth-first search manner, construct the hierarchical item sets so that the level ordinal of an item set in the MIU tree equals the number of data items it contains and the item sets of each level are sorted by minimum utility threshold in ascending order, forming the MIU tree.
  • The utility list construction execution subunit 1910 may be specifically configured to, when constructing the utility list of a second-level item set: determine the two first-level item sets that can be combined into the second-level item set; take the target transactions common to the two item sets as the target transactions of the second-level item set; take the sum of the utility values of the two item sets in a common target transaction as the utility value of the second-level item set in that transaction; and take the remaining utility value of the later-sorted of the two item sets in that transaction as the remaining utility value of the second-level item set in that transaction.
  • The utility list construction execution subunit 1910 may be further configured to, when constructing the utility list of an item set whose level is not lower than three: determine the two item sets in the level above that can be combined into the item set; take the target transactions common to the two item sets as the target transactions of the item set; take the sum of the utility values of the two item sets in a common target transaction, minus the utility value of the item set's prefix data item in that transaction, as the utility value of the item set in that transaction; and take the remaining utility value of the later-sorted of the two item sets in that transaction as the remaining utility value of the item set in that transaction.
  • When determining the item sets that contain one data item in the transaction database, sorting them in the first level of the MIU tree by minimum utility threshold, and constructing the first-level item sets of the MIU tree, the MIU tree construction subunit 1900 may be specifically configured to: calculate the transaction-weighted utility value of each item set containing one data item; determine, according to the minimum utility threshold of each such item set, the high transaction-weighted utility item sets among them; and sort the high transaction-weighted utility item sets by minimum utility threshold in ascending order.
  • When constructing the utility list of each item set of the MIU tree, the utility list construction execution subunit 1910 may be specifically configured to: construct the utility list of each item set containing one data item, and recursively generate a series of subsequent utility lists from those utility lists, forming the utility list of each item set.
  • The high-utility item set mining apparatus may be further configured to: when traversing the MIU tree in a depth-first search manner, if the transaction-weighted utility value of an item set is less than the smallest of the minimum utility thresholds, determine that no superset of the item set is a high-utility item set; and/or, when traversing the MIU tree in a depth-first search manner, if the sum of the utility values and remaining utility values of an item set is less than the item-set minimum utility threshold of the item set, determine that none of the extension nodes of the item set in the MIU tree is a high-utility item set.
  • The high-utility item set mining apparatus may be further configured to: obtain an EUCS table, which includes the item sets of each level not lower than the second level and their transaction-weighted utility upper limits; and, according to the EUCS table, filter out each item set of a level not lower than the second level whose transaction-weighted utility upper limit is less than the minimum utility threshold, together with its supersets.
  • The high-utility item set mining apparatus may be further configured to: if an item set is a high transaction-weighted utility item set, determine that any subset of the item set is also a high transaction-weighted utility item set, where the item set contains all data items of the subset; and if an item set is not a high transaction-weighted utility item set, determine that no superset of the item set is a high transaction-weighted utility item set.
  • The embodiment of the present application further provides a data processing device, which may include the high-utility item set mining apparatus described above.
  • The data processing device may include a processor 1, a communication interface 2, a memory 3, and a communication bus 4.
  • The processor 1, the communication interface 2, and the memory 3 communicate with one another via the communication bus 4.
  • The communication interface 2 may be an interface of a communication module, such as an interface of a GSM module.
  • The processor 1 is configured to execute a program.
  • The memory 3 is configured to store the program.
  • The program may include program code, and the program code includes computer operation instructions.
  • The processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application.
  • The memory 3 may include high-speed RAM and may also include non-volatile memory, such as at least one disk memory.
  • The program may be specifically configured to: determine the item-set utility value of each item set in the transaction database, where the item-set utility value of an item set is the sum of the utility values of the item set in each of its target transactions, a target transaction of an item set is a transaction that contains all data items of the item set, and the utility value of an item set in a target transaction is the sum of the utility values of its data items in that transaction;
  • determine the item-set minimum utility threshold of each item set according to a predefined minimum utility threshold table, where the table records the minimum utility threshold of each data item, and the item-set minimum utility threshold of an item set is the smallest of the minimum utility thresholds of the data items it contains.
  • The embodiment of the present application further provides a high-utility item set mining device, and the device includes:
  • a memory for storing a computer program
  • The embodiment of the present application further provides a storage medium for storing program code, where the program code is used to execute the high-utility item set mining method described above.
  • The embodiment of the present application further provides a computer program product including instructions which, when run on a computer, cause the computer to execute the high-utility item set mining method described above.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Business, Economics & Management (AREA)
  • Fuzzy Systems (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Computer Security & Cryptography (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Economics (AREA)
  • Development Economics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

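A pattern mining method, a high-utility item set mining method, and related devices. The pattern mining method and related devices combine two mining factors, temporal relationship and utility value, so that the utility values of the mined patterns are evenly distributed over time and the mining results are more precise. The high-utility item set mining method and related devices take the smallest of the minimum utility thresholds of the data items contained in each item set as that item set's item-set minimum utility threshold, so that the threshold determined for each item set more closely reflects the item set's minimum utility. The item-set utility value of each item set is then compared with the corresponding item-set minimum utility threshold, which improves the accuracy of high-utility item set mining.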

Description

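Pattern mining method, high-utility item set mining method, and related devices
This application claims priority to Chinese Patent Application No. 201610856770.5, entitled "Pattern mining method and apparatus", filed with the Chinese Patent Office on September 27, 2016, and to Chinese Patent Application No. 201610866557.2, entitled "High-utility item set mining method, apparatus, and data processing device", filed with the Chinese Patent Office on September 28, 2016, both of which are incorporated herein by reference in their entirety.
Technical Field
This application relates to the field of data mining technologies, and specifically to a pattern mining method and apparatus, and to a high-utility item set mining method, apparatus, and data processing device.
Background
In high-utility mining, a transaction database is a database that records transactions such as trades or news. A transaction database usually records at least one transaction, and each transaction includes at least one data item, that is, an item. For example, a trade-type transaction database may record at least one transaction about a trade record, and such a transaction may include the data item of at least one commodity (the data item of a commodity may correspond to the commodity name) and the traded quantity of each commodity. To characterize the association rules among the data items in a transaction database, at least one data item is grouped into an item set.
Because transaction databases of the trade type and similar types often reflect user preferences, the item sets to recommend to a user are often mined from the multiple item sets formed from the transaction database. During mining, item sets with high utility values (high-utility item sets for short) usually need to be considered.
A high-utility item set is an item set with a high utility value, and an item set often contains one or more data items. How to take the utility values of all data items in an item set into account, so as to improve the accuracy of the mined high-utility item sets, is therefore particularly important.
Summary
In view of this, this application provides a pattern mining method and apparatus, and a high-utility item set mining method, apparatus, and data processing device, to improve the accuracy of the mined high-utility item sets.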
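A first aspect of this application provides a pattern mining method, including:
obtaining, according to the transactions contained in a database, a candidate pattern set that satisfies a set condition, where each transaction includes at least one item; each candidate pattern in the candidate pattern set includes at least one item of an item set; and the item set is a set generated from the items in each transaction;
for each candidate pattern in the candidate pattern set, calculating the utility value of the candidate pattern in each transaction; determining the target transactions in which the utility value reaches a set utility threshold, and determining the period value of the candidate pattern according to the time attributes of the target transactions; and, if the period value of the candidate pattern is less than or equal to a set period threshold, determining the candidate pattern as a mining result.
A second aspect of this application provides a pattern mining apparatus, including:
a candidate pattern set obtaining unit, configured to obtain, according to the transactions contained in a database, a candidate pattern set that satisfies a set condition, where each transaction includes at least one item; each candidate pattern in the candidate pattern set includes at least one item of an item set; and the item set is a set generated from the items in each transaction; a utility value calculation unit, configured to calculate, for each candidate pattern in the candidate pattern set, the utility value of the candidate pattern in each transaction;
a target transaction determining unit, configured to determine the target transactions in which the utility value reaches a set utility threshold;
a candidate pattern period value determining unit, configured to determine the period value of the candidate pattern according to the time attributes of the target transactions;
a mining result determining unit, configured to determine the candidate pattern as a mining result if the period value of the candidate pattern is less than or equal to a set period threshold.
A third aspect of this application provides a pattern mining device, including:
a processor and a memory;
the memory is configured to store a computer program;
the processor is configured to read the computer program and execute the pattern mining method described above according to the executable instructions in the computer program.
A fourth aspect of this application provides a storage medium for storing program code, where the program code is used to execute the pattern mining method described above.
A fifth aspect of this application provides a computer program product including instructions which, when run on a computer, cause the computer to execute the pattern mining method described above.
With the pattern mining method and related devices provided in this application, the utility value of each candidate pattern in the obtained candidate pattern set is calculated in each transaction, and transactions in which the utility value is less than the set utility threshold are deleted. The pattern utility value of such transactions is too small, so deleting them reduces mining computation time. The period value of the candidate pattern is then determined according to the time attributes of the target transactions that remain after deletion, and the candidate pattern is determined as a mining result when the period value is less than or equal to the set period threshold. This ensures that the utility values of the mined patterns are evenly distributed over time, which facilitates precise decision-making and makes the mining results more precise.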
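A sixth aspect of this application provides a high-utility item set mining method, including:
determining the item-set utility value of each item set in a transaction database, where the item-set utility value of an item set is the sum of the utility values of the item set in each of its target transactions; a target transaction of an item set is a transaction that contains all data items of the item set; and the utility value of an item set in a target transaction is the sum of the utility values of its data items in that transaction;
determining the item-set minimum utility threshold of each item set according to a predefined minimum utility threshold table, where the table records the minimum utility threshold of each data item, and the item-set minimum utility threshold of an item set is the smallest of the minimum utility thresholds of the data items it contains;
comparing the item-set utility value of each item set with the corresponding item-set minimum utility threshold, and determining the high-utility item sets according to the comparison result, where the item-set utility value of a high-utility item set is not less than the corresponding item-set minimum utility threshold.
A seventh aspect of this application provides a high-utility item set mining apparatus, including:
an item set utility value determining module, configured to determine the item-set utility value of each item set in a transaction database, where the item-set utility value of an item set is the sum of the utility values of the item set in each of its target transactions; a target transaction of an item set is a transaction that contains all data items of the item set; and the utility value of an item set in a target transaction is the sum of the utility values of its data items in that transaction;
an item set minimum utility threshold determining module, configured to determine the item-set minimum utility threshold of each item set according to a predefined minimum utility threshold table, where the table records the minimum utility threshold of each data item, and the item-set minimum utility threshold of an item set is the smallest of the minimum utility thresholds of the data items it contains;
a high-utility item set determining module, configured to compare the item-set utility value of each item set with the corresponding item-set minimum utility threshold and determine the high-utility item sets according to the comparison result, where the item-set utility value of a high-utility item set is not less than the corresponding item-set minimum utility threshold.
An eighth aspect of this application provides a data processing device, including the high-utility item set mining apparatus of the seventh aspect.
A ninth aspect of this application provides a high-utility item set mining device, including:
a processor and a memory;
the memory is configured to store a computer program;
the processor is configured to read the computer program and execute the high-utility item set mining method described above according to the executable instructions in the computer program.
A tenth aspect of this application provides a storage medium for storing program code, where the program code is used to execute the high-utility item set mining method described above.
An eleventh aspect of this application further provides a computer program product including instructions which, when run on a computer, cause the computer to execute the high-utility item set mining method described above.
The high-utility item set mining method and related devices provided in this application define a minimum utility threshold table that records the minimum utility threshold of each data item. To determine the item-set minimum utility threshold of an item set, the minimum utility thresholds of the data items it contains are compared, and the smallest of them is taken as the item-set minimum utility threshold, so that the threshold determined for each item set more closely reflects the item set's minimum utility. Based on these thresholds, the item-set utility value of each item set is compared with the corresponding item-set minimum utility threshold, and the item sets whose item-set utility value is not less than the corresponding threshold are determined as high-utility item sets.
It can be seen that the high-utility item set mining method and related devices provided in this application do not use a single fixed minimum utility threshold as the criterion for mining high-utility item sets. Instead, the smallest of the minimum utility thresholds of the data items contained in each item set is taken as that item set's item-set minimum utility threshold, so that the threshold more closely reflects the item set's minimum utility. The item-set utility value of each item set is then compared with its item-set minimum utility threshold to mine the high-utility item sets, which improves the accuracy of high-utility item set mining.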
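Brief Description of the Drawings
FIG. 1 is a schematic diagram of a server hardware structure according to an embodiment of this application;
FIG. 2 is a flowchart of a pattern mining method according to an embodiment of this application;
FIG. 3 is a flowchart of a method for determining the period value of a candidate pattern according to an embodiment of this application;
FIG. 4 is a flowchart of a method for obtaining a candidate pattern set according to an embodiment of this application;
FIG. 5 is a flowchart of a method for generating the k-th level candidate pattern set according to an embodiment of this application;
FIG. 6 is a flowchart of a method for generating the k-th level candidate pattern set according to an embodiment of this application;
FIG. 7 is a flowchart of a method for obtaining a candidate pattern set according to an embodiment of this application;
FIG. 8 is a schematic structural diagram of a pattern mining apparatus according to an embodiment of this application;
FIG. 9 is a flowchart of a high-utility item set mining method according to an embodiment of this application;
FIG. 10 is a flowchart of a method for determining the item-set utility value of an item set according to an embodiment of this application;
FIG. 11 is a flowchart of a method for constructing an MIU tree according to an embodiment of this application;
FIG. 12 is a schematic structural diagram of an MIU tree;
FIG. 13 is a schematic diagram of the utility lists of the first-level item sets in the MIU tree;
FIG. 14 is a schematic diagram of a combination of utility lists;
FIG. 15 is a schematic diagram of another combination of utility lists;
FIG. 16 is a schematic diagram of yet another combination of utility lists;
FIG. 17 is a structural block diagram of a high-utility item set mining apparatus according to an embodiment of this application;
FIG. 18 is a structural block diagram of an item set utility value determining module according to an embodiment of this application;
FIG. 19 is a structural block diagram of a utility list construction unit according to an embodiment of this application;
FIG. 20 is a block diagram of the hardware structure of a data processing device according to an embodiment of this application.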
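Detailed Description
The technical solutions in the embodiments of this application are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of this application; all other embodiments obtained by a person of ordinary skill in the art based on them without creative effort fall within the protection scope of this application.
Before the solutions of this application are introduced, the concept of pattern mining is briefly described.
Take the commodity sales records of a supermarket as an example. The sales records store the contents of customers' purchase lists, and a purchase list includes information about the purchased commodities, such as commodity names and quantities. High-utility pattern mining finds, from these purchase lists, the commodity combinations with high sales or profit, which can then be used to change the sales strategy and increase profit. Abstracted into a pattern mining model, purchased commodities correspond to items and purchase lists correspond to transactions. All purchase lists are stored in a transaction database, so the transaction database includes one or more transactions, a transaction includes at least one item, and item sets are generated from the items included in the transactions. Pattern mining aims to mine the items that satisfy given conditions from the item sets.
This application combines period and utility value and proposes a period-based high-utility pattern mining method. For the initially obtained candidate pattern set, the utility value of each candidate pattern in each transaction is calculated. Transactions in which the utility value does not reach the set utility threshold contribute very little to the total utility value, so they can be deleted to avoid wasting mining computation time. The time attributes of the remaining transactions are then used to calculate the period value of the candidate pattern, and candidate patterns whose period value is less than or equal to the set period threshold are kept as mining results. Such patterns have a high utility value in every period, which supports fast decision-making.
The period value of a pattern is determined according to the time attributes of the specified transactions that contain the pattern. Specifically, among the specified transactions that contain the pattern, the largest of the time differences between adjacent transactions is determined as the period value of the pattern. The specified transactions may be all transactions that contain the pattern, or a subset of those transactions selected according to a certain condition.
The pattern mining method provided in the embodiments of this application is implemented on a server, so the server is introduced first. The server may be a processing device such as a desktop or notebook computer. FIG. 1 is a schematic diagram of the server hardware structure according to an embodiment of this application. As shown in FIG. 1, the server may include:
a processor 1, a communication interface 2, a memory 3, a communication bus 4, and a display screen 5.
The processor 1, the communication interface 2, the memory 3, and the display screen 5 communicate with one another via the communication bus 4.
The pattern mining method of this application is described below with reference to the server hardware structure.
FIG. 2 is a flowchart of the pattern mining method according to an embodiment of this application. As shown in FIG. 2, the method is applied to a server and includes:
Step S200: obtain, according to the transactions contained in the transaction database, a candidate pattern set that satisfies a set condition.
Each transaction includes at least one item; each candidate pattern in the candidate pattern set includes at least one item of an item set; and the item set is a set generated from the items in each transaction.
The transaction database is scanned with the set condition to obtain the candidate pattern set that satisfies it. The set condition may include a condition that limits the utility value of a candidate pattern; the embodiments of this application do not limit the specific utility value.
When implemented on the server, the transaction database may be stored in the memory 3 in advance through the communication interface 2. During mining, the set condition is input through the communication interface 2, and the processor 1 queries, via the communication bus 4, the database stored in the memory for the candidate pattern set that satisfies the set condition.
Optionally, the communication interface 2 may be an interface of a communication module, such as an interface of a GSM module. Optionally, the processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of this application.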
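Step S210: for each candidate pattern in the candidate pattern set, calculate the utility value of the candidate pattern in each transaction.
Specifically, by scanning the transaction database, the transactions that contain the candidate pattern can be determined, and the utility value of the candidate pattern in each such transaction can be calculated.
For example, suppose the transaction database contains three transactions, (2a,3b,c), (a,2b,3d), and (b,3c,4d), where a, b, c, and d are four items. The number before each item indicates how many of that item the transaction contains; for example, the transaction (2a,3b,c) contains 2 of item a, 3 of item b, and 1 of item c.
Suppose a candidate pattern is [a,b]. Scanning the transaction database shows that the transactions containing this candidate pattern are (2a,3b,c) and (a,2b,3d), and the utility value of the candidate pattern is calculated in each of these two transactions. For a transaction in the database that does not contain the candidate pattern, the utility value of the candidate pattern in that transaction is 0. In a specific implementation, the processor 1 may calculate the utility value of the candidate pattern in each transaction.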
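The calculation in this example can be sketched in Python. The item counts are taken from the three transactions above, but the per-unit utilities in `unit_utility` are hypothetical values chosen only for illustration, and the function name is an assumption.

```python
def pattern_utility(pattern, transaction, unit_utility):
    """Utility of a pattern in one transaction; 0 if any item is missing.

    transaction: {item: count of that item in the transaction}
    unit_utility: {item: utility of a single unit of the item}
    """
    if not all(item in transaction for item in pattern):
        return 0
    return sum(transaction[item] * unit_utility[item] for item in pattern)


database = [
    {"a": 2, "b": 3, "c": 1},  # (2a,3b,c)
    {"a": 1, "b": 2, "d": 3},  # (a,2b,3d)
    {"b": 1, "c": 3, "d": 4},  # (b,3c,4d)
]
unit_utility = {"a": 5, "b": 2, "c": 1, "d": 3}  # hypothetical values

utilities = [pattern_utility(["a", "b"], t, unit_utility) for t in database]
print(utilities)  # [16, 9, 0]
```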
步骤S220,确定所述效用值达到设定的效用阀值的目标事务。
在本申请实施例中,用户可以根据需要预先设定模式在每个事务中的效用阀值,以及模式的周期阀值。通过S210得到候选模式在每一事务中的效用值,就能够确定出效用值达到设定的效用阀值的目标事务。在具体实施时,可以由处理器1比较各事务的效用值与设定的效用阀值的大小关系,从而确定出所述效用值达到设定的效用阀值的目标事务。
步骤S230,根据各所述目标事务的时间属性,确定所述候选模式的周期值。
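Specifically, every transaction in the transaction database has a time attribute. For abstract analysis, the time length of the transaction database can be defined as the number of transactions it contains, with the same time difference between every two adjacent transactions; for example, the time difference between two adjacent transactions is recorded as 1. If the database contains five transactions A, B, C, D and E, the length of the database is 5, the time difference between transaction A and transaction B is 1, and the time difference between transaction A and transaction D is 3.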
而根据各目标事务的时间属性,确定候选模式的周期值,仍以上述示例进行说明,假设针对候选模式1,其对应的目标事务包括A、C、E,则候选模式1的周期值为三个目标事务中相邻两个差值中的最大值,其中A和C的时间差值为2,C和E的时间差值为2,即候选模式1的周期值为2。
在服务器上实现时,可以由处理器1根据各目标事务的时间属性,确定候选模式的周期值。
步骤S240,若所述候选模式的周期值小于等于设定的周期阀值,则将所述候选模式确定为挖掘结果。
具体地,如果某一候选模式的周期值小于等于设定的周期阀值,则代表符合用户定义的周期大小条件,可以将该候选模式确定为挖掘结果。具体实施时,可以由处理器1比较各候选模式的周期值与设定的周期阀值的大小关系,并将周期值小于等于设定的周期阀值的候选模式确定为挖掘结果,通过显示屏5输出显示。
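In the pattern mining method provided by this embodiment of the application, the utility value of each obtained candidate pattern is calculated in every transaction, and transactions in which that utility value is below the set utility threshold are deleted. The pattern utility in those transactions is too small, so deleting them reduces the mining computation time. The period value of the candidate pattern is then determined from the time attributes of the remaining target transactions, and the candidate pattern is determined to be a mining result when its period value is less than or equal to the set period threshold. This ensures that the utility of the mined patterns is evenly distributed over time, which makes precise decisions easier.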
下面对上述步骤S230的实现过程进行介绍,参见图3,图3为本申请实施例公开的一种确定候选模式的周期值的方法流程图,如图3所示,该方法包括:
步骤S300,根据各目标事务的时间属性,计算相邻两目标事务的时间差值。
每一目标事务均存在时间属性,根据目标事务的时间属性,计算相邻两个目标事务的时间差值,具体计算过程为:数据库中事务按照时间先后顺序排序,针对数据库中顺序排序的各目标事务:若上述目标事务之前不存在任何其它目标事务,则计算该目标事务与数据库中首个事务的时间差值;若上述目标事务之后不存在任何其它目标事务,则计算数据库中末尾事务与该目标事务的时间差值;若上述目标事务之前存在其它目标事务,则计算该目标事务与前一相邻目标事务的时间差值。
为了便于理解,下面进行举例说明。
假设事务数据库中包含A,B,C,D,E五个事务,其中目标事务为事务B和C,对于目标事务B而言,由于其前面不存在其它目标事务,则计算目标事务B与数据库中首个事务A的时间差值,为1;对于目标事务C,由于其后不存在任何其它目标事务,则计算目标事务C与数据库中末尾事务E的时间差值为2;且对于目标事务C,其前面存在目标事务B,计算该两个目标事务的时间差值为1。
步骤S310,将各所述时间差值中最大时间差值确定为所述候选模式的周期值。
仍参见上述例子进行说明,各时间差值包括1,2,1,其中最大时间差值为2,故,确定候选模式的周期值为2。
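Steps S300 and S310 can be sketched as a single function. This is an illustrative sketch under the abstraction used above, where transactions are numbered 1 to n in time order and adjacent transactions are one time unit apart; the function name and the handling of an empty target list are assumptions, not part of the source.

```python
def period_value(target_positions, db_length):
    """Largest time gap over the target transactions of a candidate pattern.

    target_positions: 1-based positions of the target transactions.
    db_length: number of transactions in the database.
    """
    if not target_positions:
        # Assumption: with no target transaction, treat the whole
        # database length as the gap.
        return db_length
    positions = sorted(target_positions)
    gaps = [positions[0] - 1]                                   # first transaction -> first target
    gaps += [b - a for a, b in zip(positions, positions[1:])]   # neighbouring targets
    gaps.append(db_length - positions[-1])                      # last target -> last transaction
    return max(gaps)


# Database A..E (positions 1..5), target transactions B and C.
period_bc = period_value([2, 3], 5)
# Target transactions A, C and E.
period_ace = period_value([1, 3, 5], 5)
```

Both calls give a period value of 2, matching the two worked examples in the text.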
候选模式的周期值的含义为,对于包含候选模式的事务,删除其中模式效用值小于设定效用阀值的事务之后,以剩余事务的时间差的最大值作为候选模式的周期值。
参见图4,对上述步骤S200的过程进行介绍,该方法包括:
步骤S400,扫描所述数据库中的各事务,获取在各事务中效用值的和值达到设定的扩展效用阀值的项目,由获取的项目组成第1层候选模式集合HTWUSPI1
其中,扩展效用阀值大于等于所述效用阀值,其中,一种可选地设置方式,扩展效用阀值M与效用阀值Y之间的关系为:M=Y*TU*1/T,其中,TU为事务数据库中所有事务效用值的和值,T为设定的周期阀值。
步骤S410,在扫描事务数据库时记录所述项目集中各项目所在事务,以及各事务的效用值。
在执行步骤S400的同时,还可以同时记录项目集中各项目所在的事务,以及各事务的效用值。具体操作时,可以记录项目所在事务的事务编号,以及各事务编号与对应事务的效用值,其中,事务的效用值为事务所包含各项目的效用值的和值。
步骤S420,利用Apriori_gen函数以及所述HTWUSPI1,逐层产生第k层候选模式集合HTWUSPIk,直至HTWUSPIk+1为空,由HTWUSPI1至HTWUSPIk组成最终的候选模式集合。
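Apriori_gen is the function provided by the Apriori algorithm, with which candidate pattern sets can be generated level by level. The level-k candidate pattern set HTWUSPIk is generated by combining, pair by pair, the candidate patterns in the level-(k-1) candidate pattern set HTWUSPIk-1 that satisfy the condition.
Next, the process of generating HTWUSPIk in step S420 is described. Referring to FIG. 5, the process includes: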
步骤S500,对HTWUSPIk-1中的候选模式两两组合,得到若干候选模式对。
步骤S510,在所述若干候选模式对中,选取包含k-2个相同项目的候选模式对。
如果某一候选模式对中的两个候选模式,包含k-2个相同的项目,则选取该候选模式对。
步骤S520,将选取的候选模式对进行合并,得到初步候选模式。
举例说明:假设k=4,HTWUSPI4-1中存在两个候选模式[a,b,c]、[a,b,d],这两个候选模式包含4-2个相同的项目,对两个候选模式进行合并,合并后得到初步候选模式:[a,b,c,d]。
步骤S530,针对每一初步候选模式,确定所述初步候选模式所包含的每一项目所在的事务,并确定各项目所在事务的交集,将交集事务确定为所述初步候选模式所在的事务。
具体地,为了确定初步候选模式所在的事务,可以根据上述步骤S410中记录的项目集中各项目所在事务,确定初步候选模式所包含的每一项目所在的事务,并确定各项目所在事务的交集,该交集事务即为所述初步候选模式所在的事务。
步骤S540,至少在所述初步候选模式所在的各事务的效用值的和值达到所述扩展效用阀值时,将所述初步候选模式加入HTWUSPIk
具体地,根据步骤S410中记录的事务数据库中各事务的效用值,可以确定初步候选模式所在的各事务的效用值的和,再确定至少满足该和值达到扩展效用阀值的初步候选模式,将确定的初步候选模式加入HTWUSPIk
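Steps S500 to S540 can be sketched as below. This is a simplified stand-in for the Apriori_gen-based generation, not a reproduction of it. The item-to-transaction map mirrors the four-transaction example used later in this document, while the transaction utilities in `tx_utility` and the threshold value 40 are hypothetical.

```python
from itertools import combinations


def generate_level(prev_level, item_tids, tx_utility, ext_threshold):
    """Build level k from level k-1; patterns are sorted tuples of items."""
    if not prev_level:
        return []
    k = len(prev_level[0]) + 1
    result = set()
    for p, q in combinations(prev_level, 2):
        if len(set(p) & set(q)) != k - 2:      # S510: must share k-2 items
            continue
        merged = tuple(sorted(set(p) | set(q)))  # S520: merge the pair
        # S530: transactions containing the merged pattern are the
        # intersection of the transaction sets of its items.
        tids = set.intersection(*(item_tids[i] for i in merged))
        # S540: keep the pattern if its transactions' utilities reach the threshold.
        if sum(tx_utility[t] for t in tids) >= ext_threshold:
            result.add(merged)
    return sorted(result)


item_tids = {"a": {1, 2, 3}, "b": {1}, "c": {1, 2, 4}, "d": {1, 2, 3}}
tx_utility = {1: 30, 2: 20, 3: 15, 4: 10}     # hypothetical values
level1 = [("a",), ("b",), ("c",), ("d",)]
level2 = generate_level(level1, item_tids, tx_utility, 40)
level3 = generate_level(level2, item_tids, tx_utility, 40)
```

Working on recorded transaction sets means the database does not have to be rescanned for each preliminary candidate pattern.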
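Further, this embodiment of the application proposes a pruning strategy for the generation of HTWUSPIk that reduces the generation of candidate patterns whose period value does not satisfy the set period threshold. The generation of HTWUSPIk with this pruning strategy is shown in FIG. 6 and includes: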
步骤S600,对HTWUSPIk-1中的候选模式两两组合,得到若干候选模式对。
步骤S610,在所述若干候选模式对中,选取包含k-2个相同项目的候选模式对。
如果某一候选模式对中,两个候选模式包含k-2个相同的项目,则选取该对候选模式对。
步骤S620,由选取的候选模式对进行合并,得到初步候选模式。
步骤S630,针对每一初步候选模式,确定所述初步候选模式所包含的每一项目所在的事务,并确定各项目所在事务的交集,将交集事务确定为所述初步候选模式所在的事务。
步骤S640,计算所述初步候选模式所在的各事务的效用值的和值。
根据步骤S410中记录的数据库中各事务的效用值,可以确定所述初步候选模式所在的各事务的效用值的和值。
步骤S650,根据所述初步候选模式所在的各事务的时间属性,确定所述初步候选模式的周期值。
针对初步候选模式所在的各事务,根据各事务的时间属性,计算相邻两事务的时间差值,并将计算得到的各时间差值中最大时间差值确定为初步候选模式的周期值。
步骤S660,在所述初步候选模式所在的各事务的效用值的和值达到所述扩展效用阀值,且所述初步候选模式的周期值小于等于设定的周期阀值时,将所述初步候选模式加入HTWUSPIk
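Compared with the previous implementation, this embodiment adds a period-threshold check when generating HTWUSPIk and filters out preliminary candidate patterns whose period value exceeds the period threshold, which reduces the number of subsequent database scans and shortens the pattern mining time.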
对于获取满足设定条件的候选模式集合的过程,提出了另一种剪枝策略,能够减少效用值未达到设定的效用阀值的候选模式的产生,对于融合该剪枝策略的获取满足设定条件的候选模式集合的过程进行介绍,参见图7,该过程包括:
步骤S700,扫描所述数据库中的各事务,获取在各事务中效用值的和值达到设定的扩展效用阀值的项目,由获取的项目组成第1层候选模式集合HTWUSPI1
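The extended utility threshold is greater than or equal to the utility threshold. In one optional setting, the extended utility threshold M and the utility threshold Y are related by M=Y*TU*1/T, where TU is the sum of the utility values of all transactions in the database and T is the set period threshold.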
步骤S710,在扫描事务数据库时记录所述项目集中各项目所在事务,以及各事务的效用值。
步骤S720,确定事务的效用值小于所述效用阀值的低效用事务,并在记录的各项目所在事务中删除所述低效用事务。
步骤S730,利用Apriori_gen函数以及所述HTWUSPI1,逐层产生第k层候选模式集合HTWUSPIk,直至HTWUSPIk+1为空,由HTWUSPI1至HTWUSPIk组成最终的候选模式集合。
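Compared with the process of obtaining the candidate pattern set illustrated in FIG. 4, this implementation adds a step that deletes low-utility transactions. The recorded transactions of each item in the itemset therefore contain no low-utility transactions, which to some extent avoids generating candidate patterns whose utility value does not reach the set utility threshold, reducing the number of subsequent database scans and shortening the pattern mining time.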
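To make this embodiment clearer, the overall solution is described below through a complete example. Suppose the database contains the following transactions: transaction 1 (2a,b,c,d,2f), transaction 2 (a,c,d,3e), transaction 3 (a,d,f,h) and transaction 4 (c,e,g,h); the user sets a utility threshold Y, an extended utility threshold M and a period threshold T.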
模式挖掘过程如下:
S1,扫描数据库,获取在各事务中效用值的和值达到M的项目,由获取的项目组成第1层候选模式集合HTWUSPI1
Suppose the HTWUSPI1 that satisfies the condition is {[a], [b], [c], [d]}.
S2,记录项目集中各项目所在事务,以及各事务的效用值;具体记录信息可以参照下述两个表:
表1 项目集中各项目所在事务
Figure PCTCN2017102663-appb-000001
表2 各事务的效用值
事务编号 1 2 3 4
事务效用值 X1 X2 X3 X4
S3,确定事务的效用值小于所述效用阀值的低效用事务,并在记录的各项目所在事务中删除所述低效用事务。
假设事务4的效用值X4小于效用阀值Y,则对上表1进行修改,删除其中的事务4,修改后如下表3:
表3 修改后的项目集中各项目所在事务
Figure PCTCN2017102663-appb-000002
S4,生成HTWUSPI2
具体生成过程如下:
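S41: The candidate patterns in HTWUSPI1 {[a], [b], [c], [d]} are combined in pairs, and the candidate pattern pairs that share 2-2 identical items are merged, giving the preliminary candidate patterns [a,b], [a,c], [a,d], [b,c], [b,d] and [c,d].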
S42,对每一初步候选模式,确定所述初步候选模式所包含的每一项目所在的事务,并确定各项目所在事务的交集,将交集事务确定为所述初步候选模式所在的事务。
具体确定的各初步候选模式所在的事务如下:
[a,b]所在事务包括:事务1;
[a,c]所在事务包括:事务1、事务2;
……
[c,d]所在事务包括:事务1、事务2。
S43,在所述初步候选模式所在的各事务的效用值的和值达到所述扩展效用阀值,且所述初步候选模式的周期值小于等于设定的周期阀值时,将所述初步候选模式加入HTWUSPI2
为了更加清楚的说明,这里仅以初步候选模式[a,c]为例进行说明:
[a,c]所在的各事务的效用值的和值为:X1+X2;[a,c]的周期值计算如下:
数据库包括事务1-4,[a,c]所在事务为事务1和事务2,因此按照本申请公开的差值计算方式得到如下若干时间差值:1-1、2-1、4-2,选取其中最大时间差值4-2=2,作为[a,c]的周期值。
It is then determined whether X1+X2 reaches M and whether 2 is less than or equal to T; if both hold, [a,c] is added to HTWUSPI2.
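The transaction set and period value of [a,c] in this example can be reproduced with a few lines. This is only a sketch of steps S2, S3 and S43 for one pattern; item quantities are omitted because they do not affect which transactions contain an item, and the variable names are illustrative.

```python
# Items of transactions 1-4 from the example (quantities omitted).
transactions = {
    1: {"a", "b", "c", "d", "f"},
    2: {"a", "c", "d", "e"},
    3: {"a", "d", "f", "h"},
    4: {"c", "e", "g", "h"},
}
low_utility = {4}   # the example assumes X4 is below the utility threshold Y

# S2/S3: record the transactions of every item, skipping low-utility ones.
item_tids = {}
for tid, items in transactions.items():
    if tid in low_utility:
        continue
    for item in items:
        item_tids.setdefault(item, set()).add(tid)

# S42: transactions of [a, c] are the intersection of the item records.
tids_ac = item_tids["a"] & item_tids["c"]

# S43: gaps to the first transaction, between neighbours, and to the last one.
positions = sorted(tids_ac)
n = len(transactions)
gaps = ([positions[0] - 1]
        + [b - a for a, b in zip(positions, positions[1:])]
        + [n - positions[-1]])
period_ac = max(gaps)
```

The gaps are 1-1, 2-1 and 4-2, so the period value of [a,c] is 2, as stated above.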
S5,生成HTWUSPI3
具体生成过程可以参照HTWUSPI2的生成过程,此处不再赘述。
假设生成的HTWUSPI4为空,即不存在HTWUSPI4
将生成的HTWUSPI1-HTWUSPI3作为候选模式集合。
假设,HTWUSPI1包括:{[a]、[b]、[c]、[d]};
HTWUSPI2包括:{[a,b]、[a,c]、[a,d]};
HTWUSPI3包括:{[a,c,d]}。
S6,针对每一候选模式,计算所述候选模式在每一事务中的效用值,确定所述效用值达到Y的目标事务,并根据各所述目标事务的时间属性,确定所述候选模式的周期值。
为了更加清楚的说明,此处仅以候选模式[a,c,d]为例进行说明:
[a,c,d]在事务1中的效用值为X11,在事务2中的效用值为X21。若确定X11和X21均大于等于Y,则将事务1和事务2确定为目标事务。根据目标事务的时间属性,确定[a,c,d]的周期值的过程可以参照上文相关介绍,该周期值为2。
S7,若所述候选模式的周期值小于等于T,则将所述候选模式确定为挖掘结果。
假定[a,c,d]的周期值2小于等于T,则可以将[a,c,d]作为挖掘得到的一个结果。
上文是对本申请实施例提供的一种模式挖掘方法进行解释说明,下面为本申请实施例提供的一种模式挖掘装置进行解释说明,下文描述的模式挖掘装置与上文描述的模式挖掘方法可相互对应参照。
参见图8,图8为本申请实施例公开的一种模式挖掘装置结构示意图,如图8所示,该装置包括:
候选模式集合获取单元810,用于根据数据库中包含的各事务,获取满足设定条件的候选模式集合,其中,每个事务包括至少一个项目;所述候选模式集合中的每个候选模式包括至少一个项目集中的项目;所述项目集是根据每个事务中的项目生成的集合;
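A utility value calculation unit 820, configured to calculate, for each candidate pattern in the candidate pattern set, the utility value of the candidate pattern in each transaction.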
目标事务确定单元830,用于确定所述效用值达到设定的效用阀值的目标事务。
候选模式周期值确定单元840,用于根据各所述目标事务的时间属性,确定所述候选模式的周期值。
挖掘结果确定单元850,用于若所述候选模式的周期值小于等于设定的周期阀值,则将所述候选模式确定为挖掘结果。
可选地,候选模式周期值确定单元840可以包括:
时间差值计算单元,用于根据各目标事务的时间属性,计算相邻两目标事务的时间差值。
最大时间差值选取单元,用于将各所述时间差值中最大时间差值确定为所述候选模式的周期值。
可选地,时间差值计算单元可以包括:
第一时间差值计算子单元,用于针对数据库中顺序排序的各目标事务,若所述目标事务之前不存在任何其它目标事务,则计算所述目标事务与所述数据库中首个事务的时间差值。
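A second time difference calculation subunit, configured to calculate, if no other target transaction follows the target transaction, the time difference between the last transaction in the database and the target transaction.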
第三时间差值计算子单元,用于若所述目标事务之前存在其它目标事务,则计算所述目标事务与前一相邻目标事务的时间差值。
可选地,上述候选模式集合获取单元810可以包括:
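A level-1 candidate pattern set obtaining unit, configured to scan the transactions in the database, obtain the items whose summed utility value over the transactions reaches the set extended utility threshold, and form the level-1 candidate pattern set HTWUSPI1 from the obtained items, where the extended utility threshold is greater than or equal to the utility threshold.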
事务记录单元,用于在扫描所述数据库时记录所述项目集中各项目所在事务,以及各事务的效用值。
第k层候选模式集合产生单元,用于利用Apriori_gen函数以及所述HTWUSPI1,逐层产生第k层候选模式集合HTWUSPIk,直至HTWUSPIk+1为空,由HTWUSPI1至HTWUSPIk组成最终的候选模式集合。
其中,上述第k层候选模式集合产生单元可以包括:
候选模式两两组合单元,用于对HTWUSPIk-1中的候选模式两两组合,得到若干候选模式对。
候选模式对选取单元,用于在所述若干候选模式对中,选取包含k-2个相同项目的候选模式对。
候选模式对合并单元,用于由选取的候选模式对进行合并,得到初步候选模式。
初步候选模式所在事务确定单元,用于针对每一初步候选模式,确定所述初步候选模式所包含的每一项目所在的事务,并确定各项目所在事务的交集,将交集事务确定为所述初步候选模式所在的事务。
初步候选模式加入集合单元,用于至少在所述初步候选模式所在的各事务的效用值的和值达到所述扩展效用阀值时,将所述初步候选模式加入HTWUSPIk
其中,上述初步候选模式加入集合单元可以包括:
第一初步候选模式加入集合子单元,用于计算所述初步候选模式所在的各事务的效用值的和值。
第二初步候选模式加入集合子单元,用于根据所述初步候选模式所在的各事务的时间属性,确定所述初步候选模式的周期值。
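A third preliminary candidate pattern set-adding subunit, configured to add the preliminary candidate pattern to HTWUSPIk when the sum of the utility values of the transactions containing the preliminary candidate pattern reaches the extended utility threshold and the period value of the preliminary candidate pattern is less than or equal to the set period threshold.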
可选地,所述候选模式集合获取单元810还可以包括:
低效用事务删除单元,用于在所述事务记录单元之后,确定事务的效用值小于所述效用阀值的低效用事务,并在所述事务记录单元记录的各项目所在事务中删除所述低效用事务。
本申请实施例还提供一种模式挖掘设备,该设备包括:
处理器以及存储器;
存储器,用于存储计算机程序;
处理器,用于读取所述计算机程序,以执行可执行指令,所述可执行指令用于实现上述模式挖掘方法。本申请实施例还提供一种存储介质,该存储介质用于存储程序代码,程序代码用于执行上述模式挖掘方法。
本申请实施例还提供一种包括指令的计算机程序产品,当其在计算机上运行时,使得计算机执行上述模式挖掘方法。
以上对本申请提供的模式挖掘方法和装置进行了解释说明。
下文对本申请提供的高效用项集挖掘方法及相关设备分别进行解释说明。
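First, high-utility itemset mining (High Utility Itemset Mining, HUIM) is introduced. HUIM is a utility-based itemset mining technique: it weighs the external utility value of an itemset (such as profit) and its internal utility value (such as the number of occurrences in a transaction, which in a trading scenario may be the traded quantity) to calculate the itemset utility value of the itemset in the database. When the itemset utility value is greater than or equal to a user-defined minimum utility threshold, the itemset is considered a high-utility itemset.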
一种高效用项集挖掘方法是通过设定唯一固定的最低效用阈值作为高效用项集的衡量标准实现,即,在计算出各项集的项集效用值后,将各项集的项集效用值分别与唯一固定的最低效用阈值进行比较,从而将项集效用值大于或等于该唯一固定的最低效用阈值的项集,作为高效用项集。
然而,一个项集中包括的数据项往往是一个或多个,而不同的数据项对应的最低效用阈值往往是不同的,这就导致不同项集对应的最低效用阈值也可能是不同的。本申请实施例提供了一种高效用项集挖掘的方式,解决高效用项集挖掘不准确的问题,提升挖掘出的高效用项集的准确性。
为便于理解本申请实施例描述的技术方案,下面先对本申请实施例涉及的名称概念进行介绍。
1、事务:事务数据库中的一条记录;比如,交易类型的事务数据库中记录的是商品的交易记录,则事务数据库中的每一条事务可以对应一条商品的交易记录。
2、事务编号(TID):事务数据库中不同事务的编号,一般情况下,事务按照时态顺序编号。
3、数据项:事务中记录的信息项目,一条事务中包含至少一个数据项;比如,交易类型的事务数据中,每一条事务中包含交易的商品的数据项,及各商品的内部效用值(如交易数量);交易数量是内部效用值在交易场景下的一种体现形式,在其他场景的事务数据库中,内部效用值的形式可相应的调整。
如下表4所示,交易类型的事务数据库中包含10条事务,每条事务指示一条交易记录,每条事务中包含各交易的商品名称的数据项,及各商品的在事务中的交易数量(内部效用值的一种形式)。
表4 交易类型数据库中的事务
事务编号 事务(商品名称:交易数量)
T1 A:1,C:2,D:3
T2 A:2,D:1,E:2
T3 B:3,C:5
T4 A:1,C:3,D:1,E:2
T5 B:1,D:3,E:2
T6 B:2,D:2
T7 B:3,C:2,D:1,E:1
T8 A:2,C:3
T9 C:2,D:2,E:1
T10 A:2,C:2,D:1
从表4中可以看出,在交易类型的事务数据库中,事务中的数据项可以是商品名称,内部效用值可以是事务中各商品的交易数量。在表4中,事务数据库包含A、B、C、D和E这5个数据项,其中,T1事务的实际意义可以为:一条指示购买1件A商品、2件C商品和3件D商品的交易记录;T7事务的实际意义可以为:一条指示购买3件B商品、2件C商品、1件D商品和1件E商品的购物记录。
在新闻领域,表4中的各事务可以包含至少一条新闻,各事务可以记录每一条新闻的兴趣值、敏感度大小、新鲜度大小等;在股票等领域,表4中的各事务可以包含至少一个股票,各事务可以记录每一个股票的风险大小、收益大小等。
4、项集:至少一个数据项构成的集合,用于表征事务数据库内在的一种关联规则;事务与项集的不同的点是,事务通常是由实际的事件所触发生成的在事务数据库中的记录,而项集通常是从数据库挖掘而出的,并不一定有实际的含义。
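5. k-itemset: a set containing k data items. For example, a 1-itemset is an itemset containing one data item, such as itemset A, which contains only data item A; a 2-itemset is an itemset containing two data items, such as itemset AB, which contains only data items A and B; and so on.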
6、外部效用值表(如利润表,Profit Table):记录事务数据库中各数据项对应的单位外部效用值的表格;在交易类型的事务数据库中,利润表可以是外部效用值表的一种体现形式,也就是说,外部效用值表可以记录事务数据库中各数据项的单位利润值;可参见表5所示的一张具体的利润表:
表5 利润表
数据项 A B C D E
单位利润值 6 12 1 9 3
从表5可以看出,利润表表示的是卖出一件商品可以获得的单位利润,比如卖出一件商品A,可以获得利润6元;卖出一件商品B,可以获得利润12元;可见,外部效用值表可以表示每个数据项对应的单位外部效用值。
7、数据项在事务中的效用值(Utility of an item in a transaction):一个数据项在一条事务中的效用值,可以是某一数据项在一事务中的内部效用值乘以该数据项的单位外部效用值。比如在交易类型的事务数据库中,某一数据项在一事务中的效用值可以是,该数据项在该事务中的交易数量乘以该数据项的单位利润值;以表4和表5所示,数据项B在T3事务中的效用值可以是3×12=36。
8、项集在事务中的效用值(Utility of an itemset in a transaction):某一项集中的各数据项在某一事务中的效用值的加和。以表4和表5所示,项集BC(仅包含数据项B和C的项集)在T3事务中的效用值为3×12+5×1=41。
9、项集效用值(Itemset utility in Database):某一项集在事务数据库中的效用值,即某一项集在包含该项集的所有数据项的各事务中的效用值的加和。
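10. Minimum utility threshold table (Minimum Utility threshold, MMU table): a table, defined in this embodiment of the application, that gives the minimum utility threshold of each data item; see the form of MMU table shown in Table 6. The minimum utility thresholds defined in the table are not fixed: the user can set them according to the actual situation of each data item, for example by updating the minimum utility threshold of each commodity as its price fluctuates.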
表6 最低效用阈值表
数据项 A B C D E
最低效用阈值 56 65 53 50 70
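11. Itemset minimum utility threshold (minimum utility threshold of an itemset, MIU): in this embodiment of the application, different data items may have different minimum utility thresholds (as shown in Table 6), so different itemsets may also have different minimum utility thresholds. To address the low accuracy that results when the prior art sets one fixed minimum utility threshold for all itemsets, this embodiment matches each itemset with an itemset minimum utility threshold suited to the data items it contains.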
针对各项集,本申请实施例可确定项集中最低效用阈值最小的数据项,将所确定的数据项的最低效用阈值作为该项集的项集最低效用阈值,从而得到各项集对应的项集最低效用阈值,为后续准确性较高的高效用项集的挖掘提供基础。
以项集AB的项集最低效用阈值确定为例,项集AB中包含数据项A和数据项B,从表6设置的MMU表可以看出,数据项A的最低效用阈值最小,因此可将数据项A的最低效用阈值作为项集AB的项集最低效用阈值,即项集AB的项集最低效用阈值为56;又如项集BC的项集最低效用阈值为,数据项C的最低效用阈值53。
12、事务的效用值(Transaction Utility):某一事务的效用值为,组成该事务的各个数据项在该事务中的效用值的加和;以表4所示,事务T5中包含数据项B、D和E,本申请实施例可确定事务T5的效用值为1×12+3×9+2×3=45。
13、数据库的总效用值:数据库中各事务的效用值的加和;以表4所示,数据库的总效用值为T1至T10的各事务的效用值的加和为:35+27+41+24+45+42+50+15+23+23=325。
14、最小最低效用阈值(Least Minimum Utility value,LMU):MMU表中最小的最低效用阈值,以表6中的内容为例,最小最低效用阈值为数据项D的最低效用阈值50。
15、高效用项集(High Utility Itemset,HUI):当项集的项集效用值≥该项集的项集最低效用阈值,则该项集为高效用项集;比如项集A的项集效用值为48,小于项集A的项集最低效用阈值56,则项集A不是高效用项集,又如,项集AD的项集效用值为90,大于项集AD的项集最低效用阈值50,则项集AD为高效用项集。
16、项集的事务加权效用(Transaction Weighted Utility,TWU):包含指定项集的事务的效用值之和;以表4和表5所示为例,当指定项集为B时(仅包含数据项B的项集),则包含项集B的事务为T3,T5,T6和T7,相应的T3,T5,T6和T7事务的效用值的加和为41+45+42+50=178,则项集B的事务加权效用为178。
17、高事务加权效用项集(High Transaction Weighted Utilization Itemset,HTWUI):当项集的TWU≥该项集的项集最低效用阈值时,则该项集为高事务加权效用项集;比如,项集B的事务加权效用为178,而项集B的最低效用阈值为65,项集B的事务加权效用大于最低效用阈值,确定项集B为高事务加权效用项集。
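The definitions in items 7 to 17 can be checked against Tables 4, 5 and 6 with a short script. The sketch below is illustrative only: the function names are not from the source, and itemsets are written as strings of single-letter item names purely for brevity.

```python
# Table 4: transaction -> {item: traded quantity (internal utility)}.
DB = {
    "T1": {"A": 1, "C": 2, "D": 3},
    "T2": {"A": 2, "D": 1, "E": 2},
    "T3": {"B": 3, "C": 5},
    "T4": {"A": 1, "C": 3, "D": 1, "E": 2},
    "T5": {"B": 1, "D": 3, "E": 2},
    "T6": {"B": 2, "D": 2},
    "T7": {"B": 3, "C": 2, "D": 1, "E": 1},
    "T8": {"A": 2, "C": 3},
    "T9": {"C": 2, "D": 2, "E": 1},
    "T10": {"A": 2, "C": 2, "D": 1},
}
PROFIT = {"A": 6, "B": 12, "C": 1, "D": 9, "E": 3}    # Table 5
MMU = {"A": 56, "B": 65, "C": 53, "D": 50, "E": 70}   # Table 6


def contains(tid, itemset):
    return all(item in DB[tid] for item in itemset)


def utility_in(itemset, tid):
    """Items 7 and 8: utility of an itemset in one transaction."""
    return sum(DB[tid][item] * PROFIT[item] for item in itemset)


def itemset_utility(itemset):
    """Item 9: summed over the transactions containing every item."""
    return sum(utility_in(itemset, tid) for tid in DB if contains(tid, itemset))


def transaction_utility(tid):
    """Item 12: utility of all items recorded in the transaction."""
    return utility_in(DB[tid], tid)   # iterating a dict yields its keys


def twu(itemset):
    """Item 16: transaction-weighted utility."""
    return sum(transaction_utility(tid) for tid in DB if contains(tid, itemset))


def miu(itemset):
    """Item 11: smallest minimum utility threshold among the items."""
    return min(MMU[item] for item in itemset)


LMU = min(MMU.values())                                  # item 14
TOTAL = sum(transaction_utility(tid) for tid in DB)      # item 13
is_hui_AD = itemset_utility("AD") >= miu("AD")           # item 15
is_htwui_B = twu("B") >= miu("B")                        # item 17
```

For instance, `utility_in("BC", "T3")` evaluates 3×12+5×1 and `twu("B")` sums the utilities of T3, T5, T6 and T7, which correspond to the figures quoted in items 8 and 16.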
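The high-utility itemset mining method provided by this embodiment of the application is described below. The method can be applied to a data processing device with data processing capability, such as a data processing server on the network side. Depending on the data mining scenario, high-utility itemsets may also be mined on a device such as a computer on the user side. As shown in FIG. 9, the method includes: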
步骤S900,确定事务数据库中各项集对应的项集效用值。
可选地,一个项集对应的项集效用值表示:该项集在该项集对应的各目标事务中的效用值的加和,一个项集的目标事务为包含该项集所有数据项的事务。一个项集在目标事务中的效用值表示:该项集的各数据项在目标事务中的效用值的加和。
可选地,事务数据库中可以包括至少一条事务,一条事务可以记录有至少一个数据项及各数据项对应的内部效用值,一个项集可以包括至少一个数据项;
可选地,一个数据项在一个事务中的效用值表示:该数据项在该事务中的内部效用值及该数据项对应的单位外部效用值的乘积,各数据项对应的外部效用值可根据预定义的外部效用值表确定,外部效用值表记录有各数据项对应的单位外部效用值。
比如,在交易类型的数据库中,本申请实施例可预先设定利润值表(利润值表为外部效用值表的一种形式),通过利润值表记录各商品的单位利润值(商品为数据项的一种形式,单位利润值为单位外部效用值的一种形式),也就是说,一个商品在一个交易事务中的效用值是该商品在该交易事务中的交易数量(交易数量为内部效用值的一种形式)与该商品的单位利润值的乘积。
步骤S910,根据预定义的最低效用阈值表,确定各项集对应的项集最低效用阈值。
上述预定义的最低效用阈值表,记录有各数据项对应的最低效用阈值,一个项集对应的项集最低效用阈值表示:该项集包含的数据项所对应的最低效用阈值中的最小最低效用阈值。
步骤S920,将各项集的项集效用值与对应的项集最低效用阈值进行比对,根据比对结果确定高效用项集,其中,高效用项集的项集效用值不小于对应的项集最低效用阈值。
本申请实施例定义了记录有各数据项对应的最低效用阈值的最低效用阈值表,在确定每一个项集所对应的项集最低效用阈值时,通过比较项集包含的数据项所对应的最低效用阈值,从而将项集所包含的数据项对应的最低效用阈值中的最小最低效用阈值,作为项集所对应的项集最低效用阈值,使得所确定的各项集对应的项集最低效用阈值更为贴近项集的最低效用情况;基于所确定的各项集的项集最低效用阈值,将各项集的项集效用值与对应的项集最低效用阈值进行比对,从而确定出项集效用值不小于对应的项集最低效用阈值的高效用项集,实现高效用项集的挖掘。
本申请实施例提供的高效用项集挖掘方法并不是以唯一固定的最低效用阈值,作为高效用项集的挖掘标准,而是将每个项集所包含的数据项对应的最小最低效用阈值,作为每个项集的项集最低效用阈值,使得所确定的各项集对应的项集最低效用阈值更为贴近项集的最低效用情况,进而将各项集的项集效用值与项集对应的项集最低效用阈值进行比对,来实现高效用项集的挖掘,将使得挖掘结果更为准确;本申请实施例提高了高效用项集挖掘的准确性。
以表4、5和6所示为基础,下表7示出了项集效用值不小于项集最低效用阈值的高效用项集的示意图,如下表7所示:
表7 项集效用值不小于项集最低效用阈值的高效用项集
项集 项集最低效用阈值 项集效用值
(B) 65 108
(D) 50 126
(AD) 50 90
(BC) 53 79
(BD) 50 126
(CD) 50 83
(DE) 50 96
(ACD) 50 76
(BDE) 50 93
(CDE) 50 55
(BCDE) 50 50
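Table 7 can be reproduced by applying steps S900 to S920 to every possible itemset. The brute-force enumeration below is only a check on the small example database; it is not the utility-list method described later, and its variable names are illustrative.

```python
from itertools import combinations

# Table 4: transaction -> {item: traded quantity}.
DB = {
    "T1": {"A": 1, "C": 2, "D": 3},
    "T2": {"A": 2, "D": 1, "E": 2},
    "T3": {"B": 3, "C": 5},
    "T4": {"A": 1, "C": 3, "D": 1, "E": 2},
    "T5": {"B": 1, "D": 3, "E": 2},
    "T6": {"B": 2, "D": 2},
    "T7": {"B": 3, "C": 2, "D": 1, "E": 1},
    "T8": {"A": 2, "C": 3},
    "T9": {"C": 2, "D": 2, "E": 1},
    "T10": {"A": 2, "C": 2, "D": 1},
}
PROFIT = {"A": 6, "B": 12, "C": 1, "D": 9, "E": 3}    # Table 5
MMU = {"A": 56, "B": 65, "C": 53, "D": 50, "E": 70}   # Table 6


def itemset_utility(itemset):
    """S900: utility summed over transactions containing every item."""
    total = 0
    for tx in DB.values():
        if all(item in tx for item in itemset):
            total += sum(tx[item] * PROFIT[item] for item in itemset)
    return total


items = sorted(PROFIT)
high_utility = {}
for k in range(1, len(items) + 1):
    for itemset in combinations(items, k):
        utility = itemset_utility(itemset)
        threshold = min(MMU[item] for item in itemset)   # S910
        if utility >= threshold:                         # S920
            high_utility["".join(itemset)] = (threshold, utility)
```

Each entry of `high_utility` maps an itemset to its (itemset minimum utility threshold, itemset utility value) pair, in the same column order as Table 7.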
可选地,本申请实施例提供的确定事务数据库中各项集对应的项集效用值的方式可以是:对于各项集,先确定事务数据库中包含该项集的所有数据项的至少一目标事务,并确定该项集的所有数据项在所确定的各目标事务中的效用值,并将所确定各效用值相加和,得到该项集的项集效用值。
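Based on Table 4 and Table 5, the itemset utility value of itemset B (the itemset containing only data item B) is 3×12+1×12+2×12+3×12=108. In this embodiment, the transactions T3, T5, T6 and T7 that contain data item B are determined first; the utility value of itemset B is 3×12 in T3, 1×12 in T5, 2×12 in T6 and 3×12 in T7, and adding these utility values gives the itemset utility value 108.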
项集BC(仅包含数据项B和C的项集)的项集效用值为(3×12+5×1)+(3×12+2×1)=79,在本申请实施例中,可确定包含数据项B和C的事务T3和T7,确定项集BC在事务T3中的效用值3×12+5×1,确定项集BC在事务T7中的效用值3×12+2×1,从而将所确定的各效用值加和,得到79的项集效用值。
可选地,参见图10,本申请实施例提供的确定事务数据库中各项集对应的项集效用值的过程,可以包括:
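Step S1000: according to the internal utility value of each data item in each transaction and the unit external utility value of each data item recorded in the predefined external utility value table, recursively construct the utility list corresponding to each itemset.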
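The utility list of an itemset is a series of tuples, one for each transaction in which the itemset appears in the database (that is, each target transaction of the itemset). It records the transaction number of each target transaction of the itemset, the utility value of the itemset in each target transaction, and the remaining utility value of the itemset in each target transaction. The remaining utility value of an itemset in a transaction is obtained by sorting the data items of the transaction in ascending order of minimum utility threshold and summing the utility values of the data items that are ranked to the right of the itemset's data items in that transaction.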
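For itemsets with a single data item, the utility list can be built in one pass over the database. The sketch below is an illustration under the definitions above, using Tables 4, 5 and 6; the function name and the tuple layout `(tid, iu, ru)` are assumptions made for the example.

```python
DB = {
    "T1": {"A": 1, "C": 2, "D": 3},
    "T2": {"A": 2, "D": 1, "E": 2},
    "T3": {"B": 3, "C": 5},
    "T4": {"A": 1, "C": 3, "D": 1, "E": 2},
    "T5": {"B": 1, "D": 3, "E": 2},
    "T6": {"B": 2, "D": 2},
    "T7": {"B": 3, "C": 2, "D": 1, "E": 1},
    "T8": {"A": 2, "C": 3},
    "T9": {"C": 2, "D": 2, "E": 1},
    "T10": {"A": 2, "C": 2, "D": 1},
}
PROFIT = {"A": 6, "B": 12, "C": 1, "D": 9, "E": 3}
MMU = {"A": 56, "B": 65, "C": 53, "D": 50, "E": 70}


def build_item_utility_lists(db, profit, mmu):
    """Return {item: [(tid, iu, ru), ...]} for every single data item."""
    order = sorted(mmu, key=mmu.get)          # ascending minimum utility threshold
    rank = {item: i for i, item in enumerate(order)}
    lists = {item: [] for item in order}
    for tid, tx in db.items():
        ordered = sorted(tx, key=rank.get)
        remaining = sum(tx[i] * profit[i] for i in ordered)
        for item in ordered:
            iu = tx[item] * profit[item]
            remaining -= iu                   # utility of the items ranked after `item`
            lists[item].append((tid, iu, remaining))
    return lists


utility_lists = build_item_utility_lists(DB, PROFIT, MMU)
```

In T1, for example, the items are ordered D, C, A, so data item D has utility 27 and remaining utility 2+6=8.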
步骤S1010,根据各项集对应的效用列表,计算出各项集的项集效用值。
在构建出各项集对应的效用列表后,可根据各项集对应的效用列表,计算出各项集的项集效用值。由于各项集对应的效用列表记录有各项集在各目标事务对应的效用值,可将各项集在各目标事务对应的效用值的加和,作为各项集的项集效用值。
在图10所示方法中,确定各项集对应的项集效用值时,如何以递归方式构建各项集对应的效用列表是一个关键点。本申请实施例,可以分层级以递归方式构建各项集对应的效用列表,一个项集所处于的层级序数与该项集所包含的数据项的数量相对应,即第一层级的各项集仅包含一个数据项,第二层级的各项集仅包含两个数据项,以此类推;且下一层级的项集对应的效用列表,可通过至少两个能够组合成该项集的高层级项集的效用列表构建。
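Optionally, when the utility lists are built level by level and recursively as described above, an enumerated minimum utility threshold tree (MIU tree) can be built first. The enumerated MIU tree can be regarded as an extended version of the conventional enumeration tree. It contains itemsets arranged by level: the level ordinal of an itemset in the MIU tree corresponds to the number of data items it contains, and the itemsets on each level are sorted in ascending order of minimum utility threshold.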
可选地,构建出MIU树后,可基于各数据项在各事务对应的外部效用值和各数据项的内部效用值,构建出与MIU树相结合的各项集对应的效用列表,且下一层级的项集对应的效用列表,可通过至少两个能够组合成该项集的高层级项集的效用列表构建。
在构建MIU树时,可先确定事务数据库中包含一个数据项的各项集,并将所确定的各项集排序在MIU树的第一层级,构建出位于MIU树第一层级的项集;然后以深度优先搜索的方式,依序从MIU树第一层级的各项集出发,构建出分层级的项集,并使得一个项集在MIU树中所处于的层级序数与该项集所包含的数据项的数量相对应,从而形成MIU树。
可选地,在MIU树的一个层级中,项集可以随机的排序,也可以按照最低效用阈值从小到大的顺序排序。
可选地,参见图11,示出的构建MIU树的方法,该方法可以包括:
步骤S1100,确定事务数据库中包含一个数据项的各项集,并将所确定的各项集按照最低效用阈值从小到大的顺序排序在MIU树的第一层级,构建出位于MIU树第一层级的项集。
在构建MIU树时,可以先确定事务数据库中包含一个数据项的各项集,即各1-项集;并将所确定的各项集按照最低效用阈值从小到大的顺序排序在MIU树的第一层级,构建出位于MIU树第一层级的项集。
步骤S1110,以深度优先搜索的方式,依序从MIU树第一层级的各项集出发,构建出分层级的项集,并使得一个项集在MIU树中所处于的层级序数与该项集所包含的数据项的数量相对应,且各层级的项集按照最低效用阈值从小到大的顺序排序,形成MIU树。
在构建出位于MIU树第一层级的项集后,可以以深度优先搜索的方式,构建出MIU树的分层级的项集。
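FIG. 12 shows the corresponding MIU tree structure. With reference to FIG. 12, the itemsets A, B, C, D and E that contain one data item in the transaction database are determined first. According to Table 6, sorting them in ascending order of minimum utility threshold gives D, C, A, B and E, so D, C, A, B and E are placed in that order on the first level of the MIU tree. After the first level is built, the itemsets DC, DA, DB and DE that correspond to itemset D on the second level are built starting from itemset D and sorted in ascending order of minimum utility threshold. Next, the itemsets DCA, DCB and DCE that correspond to itemset DC on the third level are built and sorted; then the itemsets DCAB and DCAE that correspond to DCA on the next level, and the itemset DCABE below DCAB. The process then returns to build the next-level itemsets of the remaining nodes, such as those of itemset DA, and so on, so that the hierarchical itemsets are built starting from each first-level itemset in turn.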
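The enumeration order of the MIU tree can be sketched as a set-enumeration tree in which each node is extended only by data items ranked after its last item. This is an illustrative reading of the construction, checked against Table 6; FIG. 12 itself is not reproduced here and the function names are assumptions.

```python
MMU = {"A": 56, "B": 65, "C": 53, "D": 50, "E": 70}   # Table 6
ORDER = sorted(MMU, key=MMU.get)                      # ascending threshold order


def children(itemset):
    """Extend an itemset with every item ranked after its last item."""
    start = ORDER.index(itemset[-1]) + 1 if itemset else 0
    return [itemset + item for item in ORDER[start:]]


def dfs(itemset=""):
    """Depth-first listing of every node below `itemset`."""
    nodes = []
    for child in children(itemset):
        nodes.append(child)
        nodes.extend(dfs(child))
    return nodes


first_level = children("")
all_nodes = dfs()
```

With five data items the tree has 31 nodes, one for each non-empty itemset, and every itemset appears exactly once because items are only ever appended in threshold order.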
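Optionally, once the MIU tree has been built and the order of the itemsets on each level is known, the utility value and the remaining utility value of each first-level itemset in each of its target transactions can be calculated in turn. For example, for the first itemset on the first level, this embodiment may record in a table the transaction numbers of its target transactions, its utility value in each target transaction and its remaining utility value in each target transaction; processing every first-level itemset in this way gives the utility lists of the first-level itemsets. Based on Tables 4, 5 and 6, FIG. 13 shows the utility lists of the first-level itemsets in the MIU tree, where tid denotes the transaction number, iu the utility value and ru the remaining utility value.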
在确定第一个层级的各项集对应的效用列表后,MIU树中下一层级的项集的效用列表可通过至少两个能够组合成该项集的高层级项集的效用列表构建。
可选地,此处的至少两个能够组合成该项集的高层级项集,可以是高层级中能够直接组合成该项集的至少两个项集,也可以是高层级中至少两个组合后,通过去除重复数据项能够组合成该项集的项集。
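For example, the utility list of itemset DC on the second level can be formed by combining the utility lists of itemsets D and C on the first level, as shown in FIG. 14. The transaction numbers of the target transactions of the 2-itemset DC are those of the target transactions in which the 1-itemsets D and C both appear, namely {T1,T4,T7,T9,T10}. In T1, the utility value of itemset DC is the sum of the utility values of itemset D and itemset C in T1, that is, 27+2=29; the utility values of DC in the other target transactions are obtained in the same way. The remaining utility value of itemset DC in each target transaction is taken directly from the remaining utility value, in that transaction, of itemset C, the later-ranked of D and C.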
在构建出第一层级中各项集的效用列表后,在构建第二层级的各项集的效用列表时,对于第二层级的各项集,可以确定第一层级中能够组合成该第二层级的项集的两个项集,将该两个项集共同对应的目标事务,作为该第二层级的项集所对应的目标事务,将该两个项集在一共同对应的目标事务中的效用值的加和,作为该第二层级的项集在该目标事务中的效用值;并将该两个项集中排序在后的项集在一共同对应的目标事务中的剩余效用值,作为该第二层级的项集在该目标事务中的剩余效用值,从而得到该第二层级的项集的效用列表。
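As another example, the utility list of itemset DCA on the third level can be formed by combining the utility lists of itemsets DC and DA on the second level, as shown in FIG. 15. When the itemset whose utility list is needed lies on level three or deeper, the construction differs from that for second-level itemsets: the utility value of such an itemset in each target transaction is the sum of the utility values, in that transaction, of the two previous-level itemsets that combine into it, minus the utility value of the itemset's prefix data item in that transaction. In FIG. 15, for example, the utility value of itemset DCA in transaction T10 is the utility value 11 of DC in T10, plus the utility value 21 of DA in T10, minus the utility value 9 of the prefix data item D of DCA in T10, that is, 11+21-9=23. Correspondingly, the remaining utility value of such an itemset in each target transaction is that of the later-ranked of the two previous-level itemsets that combine into it.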
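When building the utility list of an itemset on level three or deeper, the two previous-level itemsets that combine into it are determined, and the target transactions common to these two itemsets are taken as the target transactions of the itemset. For each common target transaction, the sum of the utility values of the two itemsets in that transaction, minus the utility value of the itemset's prefix data item in that transaction, is taken as the utility value of the itemset in that transaction, and the remaining utility value of the later-ranked of the two itemsets in that transaction is taken as the remaining utility value of the itemset in that transaction. This gives the utility list of the itemset on level three or deeper.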
相应的,上述构建各层级的项集所对应的效用列表的伪代码可以如下,具体算法过程可以如下文代码中的Line 5(k≥3的情况,即层级≥3的情况)和Line 7(k=1或2的情况,即层级=1或2的情况):
Input:X,an itemset;X.UL is the utility-list of X;Xab.UL,Xa.UL,Xb.UL,
Figure PCTCN2017102663-appb-000003
and
Figure PCTCN2017102663-appb-000004
Xa≠Xb.//输入:项集X;X对应的效用列表;Xab对应的效用列表;Xa对应的效用列表;Xb对应的效用列表,Xa、Xb均是X的子集,且Xa≠Xb
Output:Xab.UL.//输出:Xab的效用列表
1:set Xab.UL←null.//设Xab对应的效用列表为空
2:for each element Ea∈Xa do
3:
Figure PCTCN2017102663-appb-000005
4:search E∈X.UL∧E.TID:=Ea.TID.
5:Eab←<Ea.TID,Ea.iu+Eb.iu-E.iu,Eb.ru>.//针对Xa和Xb中拥有相同的数据项且Xa对应的效用列表与Xb对应的效用列表存在相同事务,构建K-项集Xab(k≥3)的效用列表
6:else
7:Eab←<Ea.TID,Ea.iu+Eb.iu,Eb.ru>.//针对根据1-项集对应的第1-效用列表生成2-项集对应的第2-效用列表的过程,第2-效用列表中TID为数据项Ea对应的TID,iu为数据项Ea与数据项Eb效用值之和,ru为数据项Eb对应的剩余效用值ru
8:end if
9:Xab.UL←Eab.
10:end for
11:return Xab.UL.//输出Xab的效用列表。
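A runnable sketch of the construction in the pseudocode above is given below. It is an illustration rather than a transcription: the lines of the pseudocode that survive only as figure placeholders are not reproduced, the function name `construct` is an assumption, and the three input lists were derived here from Tables 4, 5 and 6 under the definitions in this document rather than copied from FIG. 13.

```python
def construct(prefix_ul, xa_ul, xb_ul):
    """Combine the utility lists of Xa and Xb into the list of Xab.

    Every list holds (tid, iu, ru) tuples. Xa and Xb share the prefix X,
    and Xa is ranked before Xb. Pass prefix_ul=None when Xa and Xb are
    single data items (the 2-itemset case of Line 7).
    """
    xb_by_tid = {tid: (iu, ru) for tid, iu, ru in xb_ul}
    prefix_iu = {tid: iu for tid, iu, _ in (prefix_ul or [])}
    result = []
    for tid, iu_a, _ in xa_ul:
        if tid not in xb_by_tid:          # keep common transactions only
            continue
        iu_b, ru_b = xb_by_tid[tid]
        iu = iu_a + iu_b
        if prefix_ul is not None:         # Line 5: the prefix was counted twice
            iu -= prefix_iu[tid]
        result.append((tid, iu, ru_b))    # ru comes from the later-ranked list
    return result


# Utility lists of the single data items D, C and A (order D, C, A, B, E).
D_UL = [("T1", 27, 8), ("T2", 9, 18), ("T4", 9, 15), ("T5", 27, 18),
        ("T6", 18, 24), ("T7", 9, 41), ("T9", 18, 5), ("T10", 9, 14)]
C_UL = [("T1", 2, 6), ("T3", 5, 36), ("T4", 3, 12), ("T7", 2, 39),
        ("T8", 3, 12), ("T9", 2, 3), ("T10", 2, 12)]
A_UL = [("T1", 6, 0), ("T2", 12, 6), ("T4", 6, 6), ("T8", 12, 0), ("T10", 12, 0)]

DC_UL = construct(None, D_UL, C_UL)
DA_UL = construct(None, D_UL, A_UL)
DCA_UL = construct(D_UL, DC_UL, DA_UL)
```

Summing the `iu` column of a constructed list gives the itemset utility value, so no further database scan is needed once the single-item lists exist.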
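Optionally, when building the utility list of an itemset on level three or deeper, at least two higher-level itemsets that directly combine into it can be determined, and the target transactions common to these itemsets are taken as the target transactions of the itemset. For each common target transaction, the sum of the utility values of these itemsets in that transaction is taken as the utility value of the itemset in that transaction, and the remaining utility value, in that transaction, of the itemset that is on the highest level among them and ranked last is taken as the remaining utility value of the itemset in that transaction. This gives the utility list of the itemset on level three or deeper.
For example, itemset DCA can be formed by combining itemset DC and itemset A. The utility value of itemset DCA in each target transaction can be the sum of the utility values of itemsets DC and A in that transaction, and the remaining utility value of itemset DCA in each target transaction can be the remaining utility value, in that transaction, of itemset A, which is the highest-level and last-ranked of DC and A; see FIG. 16.
Optionally, after the utility list of each itemset is built, the itemset utility value of each itemset can be calculated from its utility list during mining; according to the predefined minimum utility threshold table, the smallest of the minimum utility thresholds of the data items contained in each itemset is taken as that itemset's itemset minimum utility threshold; the itemset utility value of each itemset is then compared with its itemset minimum utility threshold, and the high-utility itemsets are mined according to the comparison results.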
本申请挖掘高效用项集的算法伪代码可以参照如下算法1和算法2所示。
Figure PCTCN2017102663-appb-000006
Figure PCTCN2017102663-appb-000007
Figure PCTCN2017102663-appb-000008
Figure PCTCN2017102663-appb-000009
在上述算法1中,Line 1为初始化几个变量,Line 2为由MMU表计算出LMU,然后扫描原始数据库计算各个1-项集的TWU值(Line 3),依据MMU表中设定的各个1-项集的最低效用阈值找出高事务加权效用1-项集的集合HTWUI1(Line 4,这里属于应用全局向下封闭特性(Global downward closure property,GDC property));Line 5是对找出的HTWUI1依据它们的最低效用阈值进行从小到大的排序。
Line 6是根据高事务加权效用1-项集生成第1-效用列表;然后调用挖掘函数HUI-Search,递归地根据第1-效用列表生成一系列的后续的效用列表(Line 7),并从生成的效用列表中挖掘高效用项集。
可见,本申请实施例在确定事务数据库中包含一个数据项的各项集,并将所确定的各项集按照最低效用阈值从小到大的顺序排序在MIU树的第一层级,构建出位于MIU树第一层级的项集时,具体可通过如下方式实现:计算包含一个数据项的各项集的项集的事务加权效用值(TWU值),根据包含一个数据项的各项集的最低效用阈值,确定包含一个数据项的各项集中的高事务加权效用项集,将高事务加权效用项集按照最低效用阈值从小到大进行排序;然后再递归地根据包含一个数据项的各项集的效用列表生成一系列的后续的效用列表,形成各项集对应的效用列表。
函数HUI-Search的伪代码如算法2所示,其中Line 5应用了有条件的向下封闭特性(Conditional downward closure property,CDC property)进行提前剪枝操作,Line 8则应用了全局向下封闭特性(Global downward closure property,GDC property)进行剪枝操作。
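Pruning strategy 1: when the MIU tree is traversed depth-first, if according to the utility lists the TWU value of an itemset X is less than the LMU value, none of the supersets of X is a high-utility itemset. A superset of an itemset is any set that contains all the data items of the itemset; for itemset A, for example, its supersets are all the tree nodes in the MIU tree above that contain A, not just the child nodes of itemset A.
Pruning strategy 2: when the MIU tree is traversed depth-first, if according to the utility lists the sum of the utility value and the remaining utility value of an itemset X is less than the itemset minimum utility threshold of X, none of the extension nodes of X (that is, its descendant nodes) is a high-utility itemset, because their actual utility values are all less than MIU(X).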
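A compact depth-first miner that applies pruning strategy 2 is sketched below. It is not Algorithm 1 or Algorithm 2, which survive here only as figure placeholders: pruning strategy 1 and the EUCS check are left out for brevity, and all function and variable names are illustrative. The data are Tables 4, 5 and 6.

```python
DB = {
    "T1": {"A": 1, "C": 2, "D": 3}, "T2": {"A": 2, "D": 1, "E": 2},
    "T3": {"B": 3, "C": 5}, "T4": {"A": 1, "C": 3, "D": 1, "E": 2},
    "T5": {"B": 1, "D": 3, "E": 2}, "T6": {"B": 2, "D": 2},
    "T7": {"B": 3, "C": 2, "D": 1, "E": 1}, "T8": {"A": 2, "C": 3},
    "T9": {"C": 2, "D": 2, "E": 1}, "T10": {"A": 2, "C": 2, "D": 1},
}
PROFIT = {"A": 6, "B": 12, "C": 1, "D": 9, "E": 3}
MMU = {"A": 56, "B": 65, "C": 53, "D": 50, "E": 70}


def construct(prefix_ul, xa_ul, xb_ul):
    """Join two utility lists of (tid, iu, ru) tuples sharing a prefix."""
    xb = {tid: (iu, ru) for tid, iu, ru in xb_ul}
    prefix = {tid: iu for tid, iu, _ in (prefix_ul or [])}
    joined = []
    for tid, iu_a, _ in xa_ul:
        if tid in xb:
            iu_b, ru_b = xb[tid]
            joined.append((tid, iu_a + iu_b - prefix.get(tid, 0), ru_b))
    return joined


def mine(db, profit, mmu):
    order = sorted(mmu, key=mmu.get)          # ascending threshold order
    rank = {item: i for i, item in enumerate(order)}
    uls = {item: [] for item in order}
    for tid, tx in db.items():                # single database scan
        ordered = sorted(tx, key=rank.get)
        rest = sum(tx[i] * profit[i] for i in ordered)
        for item in ordered:
            iu = tx[item] * profit[item]
            rest -= iu
            uls[item].append((tid, iu, rest))
    found = {}

    def search(prefix_ul, nodes):
        for idx, (x, x_ul) in enumerate(nodes):
            miu = min(mmu[i] for i in x)
            total_iu = sum(iu for _, iu, _ in x_ul)
            total_ru = sum(ru for _, _, ru in x_ul)
            if total_iu >= miu:
                found[x] = total_iu
            if total_iu + total_ru < miu:     # pruning strategy 2
                continue
            kids = []
            for y, y_ul in nodes[idx + 1:]:
                ul = construct(prefix_ul, x_ul, y_ul)
                if ul:
                    kids.append((x + y[-1:], ul))
            search(x_ul, kids)

    search(None, [((item,), uls[item]) for item in order])
    return found


result = {"".join(sorted(k)): v for k, v in mine(DB, PROFIT, MMU).items()}
```

Because extensions only add items with larger thresholds, every descendant of X has the same itemset minimum utility threshold as X, which is why comparing the utility plus remaining utility of X with MIU(X) is a safe cut.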
在本申请实施例中,还可通过对生成的各项集的效用列表中,没有前途的项集的效用列表进行过滤,并根据剩余的有前途的项集的效用列表生成对应扩展集的效用列表,使得数据挖掘过程中只需要扫描一次数据库,并生成第一层级的各项集的效用列表,并在需要时,根据该第一层级的各项集的效用列表生成后续其他项集的效用列表,不仅减少了扫描数据库的次数,而且通过缩小所要挖掘数据的范围,提高了挖掘的速度,节约了计算资源。
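This embodiment of the application also proposes two properties: the global downward closure property (GDC property) and the conditional downward closure property (CDC property). The utility list is used to check whether the corresponding itemset is promising, and unpromising itemsets are filtered out, which reduces the number of utility lists generated afterwards, saves computing resources and speeds up mining.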
可选地,本申请实施例中,在通过深度优先搜索构建各项集对应的效用列表的过程中,还可以利用EUCP(Estimated Utility Co-occurrence Pruning)技术,通过在第二次扫描事务数据库时构建的估计效用共现结构表(EUCS表)提高处理效率;EUCS表中包含所述k-项集与所述k-项集对应的事务加权效用上限,k≥2,即EUCS表中可以包含不小于第二层级的各层级的项集与项集对应的事务加权效用上限;所述事务加权效用上限是指包含所述k-项集的所述事务对应的所述事务效用上限之和,所述事务效用上限指所述事务中所述数据项的效用之和。
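For example, the EUCS built from the example database is shown in Table 8 below. The TWU (transaction-weighted utility) value of itemset BE is the sum of the transaction utility values of the transactions that contain BE, namely T5 and T7: 45+50=95.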
表8 EUCS表
数据项 A B C D E
B 0 - - - -
C 97 91 - - -
D 109 137 155 - -
E 51 95 97 169 -
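According to the EUCS table, the k-itemsets whose transaction-weighted utility upper bound is less than the minimum utility threshold, together with their supersets, are filtered out, so the generation and checking of their extensions can be skipped entirely. This greatly speeds up mining while preserving the completeness and accuracy of the mining result. A superset of an itemset is any set that contains all the data items of the itemset; for itemset A, for example, its supersets are all the tree nodes in the MIU tree that contain itemset A, not just the child nodes of itemset A.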
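For pairs of data items, the EUCS can be sketched as a dictionary keyed by item pair. This is an illustrative construction from Tables 4 and 5, not the applicant's data structure; the function name and the choice of a sparse dictionary (pairs that never co-occur are simply absent) are assumptions.

```python
from itertools import combinations

DB = {
    "T1": {"A": 1, "C": 2, "D": 3},
    "T2": {"A": 2, "D": 1, "E": 2},
    "T3": {"B": 3, "C": 5},
    "T4": {"A": 1, "C": 3, "D": 1, "E": 2},
    "T5": {"B": 1, "D": 3, "E": 2},
    "T6": {"B": 2, "D": 2},
    "T7": {"B": 3, "C": 2, "D": 1, "E": 1},
    "T8": {"A": 2, "C": 3},
    "T9": {"C": 2, "D": 2, "E": 1},
    "T10": {"A": 2, "C": 2, "D": 1},
}
PROFIT = {"A": 6, "B": 12, "C": 1, "D": 9, "E": 3}


def build_eucs(db, profit):
    """Map each co-occurring item pair to the summed utility of the
    transactions that contain both items (its TWU)."""
    eucs = {}
    for tx in db.values():
        tu = sum(qty * profit[item] for item, qty in tx.items())
        for pair in combinations(sorted(tx), 2):
            eucs[pair] = eucs.get(pair, 0) + tu
    return eucs


eucs = build_eucs(DB, PROFIT)


def may_extend(pair, threshold):
    """EUCP check: a pair below the threshold cannot lead to a high-utility itemset."""
    return eucs.get(tuple(sorted(pair)), 0) >= threshold
```

Here `eucs[("B", "E")]` is 95, the sum of the transaction utilities of T5 and T7, and a pair such as A and B that never co-occurs is treated as having a bound of 0.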
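Based on the above properties, this embodiment can also draw the following inference: if an itemset is an HTWUI (high transaction-weighted utility itemset), any sub-itemset of it (a sub-itemset being one whose data items are all contained in the itemset) is also an HTWUI; if an itemset is not an HTWUI, none of its supersets is an HTWUI.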
因此,本申请实施例在得到包含一个数据项的项集后(即得到1-项集),可按照最低效用阈值的大小进行升序排序,得到排序后的1-项集。比如,根据表4得到1-项集候选项集中包括数据项A、B、C、D和E的最低效用阈值后,可按照升序得到排序后的1-项集D、C、A、B、E。
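Then, from the sorted 1-itemsets, 2-itemsets are generated by self-join; the data items in each 2-itemset are therefore in ascending order of the minimum utility thresholds of the 1-itemsets in the MMU table. Generating 2-itemsets from 1-itemsets by self-join means combining a given data item with each data item ranked to its right. For example, if the sorted 1-itemsets are D, C, A, B and E, the extensions of itemset D generated by self-join are the 2-itemsets DC, DA, DB and DE.
The TWU of each 2-itemset generated by the self-join is calculated and, for each 2-itemset, it is checked whether the sum of the itemset's utility value and remaining utility value is not less than the itemset's minimum utility threshold. If so, the depth-first search continues; if not, the 2-itemset and its supersets are determined not to be HTWUIs and the 2-itemset is filtered out. The other k-itemsets (k≥3) are handled in the same way, and finally the RUP algorithm returns the final complete set of recently valid high-utility itemsets.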
需要说明的是,本文中项集的扩展集合均是指该项集与其排序后右边的各个项集自连接组合后生成的项集,而超集是传统意义上的包含该项集的所有数据项的集合。
在本申请实施例中,确定出高效用项集后,可以在对用户进行内容推荐时,推荐高效用项集。
本申请实施例提供的技术方案在处理日常应用中常见的交易型等事务数据库,通过引入MMU表,根据该MMU表确定各个项集各自对应的MIU,并将项集的项集效用值与对应的MIU进行比较,从而确定该项集是否为HUI;解决了现有的基于HUIM的算法中,都是将项集的项集效用值是否大于唯一的最低效用阈值作为衡量标准,导致挖掘出的HUI不准确的问题;达到了根据不同项集制定不同的HUI衡量标准,从而使挖掘出的HUI更准确、更可信、更有意义。
本申请实施例还提供了高效用项集挖掘装置,下文描述的高效用项集挖掘装置可与上文描述的高效用项集挖掘方法相互对应参照。
该装置的结构示意图如图17所示,该装置包括:
项集效用值确定模块1700,用于确定事务数据库中各项集对应的项集效用值;一个项集对应的项集效用值表示的是,该项集在该项集对应的各目标事务中的效用值的加和,一个项集的目标事务为包含该项集所有数据项的事务;一个项集在目标事务中的效用值表示的是,该项集的各数据项在目标事务中的效用值的加和。
项集最低效用阈值确定模块1710,用于根据预定义的最低效用阈值表,确定各项集对应的项集最低效用阈值;预定义的最低效用阈值表记录有各数据项对应的最低效用阈值,一个项集对应的项集最低效用阈值表示的是,该项集包含的数据项所对应的最低效用阈值中的最小最低效用阈值。
高效用项集确定模块1720,用于将各项集的项集效用值与对应的项集最低效用阈值进行比对,根据比对结果确定高效用项集,其中,高效用项集的项集效用值不小于对应的项集最低效用阈值。
可选地,图18示出了项集效用值确定模块1700的可选结构,如图18所示,项集效用值确定模块1700可以包括:
效用列表构建单元1800,用于根据各数据项在各事务对应的外部效用值,和预定义的最低效用阈值表中记录的各数据项的内部效用值,以递归方式构建各项集对应的效用列表;其中,一个项集对应的效用列表记录有该项集对应的各目标事务的事务编号,该项集在各目标事务对应的效用值,及该项集在各目标事务中的剩余效用值;一个项集在一个事务中的剩余效用值表示的是,一个事务中的数据项以最低效用阈值从小到大排序,并在该事务中除去该项集所包含的数据项后,排序在该事务右边的数据项的效用值的总和。
项集效用值计算单元1810,用于根据各项集对应的效用列表,计算出各项集的项集效用值。
可选地,效用列表构建单元1800以递归方式构建各项集对应的效用列表时,具体可用于,分层级以递归方式构建各项集对应的效用列表,一个项集所处于的层级序数与该项集所包含的数据项的数量相对应;且下一层级的项集对应的效用列表,通过至少两个能够组合成该项集的高层级项集的效用列表构 建。
可选地,图19示出了效用列表构建单元1800的可选结构,如图19所示,效用列表构建单元1800可以包括:
MIU树构建子单元1900,用于构建枚举的最低效用阈值MIU树,所述MIU树包含有分层级的项集,一个项集在MIU树中所处于的层级序数与该项集所包含的数据项的数量相对应,且各层级的项集按照最低效用阈值从小到大的顺序排序。
效用列表构建执行子单元1910,用于基于各数据项在各事务对应的外部效用值,和各数据项的内部效用值,构建出与MIU树相结合的各项集对应的效用列表,且下一层级的项集对应的效用列表,通过至少两个能够组合成该项集的高层级项集的效用列表构建。
其中,MIU树构建子单元1900具体可用于,确定事务数据库中包含一个数据项的各项集,并将所确定的各项集按照最低效用阈值从小到大的顺序排序在MIU树的第一层级,构建出位于MIU树第一层级的项集;以深度优先搜索的方式,依序从MIU树第一层级的各项集出发,构建出分层级的项集,并使得一个项集在MIU树中所处于的层级序数与该项集所包含的数据项的数量相对应,且各层级的项集按照最低效用阈值从小到大的顺序排序,形成MIU树。
可选地,效用列表构建执行子单元1910具体可以用于:在构建第二层级的各项集的效用列表时,对于第二层级的各项集,确定第一层级中能够组合成该第二层级的项集的两个项集;将该两个项集共同对应的目标事务,作为该第二层级的项集所对应的目标事务;将该两个项集在一共同对应的目标事务中的效用值的加和,作为该第二层级的项集在该目标事务中的效用值;将该两个项集中排序在后的项集在一共同对应的目标事务中的剩余效用值,作为该第二层级的项集在该目标事务中的剩余效用值。
可选地,效用列表构建执行子单元1910具体还可以用于:
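when building the utility list of an itemset on level three or deeper, determine the two previous-level itemsets that combine into the itemset; take the target transactions common to the two itemsets as the target transactions of the itemset; for each common target transaction, take the sum of the utility values of the two itemsets in that transaction, minus the utility value of the itemset's prefix data item in that transaction, as the utility value of the itemset in that transaction; and take the remaining utility value of the later-ranked of the two itemsets in that transaction as the remaining utility value of the itemset in that transaction.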
可选地,MIU树构建子单元1900在确定事务数据库中包含一个数据项的各项集,并将所确定的各项集按照最低效用阈值从小到大的顺序排序在MIU树的第一层级,构建出位于MIU树第一层级的项集时具体可用于:
计算包含一个数据项的各项集的项集的事务加权效用值,根据包含一个数据项的各项集的最低效用阈值,确定包含一个数据项的各项集中的高事务加权效用项集;将高事务加权效用项集按照最低效用阈值从小到大进行排序。
相应的,效用列表构建执行子单元1910在构建出与MIU树相结合的各项集对应的效用列表时具体可用于:构建出包含一个数据项的各项集的效用列表,递归地根据包含一个数据项的各项集的效用列表生成一系列的后续的效用列表,形成各项集对应的效用列表。
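In this embodiment of the application, the high-utility itemset mining apparatus may further be configured to: when traversing the MIU tree depth-first, determine that none of the supersets of an itemset is a high-utility itemset if the transaction-weighted utility value of the itemset is less than the least minimum utility threshold (LMU); and/or, when traversing the MIU tree depth-first, determine that none of the extension nodes of an itemset in the MIU tree is a high-utility itemset if the sum of the utility value and the remaining utility value of the itemset is less than the itemset minimum utility threshold of the itemset.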
可选地,在本申请实施例中,高效用项集挖掘装置还可用于:获取EUCS表,所述EUCS表包含不小于第二层级的各层级的项集与项集对应的事务加权效用上限;根据所述EUCS表,对事务加权效用上限小于最低效用阈值的不小于第二层级的项集及其超集进行过滤。
可选地,在本申请实施例中,高效用项集挖掘装置还可用于:如果一个项集是高事务加权效用项集,则确定该项集的任一子项集也是高事务加权效用项集,子项集包含该项集的所有数据项;如果一个项集不是高事务加权效用项集,则确定该项集的任一超集均不是高事务加权效用项集。
本申请实施例还提供一种数据处理设备,该数据处理设备可以包括上述所述的高效用项集挖掘装置。具体的,该数据处理设备的硬件结构框图,如图20所示,该数据处理设备可以包括:处理器1,通信接口2,存储器3和通信总线4。
其中,处理器1、通信接口2、存储器3通过通信总线4完成相互间的通信。
可选地,通信接口2可以为通信模块的接口,如GSM模块的接口。
处理器1,用于执行程序。
存储器3,用于存放程序。
程序可以包括程序代码,所述程序代码包括计算机操作指令。
处理器1可能是一个中央处理器CPU,或者是特定集成电路ASIC(Application Specific Integrated Circuit),或者是被配置成实施本申请实施例的一个或多个集成电路。
存储器3可能包含高速RAM存储器,也可能还包括非易失性存储器(non-volatile memory),例如至少一个磁盘存储器。
其中,程序可具体用于:确定事务数据库中各项集对应的项集效用值;一个项集对应的项集效用值表示的是,该项集在该项集对应的各目标事务中的效用值的加和,一个项集的目标事务为包含该项集所有数据项的事务;一个项集在目标事务中的效用值表示的是,该项集的各数据项在目标事务中的效用值的加和;
根据预定义的最低效用阈值表,确定各项集对应的项集最低效用阈值;预定义的最低效用阈值表记录有各数据项对应的最低效用阈值,一个项集对应的项集最低效用阈值表示的是,该项集包含的数据项所对应的最低效用阈值中的最小最低效用阈值。
将各项集的项集效用值与对应的项集最低效用阈值进行比对,根据比对结果确定高效用项集,其中,高效用项集的项集效用值不小于对应的项集最低效用阈值。
本申请实施例还提供一种高效用项集挖掘设备,该设备包括:
处理器以及存储器;
存储器,用于存储计算机程序;
处理器,用于读取所述计算机程序以执行可执行指令,该可执行指令用于实现上述高效用项集挖掘方法。本申请实施例还提供一种存储介质,该存储介质用于存储程序代码,程序代码用于执行上述高效用项集挖掘方法。
本申请实施例还提供一种包括指令的计算机程序产品,当其在计算机上运行时,使得计算机执行上述高效用项集挖掘方法。
还需要说明的是,在本文中,“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、物品或者设备不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、物品或者设备所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括所述要素的过程、方法、物品或者设备中还存在另外的相同要素。本说明书中各个实施例采用递进的方式描述,每个实施例重点说明的都是与其他实施例的不同之处,各个实施例之间相同相似部分互相参见即可。对于实施例公开的装置而言,由于其与实施例公开的方法相对应,所以描述的比较简单,相关之处参见方法部分说明即可。
对所公开的实施例的上述说明,使本领域专业技术人员能够实现或使用本发明。对这些实施例的多种修改对本领域的专业技术人员来说将是显而易见的,本文中所定义的一般原理可以在不脱离本发明的精神或范围的情况下,在其它实施例中实现。因此,本发明将不会被限制于本文所示的这些实施例,而是要符合与本文所公开的原理和新颖特点相一致的最宽的范围。

Claims (36)

  1. 一种模式挖掘方法,应用于服务器,包括:
    根据事务数据库中包含的各事务,获取满足设定条件的候选模式集合,其中,每个事务包括至少一个项目;所述候选模式集合中的每个候选模式包括至少一个项目集中的项目;所述项目集是根据每个事务中的项目生成的集合;
    针对所述候选模式集合中每一候选模式,计算所述候选模式在每一事务中的效用值;
    确定所述效用值达到设定的效用阀值的目标事务,并根据各所述目标事务的时间属性,确定所述候选模式的周期值;
    若所述候选模式的周期值小于等于设定的周期阀值,则将所述候选模式确定为挖掘结果。
  2. 根据权利要求1所述的模式挖掘方法,所述根据各所述目标事务的时间属性,确定所述候选模式的周期值,包括:
    根据各所述目标事务的时间属性,计算相邻两个目标事务的时间差值;
    将各所述时间差值中最大时间差值确定为所述候选模式的周期值。
  3. 根据权利要求2所述的模式挖掘方法,所述根据各所述目标事务的时间属性,计算相邻两个目标事务的时间差值,包括:
    针对事务数据库中顺序排序的各目标事务,若所述目标事务之前不存在任何其它目标事务,则计算所述目标事务与所述事务数据库中首个事务的时间差值;
    若所述目标事务之后不存在任何其它目标事务,则计算所述事务数据库中末尾事务与所述目标事务的时间差值;
    若所述目标事务之前存在其它目标事务,则计算所述目标事务与前一相邻目标事务的时间差值。
  4. 根据权利要求1所述的模式挖掘方法,所述根据事务数据库中包含的各事务,获取满足设定条件的候选模式集合,包括:
    扫描所述事务数据库中的各事务,获取在各事务中效用值的和值达到设定的扩展效用阀值的项目,由获取的项目组成第1层候选模式集合HTWUSPI1,其中,所述扩展效用阀值大于等于所述效用阀值;同时,记录所述项目集中各项目所在事务,以及各事务的效用值;
    利用Apriori_gen函数以及所述HTWUSPI1,逐层产生第k层候选模式集合HTWUSPIk,直至HTWUSPIk+1为空,由HTWUSPI1至HTWUSPIk组成最终的候选模式集合,其中,HTWUSPIk的产生过程包括:
    对HTWUSPIk-1中的候选模式两两组合,得到若干候选模式对;
    在所述若干候选模式对中,选取包含k-2个相同项目的候选模式对;
    由选取的候选模式对进行合并,得到初步候选模式;
    针对每一初步候选模式,确定所述初步候选模式所包含的每一项目所在的事务,并确定各项目所在事务的交集,将交集事务确定为所述初步候选模式所在的事务;
    至少在所述初步候选模式所在的各事务的效用值的和值达到所述扩展效用阀值时,将所述初步候选模式加入HTWUSPIk
  5. 根据权利要求4所述的模式挖掘方法,所述至少在所述初步候选模式所在的各事务的效用值的和值达到所述扩展效用阀值时,将所述初步候选模式加入HTWUSPIk,包括:
    计算所述初步候选模式所在的各事务的效用值的和值;
    根据所述初步候选模式所在的各事务的时间属性,确定所述初步候选模式的周期值;
    在所述初步候选模式所在的各事务的效用值的和值达到所述扩展效用阀值,且所述初步候选模式的周期值小于等于设定的周期阀值时,将所述初步候选模式加入HTWUSPIk
  6. 根据权利要求4所述的模式挖掘方法,在所述记录所述项目集中各项目所在事务,以及各事务的效用值之后,还包括:
    确定事务的效用值小于所述效用阀值的低效用事务,并在记录的各项目所在事务中删除所述低效用事务。
  7. 一种模式挖掘装置,包括:
    处理器和存储器;
    所述存储器用于存储计算机程序;
    所述处理器用于读取所述计算机程序以执行以下可执行指令:
    根据事务数据库中包含的各事务,获取满足设定条件的候选模式集合,其中,每个事务包括至少一个项目;所述候选模式集合中的每个候选模式包括至少一个项目集中的项目;所述项目集是根据每个事务中的项目生成的集合;
    针对所述候选模式集合中每一候选模式,计算所述候选模式在每一事务中的效用值;
    确定所述效用值达到设定的效用阀值的目标事务,并根据各所述目标事务的时间属性,确定所述候选模式的周期值;
    若所述候选模式的周期值小于等于设定的周期阀值,则将所述候选模式确定为挖掘结果。
  8. 根据权利要求7所述的模式挖掘装置,所述处理器执行所述根据各所述目标事务的时间属性,确定所述候选模式的周期值的可执行指令,包括:
    根据各所述目标事务的时间属性,计算相邻两个目标事务的时间差值;
    将各所述时间差值中最大时间差值确定为所述候选模式的周期值。
  9. 根据权利要求8所述的模式挖掘装置,所述处理器执行所述根据各所述目标事务的时间属性,计算相邻两个目标事务的时间差值的可执行指令,包括:
    针对事务数据库中顺序排序的各目标事务,若所述目标事务之前不存在任何其它目标事务,则计算所述目标事务与所述事务数据库中首个事务的时间差值;
    若所述目标事务之后不存在任何其它目标事务,则计算所述事务数据库中末尾事务与所述目标事务的时间差值;
    若所述目标事务之前存在其它目标事务,则计算所述目标事务与前一相邻目标事务的时间差值。
  10. 根据权利要求7所述的模式挖掘装置,所述处理器执行所述根据事务数据库中包含的各事务,获取满足设定条件的候选模式集合的可执行指令,包括:
    扫描所述事务数据库中的各事务,获取在各事务中效用值的和值达到设定的扩展效用阀值的项目,由获取的项目组成第1层候选模式集合HTWUSPI1,其中,所述扩展效用阀值大于等于所述效用阀值;同时,记录所述项目集中各项目所在事务,以及各事务的效用值;
    利用Apriori_gen函数以及所述HTWUSPI1,逐层产生第k层候选模式集合HTWUSPIk,直至HTWUSPIk+1为空,由HTWUSPI1至HTWUSPIk组成最终的候选模式集合,其中,HTWUSPIk的产生过程包括:
    对HTWUSPIk-1中的候选模式两两组合,得到若干候选模式对;
    在所述若干候选模式对中,选取包含k-2个相同项目的候选模式对;
    由选取的候选模式对进行合并,得到初步候选模式;
    针对每一初步候选模式,确定所述初步候选模式所包含的每一项目所在的事务,并确定各项目所在事务的交集,将交集事务确定为所述初步候选模式所在的事务;
    至少在所述初步候选模式所在的各事务的效用值的和值达到所述扩展效用阀值时,将所述初步候选模式加入HTWUSPIk
  11. 根据权利要求10所述的模式挖掘装置,所述处理器执行所述至少在所述初步候选模式所在的各事务的效用值的和值达到所述扩展效用阀值时,将所述初步候选模式加入HTWUSPIk的可执行指令,包括:
    计算所述初步候选模式所在的各事务的效用值的和值;
    根据所述初步候选模式所在的各事务的时间属性,确定所述初步候选模式的周期值;
    在所述初步候选模式所在的各事务的效用值的和值达到所述扩展效用阀值,且所述初步候选模式的周期值小于等于设定的周期阀值时,将所述初步候选模式加入HTWUSPIk
  12. 根据权利要求10所述的模式挖掘装置,在所述记录所述项目集中各项目所在事务,以及各事务的效用值之后,所述处理器执行的可执行指令还包括:
    确定事务的效用值小于所述效用阀值的低效用事务,并在记录的各项目所在事务中删除所述低效用事务。
  13. 一种存储介质,其特征在于,所述存储介质用于存储程序代码,所述程序代码用于执行权利要求1-6任意一项所述的模式挖掘方法。
  14. 一种高效用项集挖掘方法,应用于服务器,包括:
    确定事务数据库中各项集对应的项集效用值;一个项集对应的项集效用值表示的是,该项集在该项集对应的各目标事务中的效用值的加和;一个项集的目标事务为包含该项集所有数据项的事务;一个项集在目标事务中的效用值表示的是,该项集的各数据项在目标事务中的效用值的加和;
    根据预定义的最低效用阈值表,确定各项集对应的项集最低效用阈值;预定义的最低效用阈值表记录有各数据项对应的最低效用阈值,一个项集对应的项集最低效用阈值表示的是,该项集包含的数据项所对应的最低效用阈值中的最小最低效用阈值;
    将各项集的项集效用值与对应的项集最低效用阈值进行比对,根据比对结果确定高效用项集,其中,高效用项集的项集效用值不小于对应的项集最低效用阈值。
  15. 根据权利要求14所述的高效用项集挖掘方法,所述确定事务数据库中各项集对应的项集效用值包括:
    根据各数据项在各事务对应的外部效用值,和预定义的最低效用阈值表中记录的各数据项的内部效用值,以递归方式构建各项集对应的效用列表;其中,一个项集对应的效用列表记录有该项集对应的各目标事务的事务编号,该项集在各目标事务对应的效用值,及该项集在各目标事务中的剩余效用值;一个项集在一个事务中的剩余效用值表示的是,一个事务中的数据项以最低效用阈值从小到大排序,并在该事务中除去该项集所包含的数据项后,排序在该事务右边的数据项的效用值的总和;
    根据各项集对应的效用列表,计算出各项集的项集效用值。
  16. 根据权利要求15所述的高效用项集挖掘方法,所述以递归方式构建各项集对应的效用列表包括:
    分层级以递归方式构建各项集对应的效用列表,一个项集所处于的层级序数与该项集所包含的数据项的数量相对应;且下一层级的项集对应的效用列表,通过至少两个能够组合成该项集的高层级项集的效用列表构建。
  17. 根据权利要求16所述的高效用项集挖掘方法,所述分层级以递归方式构建各项集对应的效用列表包括:
    构建枚举的最低效用阈值MIU树,所述MIU树包含有分层级的项集,一个项集在MIU树中所处于的层级序数与该项集所包含的数据项的数量相对应,且各层级的项集按照最低效用阈值从小到大的顺序排序;
    基于各数据项在各事务对应的外部效用值,和各数据项的内部效用值,构建出与MIU树相结合的各项集对应的效用列表,且下一层级的项集对应的效用列表,通过至少两个能够组合成该项集的高层级项集的效用列表构建。
  18. 根据权利要求17所述的高效用项集挖掘方法,所述构建枚举的MIU树包括:
    确定事务数据库中包含一个数据项的各项集,并将所确定的各项集按照最低效用阈值从小到大的顺序排序在MIU树的第一层级,构建出位于MIU树第一层级的项集;
    以深度优先搜索的方式,依序从MIU树第一层级的各项集出发,构建出分层级的项集,并使得一个项集在MIU树中所处于的层级序数与该项集所包含的数据项的数量相对应,且各层级的项集按照最低效用阈值从小到大的顺序排序,形成MIU树。
  19. 根据权利要求16-18任一项所述的高效用项集挖掘方法,所述下一层级的项集对应的效用列表,通过至少两个能够组合成该项集的高层级项集的效用列表构建包括:
    在构建第二层级的各项集的效用列表时,对于第二层级的各项集,确定第一层级中能够组合成该第二层级的项集的两个项集;
    将该两个项集共同对应的目标事务,作为该第二层级的项集所对应的目标事务;
    将该两个项集在一共同对应的目标事务中的效用值的加和,作为该第二层级的项集在该目标事务中的效用值;
    将该两个项集中排序在后的项集在一共同对应的目标事务中的剩余效用值,作为该第二层级的项集在该目标事务中的剩余效用值。
  20. 根据权利要求16-18任一项所述的高效用项集挖掘方法,所述下一层级的项集对应的效用列表,通过至少两个能够组合成该项集的高层级项集的效用列表构建包括:
    在构建层级不小于三的项集的效用列表时,对于层级不小于三的各项集,确定上一层级中能够组合成该项集的两个项集;
    将该两个项集共同对应的目标事务,作为该层级不小于三的项集所对应的目标事务;
    将该两个项集在一共同对应的目标事务中的效用值的加和,减去该层级不小于三的项集的前缀数据项在该目标事务中的效用值,将得到结果作为该层级不小于三的项集在该目标事务中的效用值;
    将该两个项集中排序在后的项集在一共同对应的目标事务中的剩余效用值,作为该第二层级的项集在该目标事务中的剩余效用值。
  21. 根据权利要求18所述的高效用项集挖掘方法,所述确定事务数据库中包含一个数据项的各项集,并将所确定的各项集按照最低效用阈值从小到大的顺序排序在MIU树的第一层级,构建出位于MIU树第一层级的项集包括:
    计算包含一个数据项的各项集的项集的事务加权效用值,根据包含一个数据项的各项集的最低效用阈值,确定包含一个数据项的各项集中的高事务加权效用项集;
    将高事务加权效用项集按照最低效用阈值从小到大进行排序;
    所述构建出与MIU树相结合的各项集对应的效用列表包括:
    构建出包含一个数据项的各项集的效用列表,递归地根据包含一个数据项的各项集的效用列表生成一系列的后续的效用列表,形成各项集对应的效用列表。
  22. 根据权利要求18所述的高效用项集挖掘方法,所述方法还包括:
    在以深度优先搜索方式遍历MIU树时,如果一项集的事务加权效用值小于,该项集的最小最低效用阈值时,确定该项集的所有超集均不是高效用项集;
    和/或,在以深度优先搜索方式遍历MIU树时,如果一项集的效用值和剩余效用值的加和,小于该项集的项集最低效用阈值,则确定该项集在MIU树中的所有扩展节点均不是高效用项集。
  23. The high-utility itemset mining method according to any one of claims 14-18, the method further comprising:
    获取EUCS表,所述EUCS表包含不小于第二层级的各层级的项集与项集对应的事务加权效用上限;
    根据所述EUCS表,对事务加权效用上限小于最低效用阈值的不小于第二层级的项集及其超集进行过滤。
  24. The high-utility itemset mining method according to any one of claims 14-18, the method further comprising:
    如果一个项集是高事务加权效用项集,则确定该项集的任一子项集也是高事务加权效用项集,子项集包含该项集的所有数据项;
    如果一个项集不是高事务加权效用项集,则确定该项集的任一超集均不是高事务加权效用项集。
  25. 一种高效用项集挖掘装置,包括:
    处理器和存储器;
    所述存储器用于存储计算机程序;
    所述处理器用于读取所述计算机程序以执行以下可执行指令:
    确定事务数据库中各项集对应的项集效用值;一个项集对应的项集效用值表示的是,该项集在该项集对应的各目标事务中的效用值的加和;一个项集的目标事务为包含该项集所有数据项的事务;一个项集在目标事务中的效用值表示的是,该项集的各数据项在目标事务中的效用值的加和;
    根据预定义的最低效用阈值表,确定各项集对应的项集最低效用阈值;预定义的最低效用阈值表记录有各数据项对应的最低效用阈值,一个项集对应的项集最低效用阈值表示的是,该项集包含的数据项所对应的最低效用阈值中的最小最低效用阈值;
    将各项集的项集效用值与对应的项集最低效用阈值进行比对,根据比对结果确定高效用项集,其中,高效用项集的项集效用值不小于对应的项集最低效用阈值。
  26. 根据权利要求25所述的高效用项集挖掘装置,所述处理器执行所述确定事务数据库中各项集对应的项集效用值的可执行指令,包括:
    根据各数据项在各事务对应的外部效用值,和预定义的最低效用阈值表中记录的各数据项的内部效用值,以递归方式构建各项集对应的效用列表;其中,一个项集对应的效用列表记录有该项集对应的各目标事务的事务编号,该项集在各目标事务对应的效用值,及该项集在各目标事务中的剩余效用值;一个项集在一个事务中的剩余效用值表示的是,一个事务中的数据项以最低效用阈值从小到大排序,并在该事务中除去该项集所包含的数据项后,排序在该事务右边的数据项的效用值的总和;
    根据各项集对应的效用列表,计算出各项集的项集效用值。
  27. 根据权利要求26所述的高效用项集挖掘装置,所述处理器执行所述以递归方式构建各项集对应的效用列表的可执行指令,包括:
    分层级以递归方式构建各项集对应的效用列表,一个项集所处于的层级序数与该项集所包含的数据项的数量相对应;且下一层级的项集对应的效用列表,通过至少两个能够组合成该项集的高层级项集的效用列表构建。
  28. 根据权利要求27所述的高效用项集挖掘装置,所述处理器执行所述分层级以递归方式构建各项集对应的效用列表的可执行指令,包括:
    构建枚举的最低效用阈值MIU树,所述MIU树包含有分层级的项集,一个项集在MIU树中所处于的层级序数与该项集所包含的数据项的数量相对应,且各层级的项集按照最低效用阈值从小到大的顺序排序;
    基于各数据项在各事务对应的外部效用值,和各数据项的内部效用值,构建出与MIU树相结合的各项集对应的效用列表,且下一层级的项集对应的效用列表,通过至少两个能够组合成该项集的高层级项集的效用列表构建。
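  29. The high-utility itemset mining apparatus according to claim 28, wherein the executable instructions executed by the processor for constructing the enumeration MIU tree comprise:
    determining the itemsets in the transaction database that each contain one data item, sorting the determined itemsets at the first level of the MIU tree in ascending order of minimum utility threshold, and constructing the itemsets located at the first level of the MIU tree;
    constructing, in a depth-first search manner and starting in order from each itemset of the first level of the MIU tree, the itemsets arranged in levels, such that the ordinal of the level at which an itemset is located in the MIU tree corresponds to the number of data items contained in the itemset and the itemsets of each level are sorted in ascending order of minimum utility threshold, thereby forming the MIU tree.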
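  30. The high-utility itemset mining apparatus according to any one of claims 27 to 29, wherein the executable instructions executed by the processor for constructing the utility list corresponding to an itemset of a next level from the utility lists of at least two higher-level itemsets that can be combined into that itemset comprise:
    when constructing the utility lists of the itemsets of the second level, determining, for each itemset of the second level, the two itemsets of the first level that can be combined into that second-level itemset;
    taking the target transactions common to the two itemsets as the target transactions corresponding to the second-level itemset;
    taking the sum of the utility values of the two itemsets in a common target transaction as the utility value of the second-level itemset in that target transaction;
    taking the remaining utility value, in a common target transaction, of the later-ordered of the two itemsets as the remaining utility value of the second-level itemset in that target transaction.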
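  31. The high-utility itemset mining apparatus according to any one of claims 27 to 29, wherein the executable instructions executed by the processor for constructing the utility list corresponding to an itemset of a next level from the utility lists of at least two higher-level itemsets that can be combined into that itemset comprise:
    when constructing the utility lists of the itemsets of levels not lower than three, determining, for each itemset of a level not lower than three, the two itemsets of the previous level that can be combined into that itemset;
    taking the target transactions common to the two itemsets as the target transactions corresponding to the itemset of the level not lower than three;
    subtracting, from the sum of the utility values of the two itemsets in a common target transaction, the utility value in that target transaction of the prefix data items of the itemset of the level not lower than three, and taking the result as the utility value of the itemset of the level not lower than three in that target transaction;
    taking the remaining utility value, in a common target transaction, of the later-ordered of the two itemsets as the remaining utility value of the itemset of the level not lower than three in that target transaction.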
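Illustrative note (not part of the claims): the utility-list construction described for the second level and for levels not lower than three can be sketched as a single join function. This is a sketch under assumptions: utility lists are represented as hypothetical dictionaries mapping a transaction id to `(utility, remaining utility)`, the function name `join` and the example lists are illustrative, and the optional `prefix` argument is the utility list of the shared prefix (omitted when building second-level itemsets).

```python
def join(first, second, prefix=None):
    """Combine the utility lists of two itemsets of the same level.

    `first` and `second` map tid -> (utility, remaining_utility); `second`
    is the itemset ordered later. `prefix` is the utility list of the items
    shared by both itemsets, or None when joining single-item itemsets.
    """
    joined = {}
    for tid, (first_utility, _first_remaining) in first.items():
        if tid not in second:
            continue  # only transactions common to both itemsets are kept
        second_utility, second_remaining = second[tid]
        utility = first_utility + second_utility
        if prefix is not None:
            # The shared prefix was counted in both lists; remove one copy.
            utility -= prefix[tid][0]
        # The remaining utility is taken from the later-ordered itemset.
        joined[tid] = (utility, second_remaining)
    return joined


# Hypothetical single-item utility lists, with items ordered b, c, a, d.
B = {1: (2, 6), 3: (4, 11)}
C = {1: (1, 5), 2: (6, 10), 3: (3, 8)}
D = {3: (8, 0)}

BC = join(B, C)               # second level: {b, c}
BD = join(B, D)               # second level: {b, d}
BCD = join(BC, BD, prefix=B)  # third level: {b, c, d}, shared prefix {b}
```

In this example the third-level entry for transaction 3 is 7 + 12 - 4 = 15, with the remaining utility 0 inherited from the later-ordered itemset {b, d}.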
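  32. The high-utility itemset mining apparatus according to claim 29, wherein the executable instructions executed by the processor for determining the itemsets in the transaction database that each contain one data item, sorting the determined itemsets at the first level of the MIU tree in ascending order of minimum utility threshold, and constructing the itemsets located at the first level of the MIU tree comprise:
    calculating the transaction-weighted utility value of each itemset containing one data item, and determining, according to the minimum utility thresholds of the itemsets containing one data item, the high transaction-weighted utility itemsets among the itemsets containing one data item;
    sorting the high transaction-weighted utility itemsets in ascending order of minimum utility threshold;
    and wherein constructing the utility lists, combined with the MIU tree, corresponding to the itemsets comprises:
    constructing the utility list of each itemset containing one data item, and recursively generating a series of subsequent utility lists from the utility lists of the itemsets containing one data item, to form the utility lists corresponding to the itemsets.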
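  33. The high-utility itemset mining apparatus according to claim 29, wherein the executable instructions executed by the processor further comprise:
    when traversing the MIU tree in a depth-first search manner, if the transaction-weighted utility value of an itemset is less than the least minimum utility threshold of the itemset, determining that none of the supersets of the itemset is a high-utility itemset;
    and/or, when traversing the MIU tree in a depth-first search manner, if the sum of the utility value and the remaining utility value of an itemset is less than the itemset minimum utility threshold of the itemset, determining that none of the extension nodes of the itemset in the MIU tree is a high-utility itemset.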
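  34. The high-utility itemset mining apparatus according to any one of claims 25 to 29, wherein the executable instructions executed by the processor further comprise:
    obtaining an EUCS table, the EUCS table containing the itemsets of each level not lower than the second level and the transaction-weighted utility upper bound corresponding to each itemset;
    filtering out, according to the EUCS table, the itemsets of levels not lower than the second level whose transaction-weighted utility upper bound is less than the minimum utility threshold, together with their supersets.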
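  35. The high-utility itemset mining apparatus according to any one of claims 25 to 29, wherein the executable instructions executed by the processor further comprise:
    if an itemset is a high transaction-weighted utility itemset, determining that any sub-itemset of the itemset is also a high transaction-weighted utility itemset, the sub-itemset containing all data items of the itemset;
    if an itemset is not a high transaction-weighted utility itemset, determining that none of the supersets of the itemset is a high transaction-weighted utility itemset.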
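  36. A storage medium, the storage medium being configured to store program code, the program code being used to execute the high-utility itemset mining method according to any one of claims 14 to 24.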
PCT/CN2017/102663 2016-09-27 2017-09-21 Pattern mining method, high-utility itemset mining method, and related device WO2018059298A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US16/022,891 US10776347B2 (en) 2016-09-27 2018-06-29 Pattern mining method, high-utility itemset mining method, and related device

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN201610856770.5A CN107870939B (zh) 2016-09-27 2016-09-27 Pattern mining method and apparatus
CN201610856770.5 2016-09-27
CN201610866557.2 2016-09-28
CN201610866557.2A CN107870956B (zh) 2016-09-28 2016-09-28 High-utility itemset mining method and apparatus, and data processing device

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/022,891 Continuation US10776347B2 (en) 2016-09-27 2018-06-29 Pattern mining method, high-utility itemset mining method, and related device

Publications (1)

Publication Number Publication Date
WO2018059298A1 (zh) 2018-04-05

Family

ID=61763316

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/102663 WO2018059298A1 (zh) 2016-09-27 2017-09-21 Pattern mining method, high-utility itemset mining method, and related device

Country Status (2)

Country Link
US (1) US10776347B2 (zh)
WO (1) WO2018059298A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110955702A (zh) * 2019-11-28 2020-04-03 Jiangnan University Pattern data mining method based on an improved genetic algorithm

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
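CN111241136A (zh) * 2020-01-07 2020-06-05 Guilin University of Electronic Technology Top-k high-utility itemset mining method based on a data buffer pool
CN113377766B (zh) * 2021-05-21 2022-09-13 Harbin Institute of Technology (Shenzhen) Utility-based contrast mining method and apparatus for sequence databases, and computer device
CN115563192B (zh) * 2022-11-22 2023-03-10 Shandong University of Science and Technology Method for mining high-utility periodic frequent patterns applied to purchase patterns
CN117010991B (zh) * 2023-07-31 2024-05-03 Jiangnan University High-profit product combination mining method based on a GPU-parallel improved genetic algorithm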

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120130964A1 (en) * 2010-11-18 2012-05-24 Yen Show-Jane Fast algorithm for mining high utility itemsets
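CN102662948A (zh) * 2012-02-23 2012-09-12 Zhejiang Gongshang University Data mining method for quickly discovering utility patterns
CN105608182A (zh) * 2015-12-23 2016-05-25 一兰云联科技股份有限公司 Utility itemset mining method for uncertain data models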

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090018994A1 (en) * 2007-07-12 2009-01-15 Honeywell International, Inc. Time series data complex query visualization
EP2936341B1 (en) * 2012-12-18 2016-09-14 Telefonaktiebolaget LM Ericsson (publ) Load shedding in a data stream management system
US9098587B2 (en) * 2013-01-15 2015-08-04 Oracle International Corporation Variable duration non-event pattern matching
US9910882B2 (en) * 2014-12-19 2018-03-06 International Business Machines Corporation Isolation anomaly quantification through heuristical pattern detection
US10089334B2 (en) * 2015-03-26 2018-10-02 Ca, Inc. Grouping of database objects
US20180268015A1 (en) * 2015-09-02 2018-09-20 Sasha Sugaberry Method and apparatus for locating errors in documents via database queries, similarity-based information retrieval and modeling the errors for error resolution
US10509780B2 (en) * 2016-06-03 2019-12-17 Dell Products L.P. Maintaining I/O transaction metadata in log-with-index structure

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
PHILIPPE FOURNIER-VIGER ET AL.: "PHM: Mining Periodic High-Utility Itemsets", ICDM 2016: ADVANCES IN DATA MINING. APPLICATIONS AND THEORETICAL ASPECTS, vol. 9728 Chap.6, no. 558, 28 June 2016 (2016-06-28), pages 64 - 79, XP047347933, DOI: 10.1007/978-3-319-41561-1_6 *
PHILIPPE FOURNIER-VIGER: "FHM:Faster High-Utility Itemset Mining Using Estimated Utility Co-occurrence Pruning", ISMIS 2014, vol. 8502 Chap.9, no. 558, 25 June 2014 (2014-06-25) - 31 December 2014 (2014-12-31), pages 83 - 92, XP047294870, DOI: 10.1007/978-3-319-08326-1_9 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
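CN110955702A (zh) * 2019-11-28 2020-04-03 Jiangnan University Pattern data mining method based on an improved genetic algorithm
CN110955702B (zh) * 2019-11-28 2024-03-29 Jiangnan University Pattern data mining method based on an improved genetic algorithm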

Also Published As

Publication number Publication date
US20180307722A1 (en) 2018-10-25
US10776347B2 (en) 2020-09-15

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17854748

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17854748

Country of ref document: EP

Kind code of ref document: A1