CN109408563B

CN109408563B - High average utility item set mining method and device and computer equipment

Info

Publication number: CN109408563B
Application number: CN201811320172.1A
Authority: CN
Inventors: 林浚玮; 张玉龙; 刘婷婷; 陈伟
Original assignee: Tencent Technology Shenzhen Co Ltd; Shenzhen Graduate School Harbin Institute of Technology
Current assignee: Tencent Technology Shenzhen Co Ltd; Shenzhen Graduate School Harbin Institute of Technology
Priority date: 2018-11-07
Filing date: 2018-11-07
Publication date: 2021-06-22
Anticipated expiration: 2038-11-07
Also published as: CN109408563A

Abstract

In the method, if the accumulated total utility value of all transactions inserted into a database is smaller than a utility safety value, an average utility list is determined for each 1-item set contained in the not-yet-mined data set in the current database, and the stored average utility list of at least one 1-item set that satisfies the item set extension condition in the original database is acquired. The high average utility item sets in the database are then determined according to the average utility list of each 1-item set in the data set and the average utility list of the at least one 1-item set that satisfies the item set extension condition in the original database. The scheme of the application can reduce the computing resources consumed in mining data from the database.

Description

High average utility item set mining method and device and computer equipment

Technical Field

The application relates to the field of data processing, in particular to a high average utility item set mining method and device and computer equipment.

Background

High utility item set mining is widely applied in various fields, for example hot high-frequency word mining in commercial search, or recommendation of content of interest (e.g., web pages, news, merchandise). However, because high utility item set mining does not take into account the effect of item set length on utility values, high average utility item set mining has been proposed.

High average utility item set mining mines, from a database, the item sets that have a high average utility value. At present, high average utility item sets can be mined from a database only by scanning the whole database, that is, by sequentially querying and processing the data items of each transaction recorded in the database. However, new data is often added to the database, and once new data appears, the high average utility item sets mined from the database may change. Therefore, whenever data is added to the database, the updated database has to be rescanned to mine the high average utility item sets from it; because the data volume of the database is large, scanning the entire database consumes a large amount of computing resources.

Disclosure of Invention

In view of this, the present application provides a method, an apparatus, and a computer device for mining a high average utility item set, so as to reduce the computing resources consumed for mining data from a database when new data exists in the database.

To achieve the above object, in one aspect, the present application provides the following solutions:

A high average utility item set mining method, comprising the following steps:

determining a total utility value of all accumulated inserted transactions in the database before the current moment according to at least one transaction contained in a data set which is inserted into the database and is not mined;

obtaining a utility safety value corresponding to an initial total utility value of the database, the initial total utility value being the total utility value of the database before all of the transactions were inserted into the database;

when the total utility value of all the transactions is smaller than the utility safety value, determining an average utility list of each 1-item set in the data set, wherein at least each target transaction containing the 1-item set and the utility value of the 1-item set in each target transaction are recorded in the average utility list of each 1-item set;

acquiring a stored average utility list of at least one 1-item set that satisfies an item set extension condition in an original database, wherein the original database is the database before the data set is inserted; a 1-item set that satisfies the item set extension condition is a 1-item set that is determined when high average utility item sets are mined from the original database and whose average utility boundary is greater than the low average utility threshold corresponding to the original database; and the average utility boundary of an item set is the sum of the maximum utility values of all transactions containing the item set;

and determining a high average utility item set in the database according to the average utility list of each 1-item set in the data set and the average utility list of at least one 1-item set with item set extension conditions in the original database.
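As an illustration of the average utility list mentioned in the steps above, the following minimal sketch builds, for each 1-item set, a list of the transactions that contain it together with its utility value in each of them. It uses the example transactions and profit table given later in the description (Tables 1 and 2), with T6 read as b:7, d:1, f:2, the reading that matches the utility value of 20 stated for T6. The function name `build_average_utility_lists` and the tuple layout are assumptions made for illustration; the list in the application may record further fields.

```python
from collections import defaultdict

# Example data from Tables 1 and 2 (purchase quantities and unit profits).
# T6 is read as b:7, d:1, f:2, which matches its stated utility value of 20.
PROFIT = {"a": 4, "b": 1, "c": 5, "d": 7, "e": 6, "f": 3}
DB = {
    "T1": {"a": 1, "b": 5, "c": 2, "d": 3, "f": 6},
    "T2": {"b": 2, "c": 3, "e": 2},
    "T3": {"a": 1, "b": 2, "d": 1, "f": 1},
    "T4": {"a": 1, "c": 3, "d": 2},
    "T5": {"a": 1, "e": 1},
    "T6": {"b": 7, "d": 1, "f": 2},
    "T7": {"a": 3, "b": 9, "c": 3, "d": 1},
}


def build_average_utility_lists(db, profit):
    """Return {item: [(transaction id, utility of the item in that transaction), ...]}."""
    lists = defaultdict(list)
    for tid, items in db.items():
        for item, quantity in items.items():
            lists[item].append((tid, quantity * profit[item]))
    return dict(lists)


lists = build_average_utility_lists(DB, PROFIT)
print(lists["e"])  # [('T2', 12), ('T5', 6)]
```

Only one pass over the transactions is needed to build every list, which is why the same routine can be applied to a newly inserted data set on its own.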

In another aspect, the present application further provides a high average utility item set mining apparatus, including:

a transaction utility determining unit, configured to determine a total utility value of all accumulated inserted transactions in the database before the current moment according to at least one transaction contained in a data set that has been inserted into the database and has not been mined;

a safety value obtaining unit, configured to obtain a utility safety value corresponding to an initial total utility value of the database, where the initial total utility value is the total utility value of the database before all of the transactions were inserted into the database;

a first list obtaining unit, configured to determine, when the total utility value of all the transactions is smaller than the utility safety value, an average utility list of each 1-item set in the data set, where the average utility list of each 1-item set records at least each target transaction containing the 1-item set and the utility value of the 1-item set in each target transaction;

a second list obtaining unit, configured to acquire a stored average utility list of at least one 1-item set that satisfies an item set extension condition in an original database, where the original database is the database before the data set is inserted; a 1-item set that satisfies the item set extension condition is a 1-item set that is determined when high average utility item sets are mined from the original database and whose average utility boundary is greater than the low average utility threshold corresponding to the original database; and the average utility boundary of an item set is the sum of the maximum utility values of all transactions containing the item set;

and an item set mining unit, configured to determine the high average utility item sets in the database according to the average utility list of each 1-item set in the data set and the average utility list of the at least one 1-item set that satisfies the item set extension condition in the original database.

In yet another aspect, the present application further provides a computer device, including:

a processor and a memory;

wherein the processor is configured to execute a program stored in the memory;

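the memory is configured to store the program, the program being used at least to carry out the steps of the high average utility item set mining method described above.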

It can be seen that, in the embodiments of the present application, when the database contains a newly inserted data set that has not yet been processed, if the total utility value of all transactions inserted into the database is less than the utility safety value, then any 1-item set whose average utility boundary was less than the low average utility threshold of the original database (the database before the data set was inserted) still does not satisfy the item set extension condition. Consequently, if the other 1-item sets of the original database can be obtained, that is, those whose average utility boundary is not less than the low average utility threshold, the 1-item sets that satisfy the item set extension condition can be determined without rescanning each transaction in the original database. Therefore, when the total utility value of all transactions inserted into the database is less than the utility safety value, the present application obtains the pre-stored average utility list of each 1-item set that satisfies the item set extension condition in the original database, and determines the current high average utility item sets in the database from these lists together with the average utility list of each 1-item set determined from the newly inserted data set. Only the newly inserted data set needs to be scanned, and the transactions in the original database do not need to be rescanned, which greatly reduces the amount of data to be scanned and the computing resources consumed in mining high average utility item sets.

Drawings

In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings in the following description show only embodiments of the present application, and those of ordinary skill in the art can obtain other drawings from the provided drawings without creative effort.

FIG. 1 is a diagram illustrating nine types of situations that are possible for a set of items after a data set is inserted into an original database in an embodiment of the present application;

FIG. 2 is a schematic diagram illustrating a component architecture of a system to which the high average utility item set mining method according to the embodiment of the present application is applied;

FIG. 3 is a flow chart diagram illustrating a high average utility item set mining method according to an embodiment of the present application;

FIGS. 4a and 4b are schematic diagrams illustrating the average utility lists of the respective 1-item sets in the database based on Table 1 and the data set based on Table 3, respectively;

FIG. 5 is a schematic diagram illustrating a list of average utilities of a 1-item set incorporated herein;

FIG. 6 is a flow diagram illustrating the mining of a high average utility item set according to the present application from an average utility list of a 1-item set;

FIG. 7 shows a schematic diagram of expanding a set of items based on an enumeration tree;

FIG. 8 illustrates a schematic diagram of building an average utility list for a 2-item set based on the average utility lists of 1-item sets;

FIG. 9 is a schematic flow chart diagram illustrating a further embodiment of a high average utility item set mining method of the present application;

FIG. 10 is a schematic diagram illustrating an exemplary embodiment of a high average utility item set mining apparatus according to the present application;

FIG. 11 illustrates a schematic diagram of a computer device according to the present application.

Detailed Description

In order to facilitate understanding of the technical solutions described in the embodiments of the present application, some naming concepts related to the embodiments of the present application are introduced below.

1. Transaction: a record in a database (also referred to as a transaction database). For example, in a database that records commodity sales, each transaction may be one sales record. For another example, in a database that records web page visits, a transaction may record the web pages viewed by a user, the time spent on each web page, and the like.

2. Transaction identifier: an identifier used to distinguish each transaction in the database. For example, transactions may be numbered in order of generation time.

3. Data item, also referred to as an item: an item of information recorded in a transaction. A transaction includes at least one data item and an internal utility value for each data item. For example, in a sales-type database, the data items of each transaction may be the names of the commodities sold, and the internal utility value of each commodity, that is, its purchase quantity, is also recorded in the transaction.

An example of a database containing multiple transactions is shown in table 1.

TABLE 1

Transaction ID	Transaction (commodity name: quantity)
T1	a:1, b:5, c:2, d:3, f:6
T2	b:2, c:3, e:2
T3	a:1, b:2, d:1, f:1
T4	a:1, c:3, d:2
T5	a:1, e:1
T6	b:7, d:1, f:2
T7	a:3, b:9, c:3, d:1

As can be seen from Table 1, the database is a sales-type database containing 7 transactions. The data items in each transaction are commodity names, and the purchase quantity of the commodity corresponding to each commodity name is also recorded in the transaction. For example, the transaction ID of the first transaction is "T1" (transaction T1 for short), and the data items in transaction T1 are a, b, c, d and f; transaction T1 therefore represents a record of purchasing 1 unit of commodity a, 5 units of commodity b, 2 units of commodity c, 3 units of commodity d and 6 units of commodity f.

Of course, Table 1 only takes a sales-type database as an example. If the database records news data, each transaction may record the interest value, sensitivity, freshness and the like of at least one piece of news, and the name of each piece of news is a data item of the transaction.

4. Item set (also called a pattern): a collection of at least one data item, used to characterize an association rule inherent in a transaction database. A transaction differs from an item set in that a transaction is typically a record in the database triggered by an actual event, whereas an item set is typically mined from the database and does not necessarily correspond to an actual event.

5. K-item set: an item set containing K data items. For example, a 1-item set is an item set containing one data item; a 1-item set containing data item a can be written as item set a or item set {a}. As another example, a 2-item set is an item set containing two data items; a 2-item set containing only data item a and data item b can be written as item set ab or item set {ab}.

6. External utility table (e.g., a profit table): a table that records the unit external utility value (e.g., the profit value) corresponding to each data item in the database. For example, in a sales-type database, the profit table records the profit value corresponding to each commodity. For other types of databases, the profit value may represent an interest value of the data item or a fixed value set in advance. For example, in a database storing web browsing information, a transaction may record the names of web pages as data items, together with the user's browsing duration for each web page; different profit values may be preset for different web pages, or the profit value of every web page may be fixed at 1. Table 2 shows a profit table containing the profit values corresponding to the data items in the database of Table 1.

TABLE 2

Data item	a	b	c	d	e	f
Profit	4	1	5	7	6	3

As can be seen from Table 2, the unit profit of data item a is 4. If data item a in Table 1 represents commodity a, then the profit from selling one unit (for example, one kilogram) of commodity a is 4 yuan.

7. Utility value of a data item: the utility value $u(i_j, T_q)$ of a data item $i_j$ in a transaction $T_q$ is the internal utility value $q(i_j, T_q)$ of the data item in the transaction multiplied by the external utility value $p(i_j)$ of the data item, as shown in formula one:

$$u(i_j, T_q) = q(i_j, T_q) \times p(i_j)$$

For example, in a sales-type database, the utility value of a data item is the product of the purchase quantity of a commodity in a transaction and the profit of the commodity. Taking data item c in Tables 1 and 2 as an example, the utility value of data item c in transaction T1 is: (quantity of c in transaction T1) × (profit of c) = 2 × 5 = 10.

8. Utility value of an item set, also referred to as the utility value of the item set in a transaction: the sum of the utility values, in the transaction, of the data items in the item set. For example, for the item set bc consisting of data item b and data item c in Table 1, the utility value of item set bc in transaction T2 is 2 × 1 + 3 × 5 = 17.
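The two definitions above can be checked with a minimal sketch. It uses only transactions T1 and T2 from Table 1 and the profit values from Table 2; the function names `item_utility` and `itemset_utility` are illustrative and not taken from the application.

```python
# Unit profits from Table 2 and two transactions from Table 1.
PROFIT = {"a": 4, "b": 1, "c": 5, "d": 7, "e": 6, "f": 3}
DB = {
    "T1": {"a": 1, "b": 5, "c": 2, "d": 3, "f": 6},
    "T2": {"b": 2, "c": 3, "e": 2},
}


def item_utility(item, tid):
    """u(i, T) = q(i, T) * p(i): quantity in the transaction times unit profit."""
    return DB[tid][item] * PROFIT[item]


def itemset_utility(itemset, tid):
    """Sum of the item utilities of `itemset` in transaction `tid` (0 if not contained)."""
    if not all(item in DB[tid] for item in itemset):
        return 0
    return sum(item_utility(item, tid) for item in itemset)


print(item_utility("c", "T1"))            # 10
print(itemset_utility(("b", "c"), "T2"))  # 17
```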

9. Average utility value of an item set in a transaction: the utility value of the item set in the transaction divided by the length of the item set, where the length of an item set is the number of data items it contains, denoted k. For example, the average utility value $au(X, T_q)$ of item set X in transaction $T_q$ can be expressed as formula two:

$$au(X, T_q) = \frac{\sum_{i_j \in X \wedge X \subseteq T_q} u(i_j, T_q)}{k}$$

10. Average utility value of an item set in the database: the average utility value $au(X)$ of item set X in database D is the sum of the average utility values $au(X, T_q)$ of the item set in each transaction $T_q$ that contains it, as shown in formula three:

$$au(X) = \sum_{X \subseteq T_q \wedge T_q \in D} au(X, T_q)$$

For example, to calculate the average utility value of item set ab in the database, the average utility value of item set ab in each transaction containing it is calculated first; in T1 it is (1 × 4 + 5 × 1) ÷ 2 = 4.5. Similarly, the average utility values of item set ab in transactions T3 and T7 are 3 and 10.5, respectively. Summing these values gives the average utility value of item set ab in the database: 4.5 + 3 + 10.5 = 18.

Of course, the average utility value of the set of items in the database is also equal to the total utility value of the set of items in the database divided by the length of the set of items, where the total utility value of the set of items in the database is the sum of the utility values of the set of items in each transaction containing the set of items.
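The following minimal sketch reproduces the value 18 for item set ab in both ways described above: by summing the per-transaction average utility values, and by dividing the total utility value of the item set by its length. The data come from Tables 1 and 2 (T6 is read as b:7, d:1, f:2, which does not affect item set ab); the function names are illustrative.

```python
PROFIT = {"a": 4, "b": 1, "c": 5, "d": 7, "e": 6, "f": 3}
DB = {
    "T1": {"a": 1, "b": 5, "c": 2, "d": 3, "f": 6},
    "T2": {"b": 2, "c": 3, "e": 2},
    "T3": {"a": 1, "b": 2, "d": 1, "f": 1},
    "T4": {"a": 1, "c": 3, "d": 2},
    "T5": {"a": 1, "e": 1},
    "T6": {"b": 7, "d": 1, "f": 2},
    "T7": {"a": 3, "b": 9, "c": 3, "d": 1},
}


def itemset_utility(itemset, tx):
    """Utility of `itemset` in one transaction, or 0 if the transaction lacks an item."""
    if not all(item in tx for item in itemset):
        return 0
    return sum(tx[item] * PROFIT[item] for item in itemset)


def average_utility(itemset, db):
    """Sum over transactions of (item set utility in the transaction / item set length)."""
    return sum(itemset_utility(itemset, tx) / len(itemset) for tx in db.values())


def average_utility_via_total(itemset, db):
    """Total utility of the item set in the database divided by the item set length."""
    return sum(itemset_utility(itemset, tx) for tx in db.values()) / len(itemset)


print(average_utility(("a", "b"), DB))            # 18.0
print(average_utility_via_total(("a", "b"), DB))  # 18.0
```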

11. Utility value of a transaction: the sum of the utility values $u(i_j, T_q)$ of all data items $i_j$ in the transaction $T_q$. The utility value $tu(T_q)$ of transaction $T_q$ can be expressed as formula four:

$$tu(T_q) = \sum_{i_j \in T_q} u(i_j, T_q)$$

12. Total utility value of the database: the sum of the utility values $tu(T_q)$ of all transactions $T_q$ in database D. The total utility value $TU^D$ of database D can be expressed as formula five:

$$TU^D = \sum_{T_q \in D} tu(T_q)$$

13. High average utility item set: an item set whose average utility value in the database is greater than the set high average utility threshold. The high average utility threshold may be set by a user. In practical applications, the user may instead set a first preset percentage used to determine the high average utility threshold of the database; for example, the high average utility threshold may be the product of the total utility value of the database and the first preset percentage.

Still taking Tables 1 and 2 as an example, the utility value of transaction T5 is 1 × 4 + 1 × 6 = 10; similarly, the utility values of transactions T1, T2, T3, T4, T6 and T7 are 58, 29, 16, 33, 20 and 43, respectively, so the total utility value of the database is 58 + 29 + 16 + 33 + 10 + 20 + 43 = 209. If the high average utility threshold is set to 10% of the total utility value of the database, the high average utility threshold is 209 × 10% = 20.9. The average utility value of item set ab in the database is 18, which is less than the high average utility threshold 20.9, so item set ab is not a high average utility item set.
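The arithmetic in this example can be reproduced with a short sketch. It assumes the data of Tables 1 and 2, with T6 read as b:7, d:1, f:2 (the reading consistent with the stated utility value of 20 for T6); `HIGH_PERCENT` stands for the first preset percentage and the function name is illustrative.

```python
PROFIT = {"a": 4, "b": 1, "c": 5, "d": 7, "e": 6, "f": 3}
DB = {
    "T1": {"a": 1, "b": 5, "c": 2, "d": 3, "f": 6},
    "T2": {"b": 2, "c": 3, "e": 2},
    "T3": {"a": 1, "b": 2, "d": 1, "f": 1},
    "T4": {"a": 1, "c": 3, "d": 2},
    "T5": {"a": 1, "e": 1},
    "T6": {"b": 7, "d": 1, "f": 2},
    "T7": {"a": 3, "b": 9, "c": 3, "d": 1},
}
HIGH_PERCENT = 0.10  # first preset percentage used in the example above


def transaction_utility(tx):
    """tu(T): sum of quantity * unit profit over all items of the transaction."""
    return sum(quantity * PROFIT[item] for item, quantity in tx.items())


tu = {tid: transaction_utility(tx) for tid, tx in DB.items()}
total_utility = sum(tu.values())
high_threshold = total_utility * HIGH_PERCENT

print(tu)                   # {'T1': 58, 'T2': 29, 'T3': 16, 'T4': 33, 'T5': 10, 'T6': 20, 'T7': 43}
print(total_utility)        # 209
print(18 > high_threshold)  # False: item set ab (average utility 18) is not high average utility
```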

14. Maximum utility value of a transaction (also referred to as the transaction maximum utility value): the utility value $u(i_j, T_q)$ of the data item that has the largest utility value in the transaction. The maximum utility value $tmu(T_q)$ of transaction $T_q$ can be expressed as formula six:

$$tmu(T_q) = \max\{\, u(i_j, T_q) \mid i_j \in T_q \,\}$$

15. Average utility boundary (auub) of an item set: the sum of the maximum utility values $tmu(T_q)$ of all transactions $T_q$ containing the item set X. The average utility boundary $auub(X)^D$ of item set X in database D can be expressed as formula seven:

$$auub(X)^D = \sum_{X \subseteq T_q \wedge T_q \in D} tmu(T_q)$$

For example, in Table 1 the transactions containing the 1-item set a are T1, T3, T4, T5 and T7, where the maximum utility value of transaction T1 is max{4, 5, 10, 21, 18} = 21. Similarly, the maximum utility values of transactions T3, T4, T5 and T7 are 7, 15, 6 and 15, respectively, so the average utility boundary of item set a is 21 + 7 + 15 + 6 + 15 = 64.
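A minimal sketch of the two definitions, again over Tables 1 and 2 with T6 read as b:7, d:1, f:2 (an assumption consistent with the stated utility value of T6). It reproduces auub = 64 for item set a; the values printed for the other 1-item sets are computed the same way and are not stated in the text. The function names `tmu` and `auub` follow the symbols above.

```python
PROFIT = {"a": 4, "b": 1, "c": 5, "d": 7, "e": 6, "f": 3}
DB = {
    "T1": {"a": 1, "b": 5, "c": 2, "d": 3, "f": 6},
    "T2": {"b": 2, "c": 3, "e": 2},
    "T3": {"a": 1, "b": 2, "d": 1, "f": 1},
    "T4": {"a": 1, "c": 3, "d": 2},
    "T5": {"a": 1, "e": 1},
    "T6": {"b": 7, "d": 1, "f": 2},
    "T7": {"a": 3, "b": 9, "c": 3, "d": 1},
}


def tmu(tx):
    """Transaction maximum utility: the largest item utility in the transaction."""
    return max(quantity * PROFIT[item] for item, quantity in tx.items())


def auub(itemset, db):
    """Average utility boundary: sum of tmu over the transactions containing the item set."""
    return sum(tmu(tx) for tx in db.values() if all(item in tx for item in itemset))


print(tmu(DB["T1"]))                                # 21
print(auub(("a",), DB))                             # 64
print({i: auub((i,), DB) for i in sorted(PROFIT)})  # {'a': 64, 'b': 65, 'c': 66, 'd': 65, 'e': 21, 'f': 35}
```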

16. High average utility boundary item set (HAUUBI): if the average utility boundary of a 1-item set is greater than the set high average utility threshold, the 1-item set is a high average utility boundary item set.

17. Higher average utility boundary item set (PAUUBI): if the average utility boundary of a 1-item set is less than or equal to the high average utility threshold but greater than the low average utility threshold, the 1-item set is a higher average utility boundary item set. The low average utility threshold may be set by a user. In practical applications, the user may instead set a second preset percentage used to determine the low average utility threshold of the database; for example, the low average utility threshold may be the product of the total utility value of the database and the second preset percentage. Because the low average utility threshold is less than the high average utility threshold, the second preset percentage is less than the first preset percentage.

18. Small item set: if the average utility boundary of a 1-item set is less than the set low average utility threshold, the 1-item set is a small item set.
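The three categories can be illustrated with a small sketch. The auub value 64 for item a comes from the example above; the other auub values are computed in the same way from Tables 1 and 2 (with T6 read as b:7, d:1, f:2). The percentages 30% and 15% are hypothetical values chosen only so that all three categories appear, and a 1-item set whose boundary equals the low threshold is treated as small here, which is an assumption.

```python
TOTAL_UTILITY = 209  # total utility value of the database in Table 1
AUUB = {"a": 64, "b": 65, "c": 66, "d": 65, "e": 21, "f": 35}

HIGH_PERCENT = 0.30  # hypothetical first preset percentage
LOW_PERCENT = 0.15   # hypothetical second preset percentage (smaller than the first)


def classify(auub_value, high_threshold, low_threshold):
    """Return the category of a 1-item set from its average utility boundary."""
    if auub_value > high_threshold:
        return "HAUUBI"
    if auub_value > low_threshold:
        return "PAUUBI"
    return "small"  # assumption: a boundary equal to the low threshold counts as small


high = TOTAL_UTILITY * HIGH_PERCENT  # about 62.7
low = TOTAL_UTILITY * LOW_PERCENT    # about 31.35
categories = {item: classify(value, high, low) for item, value in AUUB.items()}
print(categories)
# {'a': 'HAUUBI', 'b': 'HAUUBI', 'c': 'HAUUBI', 'd': 'HAUUBI', 'e': 'small', 'f': 'PAUUBI'}
```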

19. Utility safety value: a value related to the total utility value of the database. The utility safety value is a utility threshold that determines whether the database must be rescanned when a data set containing at least one transaction is inserted into the database.

20. Pruning strategy: if the average utility boundary auub of an item set is less than the high average utility threshold and the low average utility threshold, then the average utility boundary of every superset of the item set is also less than the high average utility threshold and the low average utility threshold, so no superset needs to be examined. Conversely, if the average utility boundary of an item set is greater than the high average utility threshold, the item set can continue to be extended.

An item set obtained by extending another item set is a superset of that item set. For example, if item set B is extended from item set A, then item set B is a superset of item set A; in that case, all data items in item set A are present in item set B, and item set B also contains some data items that are not present in item set A.
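The pruning strategy relies on two properties that follow from the definitions: the average utility value of an item set never exceeds its average utility boundary, and extending an item set never increases the boundary, because fewer transactions contain the superset. The sketch below checks both properties exhaustively for the 1-item and 2-item sets of Table 1 (T6 read as b:7, d:1, f:2). It is a sanity check on the example data, not a proof, and the helper names are illustrative.

```python
from itertools import combinations

PROFIT = {"a": 4, "b": 1, "c": 5, "d": 7, "e": 6, "f": 3}
DB = {
    "T1": {"a": 1, "b": 5, "c": 2, "d": 3, "f": 6},
    "T2": {"b": 2, "c": 3, "e": 2},
    "T3": {"a": 1, "b": 2, "d": 1, "f": 1},
    "T4": {"a": 1, "c": 3, "d": 2},
    "T5": {"a": 1, "e": 1},
    "T6": {"b": 7, "d": 1, "f": 2},
    "T7": {"a": 3, "b": 9, "c": 3, "d": 1},
}


def supporting(itemset):
    """Transactions that contain every item of the item set."""
    return [tx for tx in DB.values() if all(item in tx for item in itemset)]


def auub(itemset):
    """Sum of the transaction maximum utilities over the supporting transactions."""
    return sum(max(q * PROFIT[i] for i, q in tx.items()) for tx in supporting(itemset))


def average_utility(itemset):
    """Sum over supporting transactions of item set utility divided by item set length."""
    return sum(sum(tx[i] * PROFIT[i] for i in itemset) / len(itemset)
               for tx in supporting(itemset))


def check_pruning_properties():
    """Check au(X) <= auub(X) and auub(X + {i}) <= auub(X) on the example data."""
    items = sorted(PROFIT)
    for k in (1, 2):
        for x in combinations(items, k):
            if average_utility(x) > auub(x):
                return False
            if any(auub(x + (i,)) > auub(x) for i in items if i not in x):
                return False
    return True


print(check_pruning_properties())       # True
print(auub(("e",)), auub(("b", "e")))  # 21 15
```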

Through research, the inventors of the present application found the following. When high average utility item sets are mined from a database, the data items recorded in each transaction and their internal utility values are queried in sequence and combined with a preset external utility table to determine the 1-item sets of the database that can continue to be extended; other item sets are extended from these 1-item sets, and the high average utility item sets are determined from the 1-item sets and the extended item sets. After at least one transaction is inserted into the database, the set of 1-item sets that can continue to be extended may change: a 1-item set that could not be extended in the original database may become one that can. If the updated database is not rescanned, that is, if its transactions are not queried again in sequence, some extendable 1-item sets may be missed, and not all high average utility item sets in the updated database can then be mined.

Based on the above findings, the inventors reasoned that when few transactions are inserted into the database, so that the total utility value of the newly inserted transactions is small, the newly inserted transactions are unlikely to affect which 1-item sets can continue to be extended. On this basis, the inventors of the present application conducted the following further research:

Instead of setting only one high average utility threshold for the database, the inventors of the present application set two thresholds for the database: a high average utility threshold and a low average utility threshold, where the high average utility threshold is greater than the low average utility threshold. Both thresholds can be determined as described above.

For each 1-item set in the database, if the average utility boundary of the 1-item set is greater than the high average utility threshold, the 1-item set is determined to be a high average utility boundary item set (HAUUBI); if the average utility boundary of the 1-item set is less than or equal to the high average utility threshold but greater than the low average utility threshold, the 1-item set is determined to be a higher average utility boundary item set (PAUUBI); correspondingly, a 1-item set that belongs to neither the high average utility boundary item sets nor the higher average utility boundary item sets is a small item set. On this basis, the 1-item sets in the database before the data set is inserted (referred to as the original database for ease of distinction) fall into three types: HAUUBI, PAUUBI and small item set. Similarly, for a data set that is inserted into the database and contains at least one transaction, at least one 1-item set can be determined from the transactions contained in the data set (the data set can be regarded as a new small database), and each 1-item set in the data set can likewise be assigned to one of the three types.

Accordingly, an item set may belong to any of three types in the original database before a data set is inserted, and to any of three types in the inserted data set; in the database after the data set is inserted, there are therefore nine possible cases for the item set. For example, FIG. 1 shows the nine possible cases of an item set after a data set is inserted into the original database, judged against the high average utility threshold of the database after insertion (the threshold also changes once transactions are inserted into the database).

As can be seen from FIG. 1, depending on the item set types to which an item set belongs in the original database and in the data set, respectively, there are nine possible cases for its type in the updated database obtained by inserting the data set into the original database. The nine cases are as follows:

Case 1: the item set belongs to the high average utility boundary item sets (HAUUBI) both in the original database, into which the data set has not yet been inserted, and in the data set to be inserted. In this case, the item set still belongs to HAUUBI in the database after the data set is inserted (i.e., the updated database).

Case 2: the item set belongs to HAUUBI in the original database and to the higher average utility boundary item sets (PAUUBI) in the inserted data set. In this case, the item set may remain HAUUBI or become PAUUBI in the updated database.

Case 3: the item set belongs to HAUUBI in the original database and to neither HAUUBI nor PAUUBI in the inserted data set, i.e., it is a small item set there. In this case, the item set may be HAUUBI, PAUUBI or a small item set in the updated database.

Case 4: the item set belongs to PAUUBI in the original database and to HAUUBI in the inserted data set. In this case, the item set belongs to HAUUBI or PAUUBI in the updated database.

Case 5: the item set belongs to PAUUBI in both the original database and the inserted data set. In this case, the item set still belongs to PAUUBI in the updated database.

Case 6: the item set belongs to PAUUBI in the original database and is a small item set in the inserted data set. In this case, the item set belongs to PAUUBI or is a small item set in the updated database.

Case 7: the item set is a small item set in the original database and belongs to HAUUBI in the inserted data set. In this case, if the total utility value of the transactions in the inserted data set is less than the utility safety value corresponding to the original database, the item set is still a small item set in the updated database; if the total utility value of the transactions in the data set is not less than the utility safety value, the item set may belong to PAUUBI in the updated database.

Case 8: the item set is a small item set in the original database and belongs to PAUUBI in the inserted data set. In this case, the item set is a small item set or belongs to PAUUBI in the updated database.

Case 9: the item set is a small item set in both the original database and the inserted data set. In this case, the item set remains a small item set in the updated database.
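The nine cases can be summarized as a lookup table, sketched below. The table only restates the outcomes listed above; the names `CASES` and `possible_types` and the boolean parameter `below_safety_value` are illustrative. For Case 7 with a total inserted utility that is not less than the utility safety value, the sketch returns both "small" and "PAUUBI", reading "may belong to PAUUBI" as leaving both outcomes open.

```python
H, P, S = "HAUUBI", "PAUUBI", "small"

# (type in original database, type in inserted data set) -> possible types after insertion
CASES = {
    (H, H): {H},        # Case 1
    (H, P): {H, P},     # Case 2
    (H, S): {H, P, S},  # Case 3
    (P, H): {H, P},     # Case 4
    (P, P): {P},        # Case 5
    (P, S): {P, S},     # Case 6
    (S, P): {S, P},     # Case 8
    (S, S): {S},        # Case 9
}


def possible_types(original, inserted, below_safety_value):
    """Possible item set types in the updated database, following the nine cases."""
    if (original, inserted) == (S, H):  # Case 7 depends on the utility safety value
        return {S} if below_safety_value else {S, P}
    return CASES[(original, inserted)]


print(sorted(possible_types(S, H, below_safety_value=True)))   # ['small']
print(sorted(possible_types(S, H, below_safety_value=False)))  # ['PAUUBI', 'small']
print(sorted(possible_types(H, S, below_safety_value=True)))   # ['HAUUBI', 'PAUUBI', 'small']
```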

As can be seen from the above nine cases, an item set that belongs to HAUUBI in the original database is most likely to remain HAUUBI in the database after the data set is inserted (i.e., the updated database). If an item set belongs to HAUUBI in the updated database, its average utility boundary is greater than the high average utility threshold of the updated database, so other item sets can be extended from it in the updated database. Because item set mining extends item sets starting from 1-item sets, the 1-item sets that belong to HAUUBI in the original database need to be saved to avoid incomplete mining; after the database is updated, other item sets can then be extended from them without rescanning the original database, which saves time and computing resources.

An item set that belongs to PAUUBI in the original database has a high probability of becoming HAUUBI in the updated database. To avoid incomplete mining of item sets in this case, the 1-item sets that belong to PAUUBI in the original database also need to be saved.
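In other words, the 1-item sets worth saving from the original database are those whose average utility boundary exceeds the low average utility threshold. A minimal sketch of that selection follows. The auub value for item a is the one worked out in the earlier example, the other values are computed in the same way from Tables 1 and 2 (T6 read as b:7, d:1, f:2), and the 15% low threshold is a hypothetical value; the function name is illustrative.

```python
TOTAL_UTILITY = 209
AUUB = {"a": 64, "b": 65, "c": 66, "d": 65, "e": 21, "f": 35}
LOW_PERCENT = 0.15  # hypothetical second preset percentage


def items_to_retain(auub_by_item, low_threshold):
    """1-item sets that are HAUUBI or PAUUBI, i.e. whose auub exceeds the low threshold."""
    return sorted(item for item, value in auub_by_item.items() if value > low_threshold)


retained = items_to_retain(AUUB, TOTAL_UTILITY * LOW_PERCENT)
print(retained)  # ['a', 'b', 'c', 'd', 'f']
```

With these hypothetical thresholds only item e falls below the low threshold, so its average utility list would not be kept.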

Meanwhile, as can be seen from the above nine cases, for the item sets belonging to the small item set in the original database, the item sets may become PAUBBI in the database after the data set is inserted, if a certain condition is satisfied. Wherein, the condition of meeting is that the total utility value of the inserted data set is not less than the safety utility value.

In practical application, a database may have sporadic transaction insertions, and if one item set is changed from a small item set to PAUBBI after a data set is inserted into the database for the first time, and then another data set containing at least one transaction is inserted into the database for the next time, the database in which the item set in the original database is inserted twice is likely to become HAUUBI, and the item set actually belongs to an item set capable of expanding other item sets in the database in which the data sets are inserted twice. It can be seen that if the total utility value inserted into a data set is not less than the security utility value when the data set is inserted into the database for the first time without rescanning the original database, then after the data set is subsequently inserted into the original database again, item set mining is performed based on only 1-item sets belonging to HAUUBI and PAUUBI stored in the original database, and some item sets that may extend other item sets may be missed.

Based on the above analysis, when a data set is inserted into the database for the first time, the original database needs to be rescanned only if the total utility value of the transactions in the inserted data set is not less than the safety utility value, so as to mine the 1-item sets from which other item sets need to be extended. If the total utility value of the inserted data set is less than the safety utility value, the original database does not need to be scanned, and no item set is missed as long as item sets continue to be extended from the 1-item sets belonging to HAUUBI and PAUUBI in the original database together with the 1-item sets in the inserted data set. In practice, the total utility value of the transactions in an inserted data set is small relative to that of the original database, so the total utility value of an inserted data set rarely reaches the safety utility value. The original database is therefore rescanned only in a few cases, which reduces the computing resources consumed, greatly reduces the time spent scanning the original database, and improves mining efficiency.

Meanwhile, although the total utility value of the transactions inserted into the database at any one time may be smaller than the safety utility value of the original database, data sets may be inserted into the database multiple times, and when the total number of inserted transactions is large, the total utility value of all transactions across these insertions may exceed the safety utility value. Therefore, when an insertion of a data set into the database is detected, the sum of the total utility values of all transactions inserted this time and in each previous insertion needs to be calculated; that is, the total utility values of the data sets inserted at different times are accumulated. If the accumulated total utility value is smaller than the safety utility value, the database does not need to be rescanned at present.

In combination with the above findings, the following describes the embodiments of the present application in detail with reference to the drawings.

For example, referring to fig. 2, a schematic diagram of a system for which a high average utility item set mining method of the present application is applicable is shown.

In the system shown in fig. 2, it comprises: a computer device 21 and a storage server 22, wherein the computer device may establish a communication connection with the storage server via a network.

The storage server 22 may be a server where a database is located, and a database containing a plurality of transactions is stored in the storage server, for example, the storage server may store a database composed of a plurality of commodity transaction records.

The computer device 21 may obtain each transaction in the database stored in the storage server and determine the high average utility item sets in the database according to those transactions.

Of course, fig. 2 shows only one system configuration to which the present application is applicable; in practical applications, the computer device may also store the database itself, without a separate storage server.

The high average utility item set mining method of the present application is described below from the perspective of a computer device. Referring to fig. 3, which shows a schematic flowchart of an embodiment of a high average utility item set mining method according to the present application, the method of the embodiment may include:

S301, according to at least one transaction contained in the data set which is inserted into the database and is not mined, determining the total utility value of all transactions cumulatively inserted into the database before the current time.

Wherein a data set is a data set that has not been mined if, after the data set is inserted, the high average utility item sets in the database into which the data set is inserted have not yet been determined.

The transactions cumulatively inserted into the database before the current time are all the transactions inserted into the database since the database was most recently scanned.

From the foregoing research of the inventors of the present application, the premise for not having to rescan the database is that the total utility value of all transactions inserted into the database before the current time is less than the safety utility value. Therefore, the total utility value of all cumulatively inserted transactions needs to be determined for subsequent comparison with the safety utility value.

For example, if 5 transactions have been cumulatively inserted into the database before the current time, the utility value of each of the 5 transactions is calculated, and the sum of these utility values is the total utility value of the 5 cumulatively inserted transactions.

It will be appreciated that in actual practice, the computer device may cache the total utility value of the cumulatively inserted transactions. For example, each time at least one transaction is inserted into the database, the computer device may calculate the total utility value of the currently inserted transaction(s) and add it to the cached total utility value of the transactions inserted before the current time, thereby obtaining the total utility value of all transactions inserted into the database. Accordingly, in step S301, the computer device may read the cached total utility value of all transactions cumulatively inserted into the database from the cache region.
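The caching described above can be sketched as a small running-total helper. This is only an illustrative sketch: the class name `InsertedUtilityCache`, its method names, and the numeric values in the usage lines are assumptions and do not come from the application.

```python
class InsertedUtilityCache:
    """Hypothetical helper: running total utility of the transactions
    inserted since the last full scan of the database."""

    def __init__(self, safety_value):
        self.safety_value = safety_value  # utility safety value of the database
        self.cumulative = 0.0             # cached cumulative total utility

    def record_insertion(self, transaction_utilities):
        """Add the utilities of a newly inserted data set to the cached total."""
        self.cumulative += sum(transaction_utilities)
        return self.cumulative

    def needs_rescan(self):
        """A rescan is needed once the cumulative total is not less than the safety value."""
        return self.cumulative >= self.safety_value

    def reset(self):
        """Clear the cached total after the database has been fully rescanned."""
        self.cumulative = 0.0


# Illustrative values only.
cache = InsertedUtilityCache(safety_value=11.61)
first_total = cache.record_insertion([4.0, 3.0])
rescan_after_first = cache.needs_rescan()
second_total = cache.record_insertion([5.0])
rescan_after_second = cache.needs_rescan()
```

Keeping only a single cached number means step S301 never has to revisit earlier insertions; the total is reset only when a full rescan takes place.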

It will be appreciated that there are various conditions that trigger the determination of the aggregate utility value for all transactions inserted cumulatively, e.g., the determination of the set of unprocessed data inserted into the database and the determination of the aggregate utility value for all transactions inserted cumulatively can be performed when a set data mining time is met or a data mining period is reached.

For another example, in practical applications, the determination of the high average utility item sets in the database may be triggered each time a data set containing at least one transaction is inserted into the database. In this case, when the insertion of a data set containing at least one transaction is detected, the total utility value of all transactions cumulatively inserted into the database before the current time may be determined. If the currently inserted data set is the first data set inserted after the last full scan of the database, the total utility value of the transactions in this data set can be determined directly, so that it can subsequently be compared with the safety utility value.

S302, obtaining a utility safety value corresponding to the initial total utility value of the database.

Wherein the initial total utility value of the database is the total utility value of the database before all the transactions are inserted into the database. It will be appreciated that if no other transactions are inserted into the database prior to the data set, the initial total utility value for the database is the total utility value for the database prior to the data set being inserted.

It is understood that the utility safety value is related to the initial total utility value of the database and can be set according to actual needs. As an option, the initial total utility value of the database, a first preset percentage S_u used to determine the high average utility threshold of the database, and a second preset percentage S_t used to determine the low average utility threshold may be obtained. Then, the utility safety value f corresponding to the database is determined according to the initial total utility value of the database, the first preset percentage S_u, and the second preset percentage S_t. For example, the utility safety value f can be calculated by the following equation eight:

$$f = \frac{(S_u - S_t) \times TU}{1 - S_u}$$

where TU denotes the initial total utility value of the database.

The first preset percentage and the second preset percentage can be set by a user according to needs. In this case, when the database changes so that its total utility value changes, the high average utility threshold and the low average utility threshold of the changed database, which are calculated from the first preset percentage and the second preset percentage, change accordingly.

For example, taking the database shown in table 1 as an example, if the utility value of transaction T1 is 58, the utility value of transaction T2 is 29, the utility value of transaction T3 is 16, the utility value of transaction T4 is 33, the utility value of transaction T5 is 10, the utility value of transaction T6 is 20, and the utility value of transaction T7 is 43, then the total utility value of the database (which is also its initial total utility value) is 58 + 29 + 16 + 33 + 10 + 20 + 43 = 209. Assuming that the first preset percentage is 10% and the second preset percentage is 5%, the high average utility threshold of the database is 209 × 10% = 20.9, and the low average utility threshold of the database is 209 × 5% = 10.45. According to the utility safety value equation given above, the safety utility value of the database can be calculated to be (10% - 5%) × 209 / (1 - 10%) ≈ 11.61.
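The arithmetic of this worked example can be reproduced with a short sketch. It uses the expression from the example, (S_u - S_t) × total utility / (1 - S_u); the function name `utility_safety_value` and the argument check are illustrative assumptions.

```python
def utility_safety_value(initial_total_utility, s_u, s_t):
    """Safety value f = (S_u - S_t) * TU / (1 - S_u), as in the worked example."""
    if not 0 <= s_t < s_u < 1:
        # Assumed sanity check: the low percentage lies below the high one.
        raise ValueError("expected 0 <= s_t < s_u < 1")
    return (s_u - s_t) * initial_total_utility / (1 - s_u)


# Utility values of transactions T1..T7 from the example above.
transaction_utilities = [58, 29, 16, 33, 10, 20, 43]
total_utility = sum(transaction_utilities)

high_threshold = total_utility * 0.10  # first preset percentage, 10%
low_threshold = total_utility * 0.05   # second preset percentage, 5%
safety_value = utility_safety_value(total_utility, 0.10, 0.05)

print(total_utility, round(high_threshold, 2), round(low_threshold, 2), round(safety_value, 2))
```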

S303, when the total utility value of all the transactions accumulated and inserted into the database is less than the utility safety value, determining the average utility list of each 1-item set in the data set which is not mined.

Wherein, the average utility list of each 1-item set at least records the respective target transaction containing the 1-item set and the utility value of the 1-item set in each target transaction.

Once the target transactions to which a 1-item set belongs and the utility value of the 1-item set in each target transaction are determined, the average utility value of the 1-item set can be calculated. Meanwhile, each 1-item set corresponds to one average utility list, so that when item sets are subsequently extended, the target transactions of the extended 2-item sets, 3-item sets, and other item sets, as well as the average utility values of these item sets, can be determined from the average utility lists, which provides a basis for screening the high average utility item sets.

Optionally, the average utility list of the 1-item set may further include the maximum utility value of each target transaction. It can be understood that, in order to reduce the extension of useless item sets while other item sets are extended from the 1-item set, the application may extend item sets according to a pruning policy. Extending item sets under the pruning policy requires the average utility boundary of each item set, which is constrained by the average utility boundary of the item set it was extended from, and the average utility boundary of an item set is the sum of the maximum utility values of the transactions that contain the item set. Recording the maximum utility value of each target transaction in the average utility list of the 1-item set therefore helps to determine the average utility boundary of item sets quickly later on, which in turn helps to reduce the extension of useless, redundant item sets.

It can be understood that, once the at least one transaction included in a data set is determined, the 1-item sets included in the data set can be determined from the data items recorded by each transaction in the data set. Each distinct data item is a 1-item set, so at least one non-repeated 1-item set is found. Correspondingly, the utility value of each 1-item set in a target transaction containing it can be determined from the external utility value of each data item in the preset external utility table, the data items in each transaction in the data set, and the internal utility values of the data items. Likewise, for each target transaction in which a 1-item set is located, the maximum utility value of the target transaction may be calculated from the data items in the target transaction and their internal utility values, in combination with the external utility table.

For example, assume that a data set inserted into a database contains transactions as follows in Table 3:

TABLE 3

Transaction number	Transaction (item: quantity)
T8	a:1, e:10, f:5
T9	c:1, d:3
T10	b:1

Then, as shown in table 3 above, the data set includes three transactions, i.e. transaction T8, transaction T9 and transaction T10, and as can be seen from these 3 transactions, the 1-item set included in the data set includes: the item set a, the item set b, the item set c, the item set d, the item set e, and the item set f, and the transaction in which the item sets exist may also be obtained, for example, the transaction in which the item set a exists is a transaction T8.

Also, in combination with the profit table shown in Table 2, the utility value of each 1-item set of the data set in each transaction, and the maximum utility value of each transaction, can be calculated.

For example, based on the data set of Table 3 and the profit table of Table 2, the average utility list of each 1-item set in the data set of Table 3 can be obtained, as shown in FIG. 4b. As can be seen from fig. 4b, the item set a, the item set b, the item set c, the item set d, the item set e, and the item set f in the data set each have a respective average utility list. The cell 401 in the first row of an average utility list records the identification of the 1-item set or the data item contained in the 1-item set. Starting from the second row of the average utility list, the first column 402 records the transaction identification of the target transaction in which the 1-item set is located, the second column 403 records the utility value of the 1-item set in the target transaction, and the third column 404 records the maximum utility value of the target transaction.

For example, taking the item set d as an example, as can be seen from table 3, the item set d exists in the transaction T9. The profit value of d in table 2 is 7, and the internal utility value of the item set d in the transaction T9 is 3, so the utility value of the item set d in the transaction T9 is 3 × 7 = 21. Since the utility value of the data item c in the transaction T9 is 5 and the utility value of the data item d in the transaction T9 is 21, the maximum utility value of the transaction T9 is 21. Accordingly, it can be seen from FIG. 4b that the number "9" in the second row and first column of the average utility list of the 1-item set { d } indicates that the item set d exists in transaction number 9, i.e., the item set { d } exists in transaction T9. Meanwhile, it can be seen from FIG. 4b that in the data set of Table 3, the utility value of the item set { d } in the transaction T9 is 21, and the maximum utility value of the transaction T9 is 21.

The meaning of each parameter and the calculation of the parameter value in the average utility list of the other 1-item set in fig. 4b are similar to the item set d, and are not repeated here.
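The construction of these average utility lists can be sketched as follows. This is a minimal sketch under stated assumptions: each list row is modelled as a tuple (transaction number, utility of the item, maximum utility of the transaction), the function name is hypothetical, and Table 2 is not reproduced here, so only the profit values of c (5) and d (7) follow from the worked example while the values for a, b, e, and f are placeholders.

```python
from collections import defaultdict


def build_average_utility_lists(transactions, profit):
    """Return {item: [(tid, utility, max_utility_of_transaction), ...]}.

    transactions maps a transaction number to {item: quantity};
    every transaction is assumed to contain at least one item.
    """
    lists = defaultdict(list)
    for tid, items in transactions.items():
        # Utility of an item = internal utility (quantity) * external utility (profit).
        utilities = {item: qty * profit[item] for item, qty in items.items()}
        max_utility = max(utilities.values())
        for item, utility in utilities.items():
            lists[item].append((tid, utility, max_utility))
    return dict(lists)


# c and d follow from the worked example; a, b, e, f are placeholder profits.
profit = {"a": 1, "b": 1, "c": 5, "d": 7, "e": 1, "f": 1}
table3 = {
    8: {"a": 1, "e": 10, "f": 5},
    9: {"c": 1, "d": 3},
    10: {"b": 1},
}
lists = build_average_utility_lists(table3, profit)
print(lists["d"])  # row for transaction T9
```

With these inputs the row for item set d matches the values stated for FIG. 4b: transaction 9, utility 21, maximum utility 21.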

S304, obtaining the average utility list of at least one 1-item set with item set extension conditions in the stored original database.

Wherein the original database is the database before inserting the data set, that is, the original database includes: the data portion preceding the data set is inserted into the database. To distinguish from the database into which the data set is inserted, the present application refers to the database before the data set is inserted as the original database.

It will be appreciated that if no other transactions were inserted into the database before the data set was inserted, the original database is actually the database that was most recently fully scanned to mine the high average utility item sets.

The average utility list here is the same as the average utility list of the 1-item set described above, and is not described again.

A 1-item set having the item set extension condition is a 1-item set, determined when the high average utility item sets were mined in the original database, whose average utility boundary is not less than the low average utility threshold corresponding to the original database.

For convenience of understanding and description, take the case where the original database is a database into which no transactions have been inserted; that is, the current database is obtained by inserting the data set into the original database, and the high average utility item sets of the original database were mined by scanning its transactions. In this case, in the process of mining the high average utility item sets from the original database, the 1-item sets in the original database also have to be mined, and the 1-item sets whose average utility boundary is greater than the high average utility threshold have to be selected from them. In the embodiment of the application, in the process of mining the original database, the average utility lists of the 1-item sets whose average utility boundary is greater than the high average utility threshold are stored, and the average utility lists of the 1-item sets whose average utility boundary is not greater than the high average utility threshold but is greater than the low average utility threshold are stored as well. That is, in the process of mining the high average utility item sets from the original database, the 1-item sets belonging to HAUUBI and PAUUBI in the original database are determined and stored as the 1-item sets having the item set extension condition.

In combination with the above research and analysis of the nine possible cases of an item set after a data set is inserted into the database, it can be seen that, after the data set is inserted, if the total utility value of the data set is smaller than the safety utility value corresponding to the original database before the insertion, a 1-item set that is a small item set in the original database cannot become HAUUBI or PAUUBI, i.e., other item sets cannot be extended from it. Therefore, the original database does not need to be scanned to mine the 1-item sets that are small item sets.

Moreover, after the data set is inserted into the original database, if the total utility value of the data set is less than the safety utility value of the original database, only the 1-item sets that belong to HAUUBI or PAUUBI in the original database have a high probability of belonging to HAUUBI. That is, after a data set is inserted into the original database, only the 1-item sets belonging to HAUUBI or PAUUBI in the original database may become 1-item sets from which other item sets can be extended. Therefore, the 1-item sets belonging to HAUUBI or PAUUBI determined in the original database are stored in advance, so that the database does not need to be rescanned to determine, from the original database, the 1-item sets from which other item sets can be extended.

For example, taking table 1 as the original database, there are six 1-item sets in the original database: the item set a, the item set b, the item set c, the item set d, the item set e, and the item set f. An average utility list is constructed for each of the six 1-item sets according to each transaction in table 1 and the profit table in table 2, so that the average utility boundaries of the six 1-item sets can be obtained from these lists. Then, based on the high average utility threshold and the low average utility threshold of the original database, it can be derived that the average utility boundary of item set e is less than the low average utility threshold, while the average utility boundaries of the remaining five 1-item sets are all greater than the low average utility threshold. Accordingly, the computer device can store the average utility lists of the item set a, the item set b, the item set c, the item set d, and the item set f.
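The selection of 1-item sets to store can be sketched as a comparison of each average utility boundary (the sum of the maximum utility values of the target transactions) against the two thresholds. The function names and the example lists x, y, and z below are hypothetical, since Tables 1 and 2 are not reproduced here. Treating a boundary exactly equal to a threshold as meeting it is also an assumption, because the description uses both "greater than" and "not less than".

```python
def average_utility_boundary(rows):
    """Sum of the maximum utility values of the transactions containing the item set."""
    return sum(max_utility for _tid, _utility, max_utility in rows)


def classify_one_item_set(rows, high_threshold, low_threshold):
    """Label a 1-item set as HAUUBI, PAUUBI, or small by its average utility boundary."""
    bound = average_utility_boundary(rows)
    if bound >= high_threshold:
        return "HAUUBI"
    if bound >= low_threshold:
        return "PAUUBI"
    return "small"


def lists_to_store(lists, high_threshold, low_threshold):
    """Keep only the lists of 1-item sets that have the item set extension condition."""
    return {
        item: rows
        for item, rows in lists.items()
        if classify_one_item_set(rows, high_threshold, low_threshold) != "small"
    }


# Hypothetical rows: (transaction number, utility, maximum utility of the transaction).
example_lists = {
    "x": [(1, 10, 15), (2, 4, 12)],  # boundary 27
    "y": [(3, 6, 12)],               # boundary 12
    "z": [(4, 2, 8)],                # boundary 8
}
stored = lists_to_store(example_lists, high_threshold=20.9, low_threshold=10.45)
print(sorted(stored))
```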

For example, see FIG. 4a, which shows the average utility list of the 1-item set belonging to either HAUUBI or PAUUBI in the original database shown in Table 1. To the left of the dotted line in fig. 4a is the average utility list of the 1-item set belonging to the HAUUBI, e.g., the average utility list comprising item set d, item set c, item set a, and item set b. And the right side of the dashed line is the average utility list of the 1-item set belonging to PAUUBI, e.g., the average utility list comprising the item set f. The average utility list includes an item set identification area 401 in which item set identifications are recorded, rows of a column 402 respectively record transaction numbers of transactions in which the item sets are located, rows of a column 403 are used for recording average utility values of 1-item sets in the transactions, and rows of a column 404 respectively record maximum utility values in the transactions.

It will be appreciated that if other transactions are inserted into the database before each transaction in the data set is inserted into the database, and a high average utility set of items has been determined from the database before the data set is inserted, the average utility list of at least one 1-item set in the original database subject to the item set extension condition is actually made up of: the average utility list of at least one 1-item set with item set extension conditions in the database without any transaction inserted, and the average utility list corresponding to each of at least one 1-item set contained in a plurality of transactions inserted into the database before the data set.

S305, determining the high average utility item set in the database after the data set is inserted, according to the average utility list of each 1-item set in the data set and the average utility list of at least one 1-item set with item set extension conditions in the original database.

It can be understood that after the average utility lists of the 1-item sets in the data set and the average utility lists of the 1-item sets with item set extension conditions in the original database are determined, the 1-item sets corresponding to these average utility lists include the 1-item sets from which 2-item sets, 3-item sets, and so on can be further extended in the current database (i.e., the database into which the data set is inserted).

Correspondingly, based on the average utility lists of the 1-item sets in the data set and the average utility list of at least one 1-item set with item set extension conditions in the original database, the average utility lists of the 1-item sets from which item sets can be further extended in the current database can be obtained.

Optionally, the average utility list of each 1-item set in the data set and the average utility list of at least one 1-item set having item set extension conditions in the original database may be merged to obtain an average utility list of at least one 1-item set currently used for extending the item set in the database; then, based on the merged average utility list of the at least one 1-item set, analyzing the high average utility item set currently existing in the database.

The specific merging situation can be divided into the following two types:

If the 1-item sets of the data set and the 1-item sets of the original database include the same 1-item set, then for each such 1-item set, its average utility list in the data set is merged with its average utility list in the original database to obtain the average utility list of the 1-item set in the current database. Merging the two lists means combining the target transactions of the 1-item set in the data set and in the original database, the utility value of the 1-item set in each target transaction, and the maximum utility value of each target transaction into one average utility list.

For a 1-item set contained in the data set but not contained in at least one 1-item set having an item set expansion condition in the original database, or a 1-item set contained in at least one 1-item set having an item set expansion condition in the original database but not contained in the data set, the average utility list of the 1-item set is the average utility list of the 1-item set in the current database, and therefore, the average utility list of the 1-item set can be maintained unchanged.

For example, the average utility list of 5 1-item sets still having the item set extension condition in the original database is shown in fig. 4a, and the average utility list of each 1-item set in the data set is shown in fig. 4 b.

As can be seen from fig. 4a and 4b, average utility lists of item set a, item set b, item set c, item set d, and item set f exist in fig. 4a and 4b, and for each of these several item sets, the average utility lists of the item sets in fig. 4a and 4b are merged into one average utility list.

For example, taking the item set a as an example, the average utility list of the item set a in the data set contains one record, which records the transaction T8 to which the item set a belongs, the utility value of the item set a in the transaction T8, and the maximum utility value of the transaction T8. A row holding this record may therefore be added to the average utility list of the item set a in the original database, resulting in the merged average utility list of the item set a. For example, referring to fig. 5, which shows the average utility list of each item set after merging, it can be seen that, compared with the average utility list of the item set a in the original database, the merged average utility list of the item set a adds the utility value of the item set a in the transaction T8 and the maximum utility value of the transaction T8.

The merging of the average utility lists of the item set b, the item set c, the item set d and the item set f is similar to the merging process of the average utility list of the item set a, and is not repeated here.

Meanwhile, as can be seen from comparing fig. 4a and fig. 4b, there is an average utility list of the item set e in the data set, and the stored 1-item set with the item set extension condition in the original database does not include the item set e, in which case, the average utility list of the merged item set e remains unchanged, as shown in fig. 5.
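The two merging cases can be sketched with dictionaries keyed by item. This is an illustrative sketch: the function name and the row values are hypothetical, and rows are again modelled as (transaction number, utility, maximum utility of the transaction).

```python
def merge_average_utility_lists(stored, inserted):
    """Merge stored lists of the original database with lists of the inserted data set.

    Shared 1-item sets get the inserted rows appended to their stored rows;
    1-item sets present on only one side keep their list unchanged.
    """
    merged = {item: list(rows) for item, rows in stored.items()}  # copy, do not mutate input
    for item, rows in inserted.items():
        merged.setdefault(item, []).extend(rows)
    return merged


# Hypothetical rows for illustration.
stored = {"a": [(1, 3, 20)], "f": [(2, 4, 9)]}
inserted = {"a": [(8, 1, 10)], "e": [(8, 10, 10)]}
merged = merge_average_utility_lists(stored, inserted)
print(merged["a"])
```

Item set a gains one row for transaction 8, item set f is carried over from the stored lists, and item set e keeps the list built from the inserted data set, mirroring the cases described for fig. 5.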

It is understood that, based on the merged average utility list of the at least one 1-item set, the specific implementation manner of determining the high average utility item set from the database may be various, and the application is not limited thereto.

It can be understood that, in order to screen out the high average utility item set, the total utility value of the current database needs to be determined based on the total utility value of the data set and the total utility value of the original database; and then, determining a high average utility threshold of the current database according to the total utility value of the current database. Accordingly, a high average utility item set can be determined from the current database according to the determined average utility list of the 1-item set in the data set and the stored average utility list of the 1-item set of the original database, in combination with the high average utility threshold of the current database.

And the total utility value of the database at the current moment is the sum of the total utility value of the data set and the total utility value of the original database.

The high average utility threshold of the current database is related to the total utility value of the current database, e.g., the high average utility threshold of the database is the product of the total utility value of the database and a first preset percentage.
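As a small numeric sketch of the two statements above, the threshold of the current database can be recomputed from the two totals. The function name is hypothetical, and the inserted total of 41 is an illustrative value rather than one computed from Table 3.

```python
def updated_high_threshold(original_total_utility, inserted_total_utility, s_u):
    """Return (total utility of the current database, its high average utility threshold)."""
    current_total = original_total_utility + inserted_total_utility
    return current_total, current_total * s_u


# 209 is the total of the Table 1 example; 41 is an assumed total for the inserted data set.
current_total, high_threshold_now = updated_high_threshold(209, 41, 0.10)
print(current_total, round(high_threshold_now, 2))
```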

S306, when the total utility value of all the transactions inserted into the database is not less than the utility safety value, determining all the 1-item sets contained in the database according to all the transactions in the database at the current moment, and determining the high average utility item set in the database based on all the determined 1-item sets.

That is, when the total utility value of each transaction in the data set is greater than or equal to the utility security value, the database inserted with the data set needs to be rescanned to re-determine the 1-item sets contained in the database, then determine which of the 1-item sets belong to the high average utility item set, select the 1-item sets capable of expanding the item sets from the 1-item sets, and finally expand the item sets downwards based on the selected 1-item sets and determine which of the expanded item sets belong to the high average utility item set.

Here, step S306 is an optional step; it is described only to aid understanding of how the computer device scans the database when the total utility value of the inserted transactions is not less than the utility safety value.

It can be seen that, in the embodiment of the present application, when a newly inserted data set in the database has not yet been processed, if the total utility value of all transactions cumulatively inserted into the database is less than the utility safety value, then the 1-item sets whose average utility boundary was less than the low average utility threshold of the original database before the data set was inserted still do not have the item set extension condition. Therefore, as long as the 1-item sets of the original database other than those whose average utility boundary is less than the low average utility threshold are available, the 1-item sets having the item set extension condition can be determined without rescanning the transactions in the original database.

Therefore, when the total utility value of all transactions inserted into the database is smaller than the safety utility value, the application can determine the high average utility item sets in the database into which the data set is inserted merely by obtaining the pre-stored average utility list of each 1-item set having the item set extension condition in the original database and determining the average utility list of each 1-item set in the newly inserted data set. Only the newly inserted data set needs to be scanned, without rescanning each transaction in the original database, so the time consumed for scanning the database is greatly reduced. Meanwhile, the amount of data that needs to be scanned is greatly reduced, and so are the computing resources consumed for mining the high average utility item sets.

It is understood that, after the at least one 1-item set currently used for expanding item sets in the database and its average utility list are determined, the 1-item sets with high average utility may be determined from the at least one 1-item set. Then, 2-item sets, 3-item sets, and so on are expanded downward layer by layer from the at least one 1-item set used for expanding item sets until the item set expansion is completed. However, when item sets are expanded layer by layer from the 1-item sets, it is easy to generate item sets that cannot become high average utility item sets, commonly called redundant item sets, which reduces computing efficiency. To reduce redundant item sets and improve computing efficiency, the method may prune during item set expansion using the average utility boundary, which has the downward closure property.

For example, referring to fig. 6, a schematic diagram of an implementation flow for determining a high average utility item set according to an average utility list of each 1-item set in the data set and an average utility list of at least one 1-item set with an item set extension condition in the original database is shown, where the flow may include:

S601, merging the average utility list of each 1-item set in the data set and the average utility list of at least one 1-item set with item set expansion conditions in the original database to obtain the average utility list of at least one 1-item set used for expanding the item sets in the database.

For step S601, reference may be made to the related description of merging the average utility lists in step S305, and details are not described herein again.
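The merge in step S601 can be pictured with the following minimal sketch. It is an illustration only, not the implementation of the application: the function name `merge_au_lists`, the tuple layout, and all sample values are hypothetical. It also assumes that a 1-item set without a stored list in the original database is not carried forward, since such a 1-item set did not have the item set expansion condition there.

```python
from typing import Dict, List, Tuple

# One entry of an average utility list (hypothetical layout):
# (transaction id, utility of the item in the transaction, maximum utility of the transaction)
Entry = Tuple[str, int, int]
AUList = List[Entry]


def merge_au_lists(stored: Dict[str, AUList], inserted: Dict[str, AUList]) -> Dict[str, AUList]:
    """Append the entries found in the inserted data set to the stored lists.

    Assumption: only 1-item sets that already have a stored list are kept.
    """
    merged: Dict[str, AUList] = {}
    for item, entries in stored.items():
        merged[item] = entries + inserted.get(item, [])
    return merged


# Hypothetical sample values, not taken from the figures.
stored_lists = {"d": [("T1", 21, 21)]}
inserted_lists = {"d": [("T9", 7, 10)], "b": [("T9", 3, 10)]}
merged_lists = merge_au_lists(stored_lists, inserted_lists)
```

Under these assumptions, the merged list for item d contains the stored entry followed by the entry from the inserted data set, and item b is left out.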

S602, determining a high average utility threshold of the database at the current moment.

It will be appreciated that after the data set is inserted into the database, the total utility value of the database may change, so that the high average utility threshold of the database also differs from the high average utility threshold of the original database before the data set was inserted.

The high average utility threshold of the current database is determined according to the total utility value of the database at the current time, which may specifically refer to the related description above, and will not be described herein again.

S603, according to the high average utility threshold of the database and the average utility list of at least one 1-item set used for expanding the item set, selecting the 1-item set with high average utility from the at least one 1-item set used for expanding the item set.

For example, the average utility value of the 1-item set in the database at present can be obtained by adding the utility values of the 1-item set in the transactions recorded in the average utility list of the 1-item set, and if the average utility value of the 1-item set is greater than the high average utility threshold of the database at the present moment, the 1-item set is the high average utility item set.

For example, referring to fig. 5, assuming that the high average utility threshold of the database at the current time is 70, it can be determined from fig. 5 that the average utility value of the item set {d} in the database is (21+7+14+7+7+21)/1 = 77. Since this value is greater than the high average utility threshold of the database, the item set {d} is a high average utility item set. By contrast, according to the average utility list of the item set {e} in fig. 5, the average utility value of the item set {e} in the database is 60, which is less than the high average utility threshold of the database, so the item set {e} does not belong to the high average utility item sets.
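The check for item set {d} can be reproduced with a short sketch. The utility values and the threshold of 70 come from the example above; the function name `average_utility` is a hypothetical helper introduced only for illustration.

```python
def average_utility(utilities, itemset_length):
    """Sum the utilities of the item set over its transactions, then divide by its length."""
    return sum(utilities) / itemset_length


HIGH_AVERAGE_UTILITY_THRESHOLD = 70  # value assumed in the example above

# Utility values of item set {d} in the transactions of its average utility list.
d_utilities = [21, 7, 14, 7, 7, 21]

au_d = average_utility(d_utilities, 1)
d_is_high = au_d > HIGH_AVERAGE_UTILITY_THRESHOLD

# The example gives only the resulting average utility of item set {e}.
au_e = 60
e_is_high = au_e > HIGH_AVERAGE_UTILITY_THRESHOLD
```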

S604, for each 1-item set used for expanding the item sets, according to a preset pruning strategy, a high average utility threshold of the database at the current moment and the maximum utility value of each target transaction recorded in the average utility list of the 1-item set, and with the 1-item set as a reference, expanding the item sets, and determining the item sets with high average utility from the expanded item sets.

Wherein, the pruning strategy is as follows: when the average utility boundary of a set of items is greater than the high average utility threshold of the database, then the set of items can continue to expand the set of items.

For example, for each 1-item set used to expand the item set, the average utility boundary of the 1-item set can be determined from the average utility list of the 1-item set; if the average utility boundary of the 1-item set is greater than the high average utility threshold of the database at the current moment, the 1-item set is combined with other 1-item sets used for expanding the item set into 2-item sets. For example, according to the average utility lists of the 1-item sets shown in fig. 5, the average utility boundary of the item set a is 21+7+15+6+15+60 = 124, and the average utility boundaries of the item set b, the item set c, the item set d, the item set e, and the item set f are, in order, 59, 87, 86, 60, and 95. If the high average utility threshold of the database is 70, the average utility boundaries of both item set b and item set e are less than the high average utility threshold of the database, so item set b and item set e cannot be used to expand 2-item sets. The item set a, the item set c, the item set d, and the item set f can be merged with one another to expand 2-item sets.
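The pruning check in this example can be sketched as follows. The numbers are the ones quoted above (boundaries 124, 59, 87, 86, 60, and 95 for items a to f, threshold 70); the variable names are hypothetical.

```python
HIGH_AVERAGE_UTILITY_THRESHOLD = 70

# Maximum utility values of the transactions that contain item a (from the example).
a_transaction_max_utilities = [21, 7, 15, 6, 15, 60]

# The average utility boundary is the sum of those maximum utility values.
boundary_a = sum(a_transaction_max_utilities)

# Boundaries of the remaining 1-item sets as stated in the example.
boundaries = {"a": boundary_a, "b": 59, "c": 87, "d": 86, "e": 60, "f": 95}

# Pruning strategy: only item sets whose boundary exceeds the threshold may be expanded.
expandable = sorted(item for item, bound in boundaries.items()
                    if bound > HIGH_AVERAGE_UTILITY_THRESHOLD)
pruned = sorted(set(boundaries) - set(expandable))
```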

For each 1-item set capable of continuing to expand the item set, a K-item set can be formed by expanding in an enumeration tree manner, where K may be a natural number greater than or equal to 2. The form of the enumeration tree may be as shown in fig. 7.

Fig. 7 mainly shows how K-item sets are expanded downward from the 1-item set d. As can be seen in fig. 7, 2-item sets can be expanded from the item set {d} together with the other 1-item sets: item set d can be merged with item set a, item set c, and item set f into the 2-item sets da, dc, and df.
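As a rough illustration of this expansion step, the candidate 2-item sets can be enumerated from the 1-item sets that survived pruning. The sketch below uses `itertools.combinations` from the Python standard library; this is an assumed way to enumerate pairs and may not match the exact order of the enumeration tree in fig. 7. Item order inside an item set is not significant here, so "ad" denotes the same item set as da.

```python
from itertools import combinations

# 1-item sets allowed to expand, taken from the example in the description.
expandable = ["a", "c", "d", "f"]

# Every unordered pair of expandable 1-item sets is a candidate 2-item set.
candidate_pairs = [frozenset(pair) for pair in combinations(expandable, 2)]

# Candidates that contain item d, written as sorted strings for readability.
pairs_with_d = sorted("".join(sorted(pair)) for pair in candidate_pairs if "d" in pair)
```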

If at least one 2-item set is merged in the above manner, it is first determined whether each of the merged 2-item sets belongs to the high average utility item sets. To quickly determine whether each merged 2-item set belongs to the high average utility item sets, an average utility list of the 2-item set can be constructed from the average utility lists of the two 1-item sets from which the 2-item set was merged.

For example, for a set of items dc, a list of average utilities for the set of items dc can be constructed from the list of average utilities for the set of items d and the list of average utilities for the set of items c. Still taking the average utility lists of the item set d and the item set c in the updated database in fig. 5 as an example, the process of merging the average utility list of the item set dc based on the average utility lists of the item set d and the item set c is shown in fig. 8.

As can be seen from fig. 8, the transactions containing both item set d and item set c can be determined from the average utility lists of item set d and item set c. Then, for a transaction containing both item set c and item set d, the utility value of item set c in the transaction may be added to the utility value of item set d in the transaction to obtain the utility value of item set dc in the transaction; the maximum utility value of the transaction is unchanged. For example, the transactions that contain both item set c and item set d in fig. 8 are transaction T1, transaction T4, transaction T7, and transaction T9. For the transaction T1, the utility value of the item set dc is the sum of the utility value of the item set d in the transaction T1 and the utility value of the item set c in the transaction T1, that is, 21 + 10 = 31. The maximum utility value of transaction T1 is still 21.
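The list construction described above can be sketched as a join on transaction identifiers. This is a minimal illustration: the function name `join_au_lists` and the dictionary layout are hypothetical, the T1 values are those quoted above, and transaction T2 is an invented entry added only to show that transactions not shared by both item sets are dropped.

```python
from typing import Dict, Tuple

# tid -> (utility of the item set in the transaction, maximum utility of the transaction)
AUList = Dict[str, Tuple[int, int]]


def join_au_lists(first: AUList, second: AUList) -> AUList:
    """Build the list of a 2-item set from the lists of its two 1-item sets."""
    joined: AUList = {}
    for tid in first.keys() & second.keys():  # transactions containing both items
        utility_first, max_utility = first[tid]
        utility_second, _ = second[tid]
        # Utilities add up; the maximum utility of the transaction is unchanged.
        joined[tid] = (utility_first + utility_second, max_utility)
    return joined


d_list = {"T1": (21, 21), "T2": (7, 7)}  # T2 is hypothetical
c_list = {"T1": (10, 21)}
dc_list = join_au_lists(d_list, c_list)
```

Plain addition is only valid here because the two 1-item sets share no item; joining larger item sets that overlap would count the shared items twice, so this sketch should not be reused unchanged for that case.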

Accordingly, for each merged 2-item set, after the average utility list of the 2-item set is constructed, the average utility value of the 2-item set in the updated database can be determined according to the average utility list of the 2-item set. Specifically, the utility values of the 2-item set in each transaction recorded in the average utility list of the 2-item set may be summed, and the sum divided by the length of the item set, 2, to obtain the average utility value of the 2-item set in the updated database. For example, taking the item set dc in fig. 8, the utility value of the item set dc in the database is 31+29+22+26 = 108, so the average utility value of the item set dc in the database is 108/2 = 54. This average utility value is less than the high average utility threshold of the database, 70, so the item set dc does not belong to the high average utility item sets.
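The arithmetic for item set dc can be checked with a few lines. The four utility values and the threshold of 70 are those quoted above; the variable names are hypothetical.

```python
HIGH_AVERAGE_UTILITY_THRESHOLD = 70

# Utility values of item set dc in the transactions that contain both d and c.
dc_utilities = [31, 29, 22, 26]
dc_length = 2  # number of items in the item set

dc_total_utility = sum(dc_utilities)
dc_average_utility = dc_total_utility / dc_length
dc_is_high = dc_average_utility > HIGH_AVERAGE_UTILITY_THRESHOLD
```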

For each merged 2-item set, it is further necessary to analyze whether the average utility boundary of the 2-item set is greater than the high average utility threshold of the database according to a pruning policy, and if the average utility boundary of the 2-item set is greater than the high average utility threshold of the database, the 2-item set may continue to expand the 3-item set. The process of determining the average utility boundary of the 2-item set according to the average utility list of the 2-item set is similar to the process of determining the average utility boundary of the 1-item set, and is not repeated herein.

For each 2-item set that may continue to expand the 3-item set, any two 2-item sets may be merged into a 3-item set containing 3 data items, e.g., may be expanded with reference to the enumeration tree shown in FIG. 7. Of course, after each item set is expanded, it is necessary to determine whether the item set belongs to the item set with high average utility and whether the item set can continue to expand other item sets until all item sets with high average utility are mined.
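The overall expand-and-prune loop can be sketched as a depth-first search. This is a simplified illustration under stated assumptions rather than the procedure of the application: it extends an item set by one item at a time instead of merging two item sets of equal length, and the function name `mine_high_average_utility`, the data layout, and the sample transactions are all hypothetical.

```python
from typing import Dict, FrozenSet, Tuple

# item -> {tid: (utility of the item in the transaction, maximum utility of the transaction)}
ItemLists = Dict[str, Dict[str, Tuple[int, int]]]


def mine_high_average_utility(item_lists: ItemLists, threshold: float) -> Dict[FrozenSet[str], float]:
    """Return every item set whose average utility exceeds the threshold."""
    def boundary(au_list):
        # Average utility boundary: sum of the maximum utilities of the listed transactions.
        return sum(max_utility for _, max_utility in au_list.values())

    # Keep only 1-item sets that are allowed to expand.
    items = sorted(i for i in item_lists if boundary(item_lists[i]) > threshold)
    result: Dict[FrozenSet[str], float] = {}

    def expand(prefix, prefix_list, start):
        for index in range(start, len(items)):
            item = items[index]
            if prefix:
                shared = prefix_list.keys() & item_lists[item].keys()
                candidate = {tid: (prefix_list[tid][0] + item_lists[item][tid][0],
                                   prefix_list[tid][1]) for tid in shared}
            else:
                candidate = item_lists[item]
            itemset = prefix + (item,)
            average = sum(utility for utility, _ in candidate.values()) / len(itemset)
            if average > threshold:
                result[frozenset(itemset)] = average
            if boundary(candidate) > threshold:  # pruning strategy
                expand(itemset, candidate, index + 1)

    expand((), {}, 0)
    return result


# Hypothetical transactions: T1 = {a: 10, b: 2}, T2 = {a: 8, c: 6}, T3 = {a: 9, b: 1, c: 7}.
sample_lists = {
    "a": {"T1": (10, 10), "T2": (8, 8), "T3": (9, 9)},
    "b": {"T1": (2, 10), "T3": (1, 9)},
    "c": {"T2": (6, 8), "T3": (7, 9)},
}
mined = mine_high_average_utility(sample_lists, threshold=12)
```

With the sample data and a threshold of 12, the sketch reports the item sets {a}, {c}, and {a, c}; the candidate {a, b} is examined but not reported because its average utility is 11.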

It is to be understood that transactions may be inserted into the database multiple times. Therefore, in any of the above embodiments, to reduce how often the database must be rescanned when transactions are inserted again later, the present application may further store the average utility list of each 1-item set in the data set, together with the average utility list of the at least one 1-item set with the item set expansion condition in the original database, as the average utility lists of the 1-item sets with the item set expansion condition corresponding to the database after the data set is inserted.

Optionally, in the process of mining the high average utility item sets of the database into which the data set is inserted, if the average utility list of each 1-item set in the data set and the average utility list of at least one 1-item set having the item set expansion condition in the original database have been merged, the resulting average utility list of the at least one 1-item set used for expanding the item set may be stored as the average utility list of the at least one 1-item set having the item set expansion condition corresponding to the database at the current time.

For convenience of understanding, taking a case that after a database is scanned and a high average utility item set is mined, a transaction is first inserted into the database as an example, for example, referring to fig. 9, which shows a flowchart of another embodiment of the mining method for a high average utility item set of the present application, the method of this embodiment may include:

S901, when detecting that a data set containing at least one transaction is inserted into a database, determining the total utility value of each transaction in the data set, and caching the total utility value in a cache region.

In this step, the insertion of a data set is detected for the first time after the database was scanned and its high average utility item sets were mined. The cache region therefore does not yet hold any total utility value of transactions inserted into the database, so the total utility value of the currently inserted data set is the total utility value of all transactions inserted into the database since the database was last scanned.

S902, obtaining a utility safety value corresponding to the initial total utility value of the original database.

The original database is the database before the data set is inserted. Since no other data set was inserted into the database before this data set, in this embodiment the total utility value of the original database is the initial total utility value.

S903, when the cached total utility value is determined to be smaller than the utility safety value, determining an average utility list of each 1-item set in the currently inserted data set.

That is, if the total utility value of the transactions currently inserted into the database is smaller than the utility safety value, an average utility list corresponding to each 1-item set included in the data set is constructed from the inserted data set, so that only the currently inserted data set needs to be scanned.
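A single pass over the inserted data set is enough to build these lists, as the following sketch suggests. The function name `build_au_lists`, the input layout (transaction id mapped to the utility of each item in that transaction), and the sample transactions are hypothetical; each list entry records the transaction, the utility of the item in it, and the maximum utility value of the transaction.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Entry = Tuple[str, int, int]  # (transaction id, item utility, maximum utility of the transaction)


def build_au_lists(transactions: Dict[str, Dict[str, int]]) -> Dict[str, List[Entry]]:
    """Scan the inserted transactions once and build one list per 1-item set."""
    lists: Dict[str, List[Entry]] = defaultdict(list)
    for tid, item_utilities in transactions.items():
        max_utility = max(item_utilities.values())
        for item, utility in item_utilities.items():
            lists[item].append((tid, utility, max_utility))
    return dict(lists)


# Hypothetical inserted data set with two transactions.
inserted_data_set = {
    "T11": {"a": 5, "d": 7},
    "T12": {"d": 3},
}
new_lists = build_au_lists(inserted_data_set)
```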

S904, obtaining the cached average utility list of the 1-item set belonging to the HAUUBI and the PAUUBI in the original database.

The specific contents included in the average utility list in step S903 and step S904 may refer to the related descriptions in the foregoing embodiments, and are not described herein again.

S905, merging the average utility list of each 1-item set in the data set and the average utility list of at least one 1-item set with item set expansion conditions in the original database to obtain the average utility list of at least one 1-item set used for expanding the item sets in the database.

S906, based on the average utility list of at least one 1-item set used for expanding the item set, determining a high average utility item set in the database after the data set is inserted.

The steps S905 and S906 can be referred to the related description of the previous embodiment, and are not described herein again.

S907, storing the average utility list of the at least one 1-item set used for expanding the item set as the average utility list of the at least one 1-item set with item set expansion conditions corresponding to the current database.

It can be understood that, after the at least one 1-item set with the item set expansion condition corresponding to the database at the current time is stored, another data set including at least one transaction may be inserted into the database at a later time. When the computer device detects that the new data set is inserted into the database, it calculates the total utility value of the transactions in the newly inserted data set and adds it to the total utility value cached in the cache region, so that the cached total utility value is the sum of the total utility value cached before the current time and the currently calculated total utility value.

Based on the above, the computer device obtains a utility safety value. The utility safety value is still determined based on the initial total utility value of the original database (the database before any data set was inserted, that is, the same original database as in the embodiment of fig. 9), and is the same as the utility safety value in step S902 of the embodiment of fig. 9. The total utility value accumulated in the cache region is then compared with the utility safety value. If the total utility value accumulated in the cache region is less than the utility safety value, the average utility list of each 1-item set included in the currently inserted new data set is determined, and the most recently stored average utility list of at least one 1-item set having the item set expansion condition in the database (i.e., the average utility list of each 1-item set stored in step S907) is obtained. Correspondingly, the high average utility item sets in the database after the new data set is inserted can be determined based on the average utility list of each 1-item set in the currently inserted new data set and the most recently stored average utility list of at least one 1-item set with the item set expansion condition.
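The accumulation and comparison across several insertions can be sketched with a small stateful helper. This is only an illustration: the class name `InsertionTracker` and the numeric values are hypothetical, and the utility safety value is taken as a given input because its computation is described elsewhere in the application.

```python
class InsertionTracker:
    """Keep the cached total utility of inserted transactions and compare it with the safety value."""

    def __init__(self, utility_safety_value: float) -> None:
        self.utility_safety_value = utility_safety_value
        self.cached_total_utility = 0.0  # nothing inserted since the last full scan

    def on_insert(self, data_set_total_utility: float) -> bool:
        """Accumulate the new data set; return True if only the new data set must be scanned."""
        self.cached_total_utility += data_set_total_utility
        return self.cached_total_utility < self.utility_safety_value


# Hypothetical numbers: safety value 100, three successive insertions.
tracker = InsertionTracker(utility_safety_value=100)
decisions = [tracker.on_insert(40), tracker.on_insert(50), tracker.on_insert(20)]
```

In this sketch the first two insertions stay below the safety value, while the third pushes the accumulated total to 110, which corresponds to the case in which the database is rescanned.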

Of course, after determining the high average utility item set in the database into which the new data set is inserted, the average utility list of each 1-item set in the currently inserted new data set and the average utility list of the at least one recently stored 1-item set with the item set expansion condition are also stored as the average utility list of the at least one 1-item set with the item set expansion condition in the current database; or storing the average utility lists of the 1-item sets in the currently inserted new data set and the average utility lists of the 1-item sets after the average utility lists of at least one 1-item set with item set extension conditions stored last time are merged.

It can be understood that this embodiment takes as an example starting the mining of high average utility item sets once insertion of at least one transaction into the database is detected. In practical applications, if mining is instead triggered when a mining condition is satisfied, the data set that has not yet been mined in the database is likewise processed according to the scheme of this embodiment, which is not described herein again.

The application also provides a high average utility item set mining device. For example, referring to fig. 10, a schematic structural diagram of an embodiment of a high average utility item set mining device according to the present application is shown. As can be seen from fig. 10, the apparatus may include:

a transaction utility determining unit 1001, configured to determine, according to at least one transaction included in a data set that is inserted into a database and has not been mined, a total utility value of all transactions inserted in the database in an accumulated manner before a current time;

a safety value obtaining unit 1002, configured to obtain a utility safety value corresponding to an initial total utility value of the database, where the initial total utility value is a total utility value of the database before all the transactions are inserted into the database;

a first list obtaining unit 1003, configured to determine, when the total utility value of all the transactions is smaller than the utility safety value, an average utility list of each 1-item set in the data set, where the average utility list of each 1-item set at least records each target transaction including the 1-item set and the utility value of the 1-item set in each target transaction;

a second list obtaining unit 1004, configured to obtain the stored average utility list of at least one 1-item set having an item set expansion condition in the original database; the original database is the database before the data set is inserted; a 1-item set with the item set expansion condition is a 1-item set that was determined when high average utility item sets were mined from the original database and whose average utility boundary is greater than the low average utility threshold corresponding to the original database; the average utility boundary of an item set is the sum of the maximum utility values of all transactions containing the item set;

an item set mining unit 1005, configured to determine a high average utility item set in the database according to the average utility list of each 1-item set in the data set and the average utility list of at least one 1-item set in the original database having an item set extension condition.

Optionally, the apparatus may further include:

a database scanning unit, configured to determine, when the total utility value of all the transactions is not less than the utility safety value, all 1-item sets contained in the database according to all the transactions in the database, and determine the high average utility item sets in the database based on all the determined 1-item sets.

In one possible implementation, the transaction utility determination unit includes:

a transaction utility determining subunit, configured to determine, when it is detected that a data set containing at least one transaction is inserted into the database, the total utility value of all the transactions cumulatively inserted into the database before the current moment.

In a possible implementation manner, in the above embodiment, the item set mining unit may include:

a list merging subunit, configured to merge an average utility list of each 1-item set in the data set and an average utility list of at least one 1-item set having an item set expansion condition in the original database, to obtain an average utility list of at least one 1-item set used for expanding the item set in the database;

an item set mining subunit, configured to determine a high average utility item set in the database based on the average utility list of the at least one 1-item set used for expanding the item set.

Optionally, in the above embodiment, the average utility list of each 1-item set further includes: a maximum utility value for each of the target transactions;

the item set mining subunit may include:

the threshold value determining subunit is used for determining a high average utility threshold value of the database at the current moment;

a first item set mining subunit, configured to select a 1-item set with high average utility from the at least one 1-item set used for expanding the item set according to a high average utility threshold of the database and an average utility list of the at least one 1-item set used for expanding the item set;

and the second item set mining subunit is used for expanding the item sets according to a preset pruning strategy, a high average utility threshold value of the database and the maximum utility value of each target transaction recorded in the average utility list of the 1-item set and taking the 1-item set as a reference for each 1-item set used for expanding the item sets, and determining the item set with high average utility from the expanded item sets, wherein the pruning strategy is that when the average utility boundary of one item set is greater than the high average utility threshold value of the database, the item set can continue to expand the item sets.

Optionally, the above apparatus may further include:

and the list storage unit is used for storing the average utility list of the at least one 1-item set used for expanding the item set as the average utility list of the at least one 1-item set corresponding to the database and having the item set expansion condition after the item set mining subunit determines the high average utility item set in the database.

Optionally, in the above embodiment, the safety value obtaining unit may include:

a reference value obtaining subunit, configured to obtain an initial total utility value of the database, a first preset percentage, and a second preset percentage, where the first preset percentage is a set threshold according to which a high average utility threshold of the database is determined, and the second preset percentage is a set threshold according to which a low average utility threshold of the database is determined;

and the safety value calculation unit is used for determining the utility safety value corresponding to the original database according to the total utility value, the first preset percentage and the second preset percentage of the original database.

In yet another aspect, the present application further provides a computer device. For example, referring to fig. 11, a schematic diagram of a composition structure of the computer device of an embodiment of the present application is shown; as shown in fig. 11, the computer device 1100 may include: a processor 1101 and a memory 1102.

In this embodiment, the processor 1101 may be a Central Processing Unit (CPU), an application specific integrated circuit, a digital signal processor or other programmable logic device.

The processor may call a program stored in the memory 1102, and in particular, the processor may perform the operations performed on the computer device side in the embodiments of fig. 3-9 above.

The memory 1102 is used for storing one or more programs, which may include program codes including computer operation instructions, and in this embodiment, the memory stores at least the programs for implementing the following functions:

In one possible implementation, the memory 1102 may include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as an image playing function, etc.), and the like; the storage data area may store data created according to the use of the computer, such as user data and audio data, etc.

Further, the memory 1102 may include high speed random access memory and may also include non-volatile memory.

The computer device may further include a communication interface 1103, an input unit 1104, a display 1105, and a communication bus 1106. The processor 1101, the memory 1102, the communication interface 1103, the input unit 1104, and the display 1105 all communicate with each other via the communication bus 1106.

The display 1105 includes a display panel, such as a touch display panel; the input unit 1104 may be a touch sensing unit, a keyboard, or the like.

Of course, the computer device structure shown in fig. 11 does not constitute a limitation of the computer device in the embodiment of the present application, and in practical applications, the computer device may include more or fewer components than those shown in fig. 11, or some components may be combined.

In another aspect, the present application further provides a storage medium, in which a computer program is stored, and when the computer program is loaded and executed by a processor, the method for mining a high average utility item set as described in any one of the above embodiments is implemented.

It should be noted that, in the present specification, the embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other. Since the device embodiments are basically similar to the method embodiments, their description is relatively brief, and for the relevant points, reference may be made to the corresponding description of the method embodiments.

Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

The foregoing is only a preferred embodiment of the present invention, and it should be noted that it is obvious to those skilled in the art that various modifications and improvements can be made without departing from the principle of the present invention, and these modifications and improvements should also be considered as the protection scope of the present invention.

Claims

1. A high average utility item set mining method is characterized by comprising the following steps:

obtaining a utility safety value corresponding to an initial total utility value of the database, wherein the initial total utility value is the total utility value of the database before all transactions are inserted into the database, the utility safety value is determined according to the initial total utility value of the database, a first preset percentage and a second preset percentage, the first preset percentage is a set threshold according to which a high average utility threshold of the database is determined, the second preset percentage is a set threshold according to which a low average utility threshold of the database is determined, and the high average utility threshold is greater than the low average utility threshold;

2. The high average utility item set mining method of claim 1, further comprising:

when the total utility value of all the transactions is not less than the utility safety value, determining all 1-item sets contained in the database according to all the transactions in the database, and determining a high average utility item set in the database based on all the determined 1-item sets.

3. The method for mining high average utility items according to claim 1, wherein the determining the high average utility item set in the database according to the average utility list of each 1-item set in the data set and the average utility list of at least one 1-item set with item set extension condition in the original database comprises:

merging the average utility list of each 1-item set in the data set and the average utility list of at least one 1-item set with item set expansion conditions in the original database to obtain the average utility list of at least one 1-item set used for expanding the item sets in the database;

determining a high average utility item set in the database based on the average utility list of the at least one 1-item set used to expand the item set.

4. The high average utility item set mining method of claim 3, wherein the average utility list for each of the 1-item sets further comprises: a maximum utility value for each of the target transactions;

the determining a high average utility item set in the database based on the average utility list of the at least one 1-item set used to expand the item set comprises:

determining a high average utility threshold of the database at the current moment;

selecting a 1-item set with high average utility from the at least one 1-item set for the extended item set according to the high average utility threshold of the database and the average utility list of the at least one 1-item set for the extended item set;

for each 1-item set used for expanding the item set, according to a preset pruning strategy, a high average utility threshold value of the database and a maximum utility value of each target transaction recorded in an average utility list of the 1-item set, and expanding the item set by taking the 1-item set as a reference, determining the item set with high average utility from the expanded item set, wherein the pruning strategy is that when an average utility boundary of one item set is greater than the high average utility threshold value of the database, the item set can continue to expand the item set.

5. The high average utility item set mining method of claim 1, wherein the determining the total utility value of all the transactions cumulatively inserted into the database before the current time according to at least one transaction included in the data set which has been inserted into the database and has not been mined comprises:

when it is detected that a data set containing at least one transaction is inserted into the database, determining the total utility value of all the transactions cumulatively inserted into the database before the current time.
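The bookkeeping in claim 5 can be sketched as a small tracker that updates the accumulated total whenever a data set is inserted. The class name `InsertionTracker`, the transaction layout (a dict from item to utility), and the safety value of 40 are all hypothetical; how the utility safety value is derived is not shown here.

```python
class InsertionTracker:
    """Tracks the total utility of transactions inserted so far (illustrative)."""

    def __init__(self, safety_value):
        self.safety_value = safety_value  # assumed to be computed elsewhere
        self.accumulated_utility = 0

    def on_insert(self, data_set):
        """data_set: list of transactions, each a dict item -> utility."""
        self.accumulated_utility += sum(sum(t.values()) for t in data_set)
        return self.accumulated_utility

    def below_safety_value(self):
        # While this holds, the stored lists of the original database can be reused.
        return self.accumulated_utility < self.safety_value


tracker = InsertionTracker(safety_value=40)
first_total = tracker.on_insert([{"a": 5, "b": 8}, {"c": 6}])
second_total = tracker.on_insert([{"a": 3, "c": 4}])
```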

6. The high average utility item set mining method of claim 3, further comprising, after said determining a high average utility item set in said database:

storing the average utility list of the at least one 1-item set used for expanding the item set as the average utility list of the at least one 1-item set with item set expansion conditions corresponding to the database.

7. A high average utility item set mining device, comprising:

a safety value obtaining unit, configured to obtain a utility safety value corresponding to an initial total utility value of the database, where the initial total utility value is a total utility value of the database before all transactions are inserted into the database, the utility safety value is determined according to the initial total utility value of the database, a first preset percentage, and a second preset percentage, the first preset percentage is a set threshold according to which a high average utility threshold of the database is determined, the second preset percentage is a set threshold according to which a low average utility threshold of the database is determined, and the high average utility threshold is greater than the low average utility threshold;

8. The high average utility item set mining apparatus of claim 7, further comprising:

9. A computer device, comprising:

a processor and a memory;

wherein the processor is configured to execute a program stored in the memory;

the memory is configured to store the program, the program being used at least to: