CN114691749B - Method for parallel incremental mining of frequent item sets based on sliding window - Google Patents

Method for parallel incremental mining of frequent item sets based on sliding window

Info

Publication number
CN114691749B
CN114691749B (application CN202210077060.8A)
Authority
CN
China
Prior art keywords
item
frequent
data
data set
sets
Prior art date
Legal status
Active
Application number
CN202210077060.8A
Other languages
Chinese (zh)
Other versions
CN114691749A (en)
Inventor
马汉达
方伟
Current Assignee
Jiangsu University
Original Assignee
Jiangsu University
Priority date
Filing date
Publication date
Application filed by Jiangsu University
Priority to CN202210077060.8A priority Critical patent/CN114691749B/en
Publication of CN114691749A publication Critical patent/CN114691749A/en
Application granted granted Critical
Publication of CN114691749B publication Critical patent/CN114691749B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2216/00 Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
    • G06F2216/03 Data mining

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the field of data processing and analysis, and specifically relates to a method for parallel incremental mining of frequent item sets based on a sliding window, addressing the low operating efficiency of existing parallel incremental mining methods in big data environments. The main implementation steps are: acquire and preprocess the data set; divide it into several incremental batch data sets; mine the frequent item sets and quasi-frequent item sets of a single batch data set; if preceding batch data sets exist in the current window, merge the mining results of the current batch with those of the preceding batches; otherwise persist the incrementally updated frequent and quasi-frequent item sets in the current window and output the frequent item sets; then continue inputting incremental data sets and repeat the incremental mining steps. By introducing techniques such as the sliding window, the invention speeds up the decision of whether an item set is frequent, and by combining Spark parallel computing with Hadoop distributed storage it achieves good mining efficiency.

Description

Method for parallel incremental mining of frequent item sets based on sliding window
Technical Field
The invention belongs to the field of data processing analysis, and particularly relates to a method for parallel incremental mining of frequent item sets based on a sliding window.
Background
Association rules are an important research area in data mining, aimed at finding frequent patterns in a data set. Association rule mining is widely used in shopping recommendation, website click analysis, e-commerce, finance, medical diagnosis, and other fields. Static association rule mining discovers frequent item sets over a fixed data set with a fixed support threshold. In practice, however, the support threshold and the data set change most of the time: incremental association rule mining is frequent pattern mining under a growing data set, and incremental mining of frequent item sets is its main component. When facing large-scale data sets, reading the whole data set into memory at once is often infeasible, since it requires large memory space, incurs huge I/O overhead, and offers poor scalability and performance.
One remedy is to read the data into memory in batches and mine frequent item sets incrementally, but this approach depends heavily on the historical data set when recomputing the candidate item sets after each incremental update, and as historical data keeps accumulating, the task of scanning the whole enlarged data set becomes exceedingly heavy. Other methods accelerate whole-data incremental mining through the Hadoop and Spark distributed computing frameworks. In addition, when frequent item sets are updated incrementally, if the pattern tree is constructed with items ordered by support count in the traditional way, the internal ordering within a mined frequent item set is no longer guaranteed once the supports of its items change, which makes it difficult to match incremental item sets against historical item sets during the update.
Disclosure of Invention
To address these shortcomings of the prior art, the invention provides a method for parallel incremental mining of frequent item sets based on a sliding window, which optimizes the data structures and reduces data scanning work while further improving efficiency on large-scale incremental data by combining a parallelized computing framework.
The technical scheme of the invention is as follows:
a method for parallel incremental mining of frequent item sets based on sliding windows specifically comprises the following steps:
step 1, acquiring a data set;
step 2, data preprocessing is carried out on the acquired data set;
step 3, dividing the data set into n incremental data sets DB_k;
step 4, inputting the divided data sets DB_k into the sliding window batch by batch for incremental mining;
step 5, mining the frequent item sets and quasi-frequent item sets of the current single-batch data set DB_k;
step 6, treating the current batch data set DB_k as the increment of the preceding batch data sets DB_1…k-1, and merging the frequent and quasi-frequent item sets mined from the current batch with those of the preceding batch data sets in the sliding window;
and step 7, acquiring all frequent item sets in the updated current sliding window.
As a further preferable scheme of the method for parallel incremental mining of frequent item sets based on a sliding window of the present invention, in step 2, the data preprocessing comprises numerical encoding of the transaction items in the transaction data set and removal of dirty data.
As a further preferable scheme of the method of the present invention, in step 3, the data set is divided into n parts according to the total number of transactions in the data set, each part denoted DB_k, k ∈ [1, n]; since each part contains the same number of transaction records but the records contain different numbers of transaction items, the resulting data sets DB_k are not exactly equal in size. As a further preferable scheme of the method of the present invention, in step 4, the following definitions are given:
Definition 4.1: the sliding window is a fixed-size window containing m batches of data sets; it behaves like a fixed-length queue of length m, entering at one end and leaving at the other; only m batches of data sets are retained in the window, and when the (m+1)-th incremental batch is input, the 1st batch of data set at the other end of the window is removed, ensuring that the window always holds exactly m batches.
Definition 4.2: incremental mining in the sliding window means that each single-batch data set DB_k input into the window is mined incrementally on the basis of its preceding m-1 batch data sets.
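Definitions 4.1 and 4.2 describe the window as a fixed-length queue. As an illustration only (not part of the patent), the eviction behaviour can be sketched in a few lines of plain Python, with batch labels standing in for whole mined batch results:

```python
from collections import deque

# Sketch of Definition 4.1: a window holding at most m batches; when the
# (m+1)-th batch arrives, the 1st batch at the other end is removed.
# Labels like "DB1" stand in for real batch data sets here.
class SlidingWindow:
    def __init__(self, m):
        self.batches = deque(maxlen=m)  # deque with maxlen drops the oldest entry

    def push(self, batch):
        # Remember the batch about to fall out of the window, if any.
        evicted = self.batches[0] if len(self.batches) == self.batches.maxlen else None
        self.batches.append(batch)
        return evicted

w = SlidingWindow(m=3)
assert w.push("DB1") is None
assert w.push("DB2") is None
assert w.push("DB3") is None
assert w.push("DB4") == "DB1"          # the (m+1)-th input evicts the 1st batch
assert list(w.batches) == ["DB2", "DB3", "DB4"]
```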
As a further preferable scheme of the method for parallel incremental mining of frequent item sets based on sliding window of the present invention, in step 5, the following definitions and steps are included:
Definition 5.1: a frequent item set is an item set whose support sup(items) exceeds the frequent minimum support minsup;
Definition 5.2: a quasi-frequent item set is an item set whose support sup(items) exceeds the quasi-frequent minimum support semisup but is smaller than the frequent minimum support minsup; semisup < minsup must hold;
step 5.3, taking the single-batch data set DB_k as the current window input and reading the data set through textFile;
step 5.4, counting the frequent 1-item sets and quasi-frequent 1-item sets: a flatMap operation maps each transaction record tran(item1, item2, …, itemq) in the data set to 2-tuples (itemq, 1); a reduceByKey aggregation then accumulates the count itemcount of each 1-item set, producing new 2-tuples (itemq, itemcount); finally, a filter operation selects the 1-item sets whose count satisfies semisup*|DB_k| <= itemcount < minsup*|DB_k| as the quasi-frequent 1-item sets L'_D1, and another filter operation selects those with itemcount >= minsup*|DB_k| as the frequent 1-item sets L_D1;
step 5.5, merging the frequent and quasi-frequent 1-item sets into L_s1 = L_D1 + L'_D1, sorting them in the dictionary order of the items, and broadcasting the result to each computing node for use;
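Steps 5.4 and 5.5 amount to a count-filter-merge pass over the batch. The patent performs it with Spark's flatMap/reduceByKey/filter operators; the plain-Python sketch below reproduces only the logic, with an illustrative toy data set and thresholds:

```python
from collections import Counter

# Sketch of steps 5.4-5.5 in plain Python (the patent uses Spark operators).
# minsup and semisup are fractions of the batch size |DB_k|, with semisup < minsup.
def one_itemsets(transactions, minsup, semisup):
    n = len(transactions)
    # Step 5.4: count each 1-item set (flatMap to (item, 1), then reduceByKey).
    counts = Counter(item for tran in transactions for item in set(tran))
    # filter: frequent 1-item sets L_D1 and quasi-frequent 1-item sets L'_D1.
    frequent = {i: c for i, c in counts.items() if c >= minsup * n}
    quasi = {i: c for i, c in counts.items() if semisup * n <= c < minsup * n}
    # Step 5.5: merge into L_s1 and sort in dictionary order (Spark then broadcasts it).
    ls1 = sorted(set(frequent) | set(quasi))
    return frequent, quasi, ls1

db = [["a", "b", "c"], ["a", "c"], ["a", "d"], ["b", "c"]]
freq, quasi, ls1 = one_itemsets(db, minsup=0.6, semisup=0.4)
assert freq == {"a": 3, "c": 3}     # count >= 0.6 * 4 = 2.4
assert quasi == {"b": 2}            # 1.6 <= count < 2.4
assert ls1 == ["a", "b", "c"]       # merged and dictionary-ordered
```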
step 5.6, pruning the transaction data set: according to the counted 1-item sets L_s1, the data set DB_k is rescanned, and every transaction record item not in the 1-item sets L_s1 is removed;
step 5.7, grouping the transaction records by their dictionary-order prefix item: the pruned transaction set DB_k is reread, the items in each transaction record are sorted in dictionary order, and a flatMap operation is executed on each record, enumerating it into one record per suffix; for example, (item1, item2, item3) is enumerated into the three same-suffix records (item1, item2, item3), (item2, item3), (item3); a groupByKey operation then aggregates the transaction records sharing the same prefix item into the same group;
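The suffix enumeration and prefix grouping of step 5.7 can likewise be sketched without Spark; the hypothetical helper below (not from the patent) combines the step 5.6 pruning with the flatMap-plus-groupByKey pattern:

```python
from collections import defaultdict

# Sketch of steps 5.6-5.7: drop items not in L_s1, sort each record in
# dictionary order, enumerate one record per suffix, and group the records
# by their prefix (first) item. Plain Python stands in for flatMap/groupByKey.
def group_by_prefix(transactions, ls1):
    keep = set(ls1)
    groups = defaultdict(list)
    for tran in transactions:
        items = sorted(set(tran) & keep)       # step 5.6 pruning + dictionary order
        for i in range(len(items)):            # enumerate every suffix record
            suffix = tuple(items[i:])
            groups[suffix[0]].append(suffix)   # the prefix item is the group key
    return dict(groups)

g = group_by_prefix([["item1", "item2", "item3"]], ["item1", "item2", "item3"])
assert g["item1"] == [("item1", "item2", "item3")]
assert g["item2"] == [("item2", "item3")]
assert g["item3"] == [("item3",)]
```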
step 5.8, performing, through a foreach operation, the mining of the frequent item sets L_D and quasi-frequent item sets L'_D on the pattern tree Fp-tree constructed from the transaction records in each prefix group; the mining process is the same as in the FP-growth algorithm; during the construction of the pattern tree, the 1-item sets L_s1 broadcast in step 5.5 are read and used as the pattern tree's header table;
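Step 5.8 mines each prefix group with an FP-tree, exactly as in FP-growth. A full FP-growth implementation is beyond a short sketch, so the stand-in below obtains the same frequent and quasi-frequent item sets for a small group by direct subset counting; it illustrates only the two-threshold output of step 5.8, not the patent's tree-based mining, and the records and thresholds (absolute counts here) are illustrative:

```python
from collections import Counter
from itertools import combinations

# Illustrative stand-in for step 5.8: enumerate every subset of every record
# in the prefix group and count it, then split the counts by the two
# thresholds. Equivalent in result to FP-growth on a small group, but not
# the patent's algorithm.
def mine_group(records, minsup_count, semisup_count):
    counts = Counter()
    for rec in records:
        for size in range(1, len(rec) + 1):
            for combo in combinations(rec, size):
                counts[combo] += 1
    frequent = {s: c for s, c in counts.items() if c >= minsup_count}
    quasi = {s: c for s, c in counts.items() if semisup_count <= c < minsup_count}
    return frequent, quasi

recs = [("a", "b"), ("a", "b"), ("a", "c")]
freq, quasi = mine_group(recs, minsup_count=3, semisup_count=2)
assert freq == {("a",): 3}
assert quasi == {("b",): 2, ("a", "b"): 2}
```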
as a further preferable scheme of the method for parallel incremental mining of frequent item sets based on sliding window of the present invention, in step 6, the method comprises the following steps:
step 6.1, on each computing node, according to the prefix item group, reading the frequent item sets PL_D and quasi-frequent item sets PL'_D mined from the preceding batch data sets DB_1…k-1 in the window to which the current prefix belongs, merging the two into PL_s = PL_D + PL'_D, and constructing the item set prefix tree Item-PLTree by sharing prefix paths;
step 6.2, merging the incremental frequent item sets L_D and quasi-frequent item sets L'_D mined in step 5 into the item set prefix tree Item-PLTree along shared prefix paths;
step 6.3, traversing the item set prefix tree Item-PLTree in preorder and pruning the node branches whose support count is smaller than the quasi-frequent support semisup;
step 6.4, traversing the item set prefix tree in preorder, and for each node whose support count is greater than or equal to the frequent support minsup, outputting the path from the root node to that node; these paths are the incrementally updated frequent item sets WL_D of all batch data sets in the current window;
step 6.5, persisting the frequent item sets WL_D and quasi-frequent item sets WL'_D stored in the per-group item set prefix trees in the current window;
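Steps 6.1 to 6.4 hinge on the item set prefix tree. The sketch below (plain Python; the class and function names are illustrative, not the patent's, and counts are absolute) merges itemset-to-count mappings along shared prefix paths, prunes branches below the quasi-frequent count, and collects the frequent item sets by preorder traversal:

```python
# Sketch of steps 6.1-6.4: an item-set prefix tree in the spirit of the
# patent's Item-PLTree. Each node's count is the support of the itemset
# spelled by the path from the root to that node.
class Node:
    def __init__(self):
        self.children = {}
        self.count = 0

def insert(root, itemsets):
    for items, count in itemsets.items():
        node = root
        for item in items:              # itemsets share prefix paths
            node = node.children.setdefault(item, Node())
        node.count += count             # merging adds the increment's count

def prune(node, semisup_count):
    # Simplified pruning: only childless branches below the threshold are cut.
    for item in list(node.children):
        child = node.children[item]
        prune(child, semisup_count)
        if child.count < semisup_count and not child.children:
            del node.children[item]

def frequent_itemsets(node, minsup_count, path=()):
    result = {}
    for item, child in sorted(node.children.items()):   # preorder walk
        new_path = path + (item,)
        if child.count >= minsup_count:
            result[new_path] = child.count              # root-to-node path = itemset
        result.update(frequent_itemsets(child, minsup_count, new_path))
    return result

root = Node()
insert(root, {("a",): 3, ("a", "b"): 2})                 # preceding batches' result
insert(root, {("a",): 2, ("a", "b"): 1, ("c",): 1})      # current batch increment
prune(root, semisup_count=2)                             # step 6.3: drops ("c",)
wl = frequent_itemsets(root, minsup_count=3)             # step 6.4
assert wl == {("a",): 5, ("a", "b"): 3}
```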
Compared with the prior art, the invention has the following advantages:
On the basis of the traditional parallel incremental mining of frequent item sets, the invention introduces the sliding window, dictionary-order prefix grouping of the transaction set, prefix tree updating of the item sets, and persistence of quasi-frequent item sets, which weakens the dependence of the incrementally updated candidate item sets on rescanning the original data set and greatly speeds up deciding whether these candidate item sets are frequent; at the same time, by combining Spark parallel computing with Hadoop distributed storage, the method achieves good mining efficiency and scalability.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention;
FIG. 2 is a schematic diagram of an item set prefix tree constructed during an item set update of the present invention;
fig. 3 is a schematic view of a sliding window according to the present invention.
FIG. 4 is a graph of run-time simulation results comparing the incremental mining of different methods on batched incremental data sets.
FIG. 5 is a graph of run-time simulation results of the method on batched incremental data sets with different numbers of computing nodes.
Detailed Description
As shown in fig. 1, the implementation steps of the specific technology of the present invention are as follows:
and (3) setting up an environment, namely setting up a distributed environment of Spark 2.4.3 and Hadoop 2.9.2 on a Linux cloud platform, wherein the distributed environment comprises a Master node Master and 5 Slave nodes Slave. Step 1, acquiring a data set webdocs. Dat, wherein the size of the data set webdocs. Dat is 1.37GB;
step 2, preprocessing the data of the acquired data set, including the numerical processing of transaction items in the transaction data set, and removing dirty data;
step 3, dividing the data set into n incremental data sets DB k The data set dividing mode is divided into n parts according to the total number of data set transactions, and each part of data set is marked as DB k ,k∈[1,n]The method comprises the steps of carrying out a first treatment on the surface of the Since the transaction records of each data set have the same number, the transaction items of each transaction record have different numbers, and thus each data set DB is finally obtained k Is not absolutely equal in size; currently, n=4 is adopted, and each data set is about 340MB in size and is uniformly stored on a distributed file system HDFS;
step 4, for the divided data set DB k Incremental excavation is carried out according to the batch input sliding window;
definition 4.1, sliding window is defined as a fixed size window containing m batches of data sets, which behaves like a fixed size queue of length m, one head in and the other head out; only m batches of data sets are reserved in the sliding window, when the increment data set of the (m+1) th batch is input, the 1 st batch of data set at the other end of the window needs to be removed, and only m fixed batches of data sets are ensured in the window.
Definition 4.2 incremental mining in sliding Window is defined as single batch dataset DB in each input Window k The incremental mining is to be performed on the basis of its preamble m-1 batch data sets.
Step 5, excavating the current single batchSecondary data set DB k Frequent item sets and quasi-frequent item sets;
definition 5.1: a frequent item set is an item set whose support sup(items) exceeds the frequent minimum support minsup;
definition 5.2: a quasi-frequent item set is an item set whose support sup(items) exceeds the quasi-frequent minimum support semisup but is smaller than the frequent minimum support minsup; semisup < minsup must hold;
step 5.3, taking the single-batch data set DB_k as the current window input and reading the data set through textFile;
step 5.4, counting the frequent 1-item sets and quasi-frequent 1-item sets: a flatMap operation maps each transaction record tran(item1, item2, …, itemq) in the data set to 2-tuples (itemq, 1); a reduceByKey aggregation then accumulates the count itemcount of each 1-item set, producing new 2-tuples (itemq, itemcount); finally, a filter operation selects the 1-item sets whose count satisfies semisup*|DB_k| <= itemcount < minsup*|DB_k| as the quasi-frequent 1-item sets L'_D1, and another filter operation selects those with itemcount >= minsup*|DB_k| as the frequent 1-item sets L_D1;
step 5.5, merging the frequent and quasi-frequent 1-item sets into L_s1 = L_D1 + L'_D1, sorting them in the dictionary order of the items, and broadcasting the result to each computing node for use;
step 5.6, pruning the transaction data set: according to the counted 1-item sets L_s1, the data set DB_k is rescanned, and every transaction record item not in the 1-item sets L_s1 is removed;
step 5.7, grouping the transaction records by their dictionary-order prefix item: the pruned transaction set DB_k is reread, the items in each transaction record are sorted in dictionary order, and a flatMap operation is executed on each record, enumerating it into one record per suffix; for example, (item1, item2, item3) is enumerated into the three same-suffix records (item1, item2, item3), (item2, item3), (item3); a groupByKey operation then aggregates the transaction records sharing the same prefix item into the same group;
step 5.8, performing, through a foreach operation, the mining of the frequent item sets L_D and quasi-frequent item sets L'_D on the pattern tree Fp-tree constructed from the transaction records in each prefix group; the mining process is the same as in the FP-growth algorithm; during the construction of the pattern tree, the 1-item sets L_s1 broadcast in step 5.5 are read and used as the pattern tree's header table;
step 6, treating the current batch data set DB_k as the increment of the preceding batch data sets DB_1…k-1, and merging the frequent and quasi-frequent item sets mined from the current batch with those of the preceding batch data sets in the sliding window;
step 6.1, on each computing node, according to the prefix item group, reading the frequent item sets PL_D and quasi-frequent item sets PL'_D mined from the preceding batch data sets DB_1…k-1 in the window to which the current prefix belongs, merging the two into PL_s = PL_D + PL'_D, and constructing the item set prefix tree Item-PLTree by sharing prefix paths;
step 6.2, merging the incremental frequent item sets L_D and quasi-frequent item sets L'_D mined in step 5 into the item set prefix tree Item-PLTree along shared prefix paths;
step 6.3, traversing the item set prefix tree Item-PLTree in preorder and pruning the node branches whose support count is smaller than the quasi-frequent support semisup;
step 6.4, traversing the item set prefix tree in preorder, and for each node whose support count is greater than or equal to the frequent support minsup, outputting the path from the root node to that node; these paths are the incrementally updated frequent item sets WL_D of all batch data sets in the current window;
step 6.5, persisting the frequent item sets WL_D and quasi-frequent item sets WL'_D stored in the per-group item set prefix trees in the current window;
step 7, acquiring all the frequent item sets WL_D in the updated current sliding window.
Simulation results:
As can be seen from the results of FIG. 4, as incremental data sets DB_k keep arriving, the historical data of the traditional incremental mining method grows, and the total data set that must be rescanned to confirm the supports of the candidate sets after each incremental update grows with it, so the traditional method spends more and more time on this confirmation, as shown by the rising curve in the figure. In the present method, owing to the strategies of the sliding window, the quasi-frequent item sets, and the prefix-group updating, the dependence of the incremental candidate sets on the historical data set decreases as incremental data sets DB_k arrive, and the run time tends to stabilize rather than increase linearly, as shown by the declining curve.
From the results of FIG. 5, it can be seen that on the batched incremental data sets the running time of the method tends to decrease as the number of computing nodes increases, which demonstrates the effectiveness and scalability of the method's distributed parallel design.

Claims (3)

1. The method for parallel incremental mining of frequent item sets based on the sliding window is characterized by comprising the following steps:
step 1, acquiring a data set;
step 2, data preprocessing is carried out on the acquired data set;
step 3, dividing the data set into n incremental data sets DB_k;
step 4, inputting the divided data sets DB_k into the sliding window batch by batch for incremental mining;
in step 4, there are the following definitions:
definition 4.1: the sliding window is a fixed-size window containing m batches of data sets; it behaves like a fixed-length queue of length m, entering at one end and leaving at the other; only m batches of data sets are retained in the window, and when the (m+1)-th incremental batch is input, the 1st batch of data set at the other end of the window is removed, ensuring that the window always holds exactly m batches;
definition 4.2: incremental mining in the sliding window means that each single-batch data set DB_k input into the window is mined incrementally on the basis of its preceding m-1 batch data sets;
step 5, mining the frequent item sets and quasi-frequent item sets of the current single-batch data set DB_k;
in step 5, the following definitions and steps are included:
definition 5.1: a frequent item set is an item set whose support sup(items) exceeds the frequent minimum support minsup;
definition 5.2: a quasi-frequent item set is an item set whose support sup(items) exceeds the quasi-frequent minimum support semisup but is smaller than the frequent minimum support minsup; semisup < minsup must hold;
step 5.3, taking the single-batch data set DB_k as the current window input and reading the data set through textFile;
step 5.4, counting the frequent 1-item sets and quasi-frequent 1-item sets: a flatMap operation maps each transaction record tran(item1, item2, …, itemq) in the data set to 2-tuples (itemq, 1); a reduceByKey aggregation then accumulates the count itemcount of each 1-item set, producing new 2-tuples (itemq, itemcount); finally, a filter operation selects the 1-item sets whose count satisfies semisup*|DB_k| <= itemcount < minsup*|DB_k| as the quasi-frequent 1-item sets L'_D1, and another filter operation selects those with itemcount >= minsup*|DB_k| as the frequent 1-item sets L_D1;
step 5.5, merging the frequent and quasi-frequent 1-item sets into L_s1 = L_D1 + L'_D1, sorting them in the dictionary order of the items, and broadcasting the result to each computing node for use;
step 5.6, pruning the transaction data set: according to the counted 1-item sets L_s1, the data set DB_k is rescanned, and every transaction record item not in the 1-item sets L_s1 is removed;
step 5.7, grouping the transaction records by their dictionary-order prefix item: the pruned transaction set DB_k is reread, the items in each transaction record are sorted in dictionary order, and a flatMap operation is executed on each record, enumerating it into one record per suffix, for example (item1, item2, item3) can be enumerated as the three same-suffix records (item1, item2, item3), (item2, item3), (item3); a groupByKey operation then aggregates the transaction records sharing the same prefix item into the same group;
step 5.8, performing, through a foreach operation, the mining of the frequent item sets L_D and quasi-frequent item sets L'_D on the pattern tree Fp-tree constructed from the transaction records in each prefix group; the mining process is the same as in the FP-growth algorithm; during the construction of the pattern tree, the 1-item sets L_s1 broadcast in step 5.5 are read and used as the pattern tree's header table;
step 6, treating the current batch data set DB_k as the increment of the preceding batch data sets DB_1…k-1, and merging the frequent and quasi-frequent item sets mined from the current batch with those of the preceding batch data sets in the sliding window;
in step 6, the method comprises the following steps:
step 6.1, on each computing node, according to the prefix item group, reading the frequent item sets PL_D and quasi-frequent item sets PL'_D mined from the preceding batch data sets DB_1…k-1 in the window to which the current prefix belongs, merging the two into PL_s = PL_D + PL'_D, and constructing the item set prefix tree Item-PLTree by sharing prefix paths;
step 6.2, merging the incremental frequent item sets L_D and quasi-frequent item sets L'_D mined in step 5 into the item set prefix tree Item-PLTree along shared prefix paths;
step 6.3, traversing the item set prefix tree Item-PLTree in preorder and pruning the node branches whose support count is smaller than the quasi-frequent support semisup;
step 6.4, traversing the item set prefix tree in preorder, and for each node whose support count is greater than or equal to the frequent support minsup, outputting the path from the root node to that node; these paths are the incrementally updated frequent item sets WL_D of all batch data sets in the current window;
step 6.5, persisting the frequent item sets WL_D and quasi-frequent item sets WL'_D stored in the per-group item set prefix trees in the current window;
and step 7, acquiring all frequent item sets in the updated current sliding window.
2. The method for parallel incremental mining of frequent item sets based on sliding windows according to claim 1, wherein: in step 2, the data preprocessing includes numerical encoding of the transaction items in the transaction data set and removal of dirty data.
3. The method for parallel incremental mining of frequent item sets based on sliding windows according to claim 1, wherein: in step 3, the data set is divided into n parts based on the total number of transactions in the data set, each part denoted DB_k, k ∈ [1, n]; since each part contains the same number of transaction records but the records contain different numbers of transaction items, the resulting data sets DB_k are not exactly equal in size.
CN202210077060.8A 2022-05-11 2022-05-11 Method for parallel incremental mining of frequent item sets based on sliding window Active CN114691749B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210077060.8A CN114691749B (en) 2022-05-11 2022-05-11 Method for parallel incremental mining of frequent item sets based on sliding window

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210077060.8A CN114691749B (en) 2022-05-11 2022-05-11 Method for parallel incremental mining of frequent item sets based on sliding window

Publications (2)

Publication Number Publication Date
CN114691749A (en) 2022-07-01
CN114691749B (en) 2024-03-19

Family

ID=82137948

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210077060.8A Active CN114691749B (en) 2022-05-11 2022-05-11 Method for parallel incremental mining of frequent item sets based on sliding window

Country Status (1)

Country Link
CN (1) CN114691749B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20110096302A (en) * 2010-02-22 2011-08-30 숭실대학교산학협력단 Apparatus and method for association rule mining using frequent pattern-tree for incremental data processing
CN103984723A (en) * 2014-05-15 2014-08-13 江苏易酒在线电子商务有限公司 Method used for updating data mining for frequent item by incremental data
CN107391621A (en) * 2017-07-06 2017-11-24 南京邮电大学 A kind of parallel association rule increment updating method based on Spark
CN109471877A (en) * 2018-11-01 2019-03-15 中南大学 Increment type tense frequent mode P mining method towards flow data
CN110222090A (en) * 2019-06-03 2019-09-10 哈尔滨工业大学(威海) A kind of mass data Mining Frequent Itemsets


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Parallel incremental updating algorithm for association rules based on MapReduce; Cheng Guang; Wang Xiaofeng; Computer Engineering (Issue 02); 27-31+38 *

Also Published As

Publication number Publication date
CN114691749A (en) 2022-07-01

Similar Documents

Publication Publication Date Title
Holley et al. Bifrost: highly parallel construction and indexing of colored and compacted de Bruijn graphs
US11232085B2 (en) Outlier detection for streaming data
Davidson et al. Efficient parallel merge sort for fixed and variable length keys
US10049134B2 (en) Method and system for processing queries over datasets stored using hierarchical data structures
US10042914B2 (en) Database index for constructing large scale data level of details
US20140214334A1 (en) Efficient genomic read alignment in an in-memory database
US8935233B2 (en) Approximate index in relational databases
CN103761236A (en) Incremental frequent pattern increase data mining method
CN114168608B (en) Data processing system for updating knowledge graph
Xu et al. Distributed maximal clique computation and management
CN112925821B (en) MapReduce-based parallel frequent item set incremental data mining method
Wheatman et al. A parallel packed memory array to store dynamic graphs
Gazzarri et al. End-to-end task based parallelization for entity resolution on dynamic data
Kim et al. Real-time stream data mining based on CanTree and Gtree
Lee et al. Efficient approach of sliding window-based high average-utility pattern mining with list structures
CN113761390B (en) Method and system for analyzing attribute intimacy
CN106599122B (en) Parallel frequent closed sequence mining method based on vertical decomposition
CN108334532B (en) Spark-based Eclat parallelization method, system and device
Singh et al. High average-utility itemsets mining: a survey
US20190073195A1 (en) Computing device sort function
CN114691749B (en) Method for parallel incremental mining of frequent item sets based on sliding window
CN108596390B (en) Method for solving vehicle path problem
Wang et al. Improving online aggregation performance for skewed data distribution
Ceccarello et al. Distributed graph diameter approximation
Nowakiewicz et al. BIPie: fast selection and aggregation on encoded data using operator specialization

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant