CN110489411B

CN110489411B - Association rule mining method based on effective value storage and operation mode

Info

Publication number: CN110489411B
Application number: CN201910624715.7A
Authority: CN
Inventors: 任晓强; 李梦男
Original assignee: Qilu University of Technology
Current assignee: Qilu University of Technology
Priority date: 2019-07-11
Filing date: 2019-07-11
Publication date: 2023-08-22
Anticipated expiration: 2039-07-11
Also published as: CN110489411A

Abstract

The invention discloses an association rule mining method based on effective value storage and operation modes, belonging to the field of data mining. The technical problem to be solved is how to design a storage mode suited to a transaction database that contains many null ("0") values, so that storage is effectively saved, and how to design a corresponding mining algorithm, so that the efficiency of the algorithm is not reduced while storage is saved. The technical scheme comprises the following steps: S1, storage based on a set of effective values: a storage set is set up, and the storage set stores the index positions of the transactions in which a frequent single item occurs, namely the position values of the effective value 1; S2, connection operation based on the effective value storage structure: a connection operation is performed on two storage sets to generate candidate item sets. The specific steps of S2 are: S201, searching for frequent single-item sets; S202, taking the intersection of two storage sets to obtain a new set that stores a two-item set; S203, producing frequent item sets.

Description

Association rule mining method based on effective value storage and operation mode

Technical Field

The invention relates to the field of data mining, in particular to an association rule mining method based on effective value storage and operation modes.

Background

The association rule problem was first proposed by Agrawal et al. in 1993 to find interesting associations between data in large data sets. At present, association rule mining is widely applied in many fields. For example, transaction data of supermarket commodities is analyzed to guide consumers' shopping and increase sales volume; and the content and news browsed by a user are analyzed to mine the user's news browsing patterns and how they change, so that news the user may be interested in can be recommended. Analyzing data with an effective association rule algorithm helps a decision maker make better decisions and obtain better returns. Therefore, many scholars have conducted intensive research on, and improvements to, association rule algorithms.

The Apriori algorithm is one of the most classical algorithms for association rule mining. Its major drawbacks are that it generates a large number of candidate sets and must scan the database multiple times. Huang Ruiqiong et al. proposed the MBSA (Map-based BitSet Association Rule) algorithm for remote sensing image association rule mining, which maps the data set into bit sets and then uses logical AND operations on the bit sets to improve mining efficiency, scanning the database only once and generating no candidate set. The MBSA algorithm has the following problems: when a large data set is stored, the "0" values in the bitmap waste memory space, and the frequent decompression that is required makes its time efficiency relatively low.

In summary, the existing algorithms have the following problems:

(1) The database is accessed frequently, so operating efficiency is low;

(2) When a large amount of data is stored, the bitmap also stores the "0" values, which occupies excessive memory space and wastes it;

(3) Frequent decompression is required during the connection operation.

Disclosure of Invention

The invention aims to provide an association rule mining method based on effective value storage and operation modes, so as to solve the problems of how to design a storage mode suited to a transaction database that contains many null ("0") values, so that storage is effectively saved, and how to design a corresponding mining algorithm, so that the efficiency of the algorithm is not reduced while storage is saved.

The technical task of the invention is realized in the following way: an association rule mining method based on effective value storage and operation modes, comprising the following steps:

S1, storage based on a set of effective values: a storage set is set up, and the storage set stores the index positions of the transactions in which a frequent single item occurs, namely the position values of the effective value 1;

S2, connection operation based on the effective value storage structure: a connection operation is performed on two storage sets to generate candidate item sets.

Preferably, the specific steps of the storage based on a set of effective values in step S1 are as follows:

S101, scan the transaction database once, and set up a storage set for each frequent single item in the transaction set;

S102, store in the storage set the index positions of the transactions in which the frequent single item occurs, namely the position values of the effective value 1. Because the data is held in the storage sets, pruning and connection operations act directly on the sets, which avoids repeated scanning of the database.
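Steps S101 and S102 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `build_storage_sets`, the use of 1-based transaction indices, the fractional `min_support` threshold, and the toy data are all assumptions made for the example.

```python
from collections import defaultdict

def build_storage_sets(transactions, min_support):
    """Scan the transactions once and keep, for each frequent single item,
    the ascending list of 1-based indices of the transactions containing it."""
    positions = defaultdict(list)
    for tid, items in enumerate(transactions, start=1):
        for item in set(items):              # ignore duplicates inside one transaction
            positions[item].append(tid)      # tids are appended in ascending order
    total = len(transactions)
    # Keep only items whose support reaches the (assumed fractional) threshold.
    return {item: tids for item, tids in positions.items()
            if len(tids) / total >= min_support}

# Hypothetical toy database: four transactions over items x, y, z.
toy_db = [["x", "y"], ["x"], ["y", "z"], ["x", "y"]]
print(build_storage_sets(toy_db, min_support=0.5))
```

Only the positions where an item occurs are kept, so an item that is absent from most transactions costs almost nothing to store.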

More preferably, the connection operation based on the effective value storage structure in step S2 specifically includes the following steps:

S201, searching for frequent single-item sets: after the storage sets have been generated from the database, find the frequent single-item sets according to whether the support degree of each storage set reaches the minimum support degree;

S202, taking the intersection of two storage sets to obtain a new set that stores a two-item set: perform the connection operation on the frequent single-item sets, namely perform the intersection operation on the storage sets of the single-item sets, to obtain a new set that stores the two-item set;

S203, producing frequent item sets: connect the storage sets generated in step S1 to obtain two-item sets, judge by the minimum support degree whether each two-item set is a frequent item set, and iteratively produce frequent item sets in turn:

(a) if so, the iterative operation continues in turn, that is, step S203 is executed repeatedly until no further frequent item set is found.

More preferably, in step S2, the principle of selecting the candidate item set by performing the connection operation on two storage sets is as follows: an intersection operation is executed on the set Aarr and the set Barr, both of which are arranged in order. The specific algorithm is as follows:

(1) Set the index value m for traversing the set Aarr to 0, and set the index value n for traversing the set Barr to 0;

(2) Judge the size relation between Aarr.get(m) and Barr.get(n):

(a) if Aarr.get(m) is greater than Barr.get(n), jump to step (3);

(b) if Aarr.get(m) is less than Barr.get(n), jump to step (4);

(c) if Aarr.get(m) is equal to Barr.get(n), jump to step (5);

(3) Let n equal n+1, and determine whether the value of n has gone beyond the last element of the set Barr:

(a) if yes, jump to step (6);

(b) if not, jump to step (2);

(4) Let m equal m+1, and determine whether the value of m has gone beyond the last element of the set Aarr:

(a) if yes, jump to step (6);

(b) if not, jump to step (2);

(5) Add the value at index n in the set Barr (equivalently, the value at index m in the set Aarr) to the candidate item set Sarr, let n equal n+1 and m equal m+1, and determine whether the value of m or n has gone beyond the last element of its set:

(a) if yes, jump to step (6);

(b) if not, jump to step (2);

(6) End the operation. The size of the candidate item set Sarr is the support degree count of the candidate item set.
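The intersection procedure above is a two-pointer merge over two ascending lists. A minimal sketch follows; the function name `intersect_sorted` is hypothetical, Python lists stand in for the sets Aarr and Barr, and the loop condition also guards against empty inputs, a case the written steps do not spell out.

```python
def intersect_sorted(a_arr, b_arr):
    """Intersect two ascending lists of transaction indices."""
    s_arr = []
    m = n = 0                                   # step (1): both indices start at 0
    while m < len(a_arr) and n < len(b_arr):    # stop once either list is exhausted
        if a_arr[m] > b_arr[n]:                 # A value larger: advance n (step 3)
            n += 1
        elif a_arr[m] < b_arr[n]:               # A value smaller: advance m (step 4)
            m += 1
        else:                                   # equal: record and advance both (step 5)
            s_arr.append(a_arr[m])
            m += 1
            n += 1
    return s_arr                                # step (6): len(s_arr) is the support count

common = intersect_sorted([1, 2, 3], [1, 3, 5])
print(common, len(common))
```

Each comparison advances at least one index, so the number of comparisons is at most the combined length of the two lists.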

More preferably, the storage sets are ordered according to a dictionary order, from which the following is known:

when Aarr.get(m) is smaller than Barr.get(n), every value after Barr.get(n) is, by the dictionary order, also larger than Aarr.get(m), so no value after Barr.get(n) is identical to Aarr.get(m). Aarr.get(m) therefore does not need to be compared with the values after Barr.get(n), and the comparison continues with Aarr.get(m+1) against Barr.get(n) and the values after it.

More preferably, following the principle of selecting candidate item sets by performing the connection operation on two storage sets, the specific steps for completing the connection operation based on the effective value storage structure in step S2 are as follows:

(one) Calculate the intersection of the sets Aarr and Barr, and judge whether the two sets share any position value:

(a) if the intersection of the set Aarr and the set Barr, namely the candidate item set Sarr, is empty, the two sets share no position value and cannot be combined into a new candidate item set;

(b) if the candidate item set Sarr is not empty, execute step (two);

(two) Calculate the size of the candidate item set Sarr, and divide it by the total number of transactions to obtain the support degree of the candidate item set Sarr;

(three) Judge whether the item set corresponding to the candidate item set Sarr is a frequent item set according to the support degree of Sarr and the minimum support degree:

(a) if yes, retain the candidate item set for use in the next connection process and jump to step (one), until no further frequent item set can be found;

(b) if not, delete the candidate item set Sarr to save storage space.
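Steps (one) to (three) can be combined into an iterative mining loop, sketched below. Several details are assumptions rather than statements of the source: the text does not say which pairs of item sets are joined at each level, so the sketch joins item sets that share all but their last item (the usual Apriori-style join); the names `intersect` and `mine_frequent_itemsets`, the tuple keys, and the five-transaction data are illustrative.

```python
def intersect(a_arr, b_arr):
    """Two-pointer intersection of two ascending index lists."""
    out, m, n = [], 0, 0
    while m < len(a_arr) and n < len(b_arr):
        if a_arr[m] == b_arr[n]:
            out.append(a_arr[m])
            m += 1
            n += 1
        elif a_arr[m] < b_arr[n]:
            m += 1
        else:
            n += 1
    return out

def mine_frequent_itemsets(storage, total, min_support):
    """storage maps 1-item tuples to ascending index lists; returns every
    frequent item set together with its storage set."""
    frequent = {}
    level = {k: v for k, v in storage.items() if len(v) / total >= min_support}
    while level:                                   # stop when no frequent set remains
        frequent.update(level)
        keys = sorted(level)
        next_level = {}
        for i, x in enumerate(keys):
            for y in keys[i + 1:]:
                if x[:-1] != y[:-1]:               # assumed join rule: shared prefix
                    continue
                s_arr = intersect(level[x], level[y])             # step (one)
                if s_arr and len(s_arr) / total >= min_support:   # steps (two), (three)
                    next_level[x + (y[-1],)] = s_arr              # retained for next join
                # otherwise the candidate is simply dropped
        level = next_level
    return frequent

storage = {("a",): [1, 2, 3], ("b",): [1, 3, 5], ("c",): [2, 4],
           ("d",): [2, 5], ("e",): [1, 3, 4]}
for itemset, tids in sorted(mine_frequent_itemsets(storage, 5, 0.4).items()):
    print(itemset, tids)
```

The database is never rescanned: every level is produced only from the storage sets retained at the previous level.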

The terms used above are defined as follows.

Item set (itemset): a non-empty set of items, which may be denoted (l1, l2, …, ln), where each lk represents an item.

Transaction: the set of all items that a customer purchases in one transaction.

Frequent item set: an item set whose support degree is greater than or equal to the set support threshold.

Candidate item set: an item set used for obtaining frequent item sets; a candidate item set that meets the support degree condition is retained rather than discarded.

Support degree: the probability that the item set {X, Y} appears among all transactions.
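The support degree can be restated as a formula. The symbols are introduced here only for illustration: D denotes the transaction database, t_i its i-th transaction, and T(X) the storage set of index positions for item set X.

```latex
\[
\operatorname{support}(X) \;=\; \frac{|T(X)|}{|D|},
\qquad
T(X) \;=\; \{\, i : X \subseteq t_i,\ t_i \in D \,\}
\]
```

For example, a storage set holding 2 index positions in a database of 5 transactions gives a support degree of 2/5 = 0.4.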

The association rule mining method based on effective value storage and operation modes has the following advantages:

firstly, transactions are stored in sets, and only the index positions of the effective value "1" are stored. An index position occupies more bits than one bitmap entry, but when the data contains a large number of "0" values, set storage avoids the memory wasted on storing the null "0" values and effectively saves memory space;

secondly, with the connection algorithm designed for the effective value storage mode, the data in the storage sets participates directly in the operation, which avoids a complicated decompression process; the process is simple and efficient, and space is saved without greatly reducing the efficiency of the algorithm;

thirdly, after the transaction database is scanned once, the transactions are stored in sets and the frequent single-item sets that meet the minimum support degree are found; each frequent single-item set corresponds to one storage set, which stores the index positions of the transactions in which that frequent single item occurs;

and fourthly, the invention stores frequent single items in storage sets and operates directly on the sets during pruning and connection operations, thereby avoiding repeated scanning of the database.
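The trade-off in the first advantage can be made concrete with a rough size estimate. The sketch assumes an uncompressed bitmap costing one bit per transaction and an index set costing a fixed 32 bits per stored position, and it ignores container overhead; these figures and the function names are assumptions, not values from the source. Under them, the index set is smaller whenever the item occurs in fewer than 1/32 (about 3.1%) of the transactions.

```python
def bitmap_bits(total_transactions):
    """Uncompressed bitmap: one bit per transaction, whether the value is 0 or 1."""
    return total_transactions

def index_set_bits(occurrences, bits_per_index=32):
    """Index set: one fixed-width integer per transaction that contains the item."""
    return occurrences * bits_per_index

total = 1_000_000
for occurrences in (1_000, 31_250, 100_000):
    density = occurrences / total
    print(f"density {density:.4%}: bitmap {bitmap_bits(total)} bits, "
          f"index set {index_set_bits(occurrences)} bits")
```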

Drawings

The invention is further described below with reference to the accompanying drawings.

FIG. 1 is a schematic diagram of the connection operation performed on two storage sets to select candidate item sets.

Detailed Description

The association rule mining method based on effective value storage and operation modes according to the present invention is described in detail below with reference to the accompanying drawings and specific embodiments.

Example 1:

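The association rule mining method based on effective value storage and operation modes of this embodiment comprises the following steps:

S1, storage based on a set of effective values: a storage set is set up, and the storage set stores the index positions of the transactions in which a frequent single item occurs, namely the position values of the effective value 1; the specific steps are as follows:

S101, scan the transaction database once, and set up a storage set for each frequent single item in the transaction set;

S102, store in the storage set the index positions of the transactions in which the frequent single item occurs, namely the position values of the effective value 1. Because the data is held in the storage sets, pruning and connection operations act directly on the sets, which avoids repeated scanning of the database.

S2, connection operation based on the effective value storage structure: a connection operation is performed on two storage sets to generate candidate item sets; the specific steps are as follows:

S201, searching for frequent single-item sets: after the storage sets have been generated from the database, find the frequent single-item sets according to whether the support degree of each storage set reaches the minimum support degree;

S202, taking the intersection of two storage sets to obtain a new set that stores a two-item set: perform the connection operation on the frequent single-item sets, namely perform the intersection operation on the storage sets of the single-item sets, to obtain a new set that stores the two-item set;

S203, producing frequent item sets: connect the storage sets generated in step S1 to obtain two-item sets, judge by the minimum support degree whether each two-item set is a frequent item set, and iteratively produce frequent item sets in turn:

(a) if so, the iterative operation continues in turn, that is, step S203 is executed repeatedly until no further frequent item set is found.

As shown in FIG. 1, in step S2, the principle of selecting a candidate item set by performing the connection operation on two storage sets is as follows: an intersection operation is executed on the set Aarr and the set Barr, both of which are arranged in order. The specific algorithm is as follows:

(1) Set the index value m for traversing the set Aarr to 0, and set the index value n for traversing the set Barr to 0;

(2) Judge the size relation between Aarr.get(m) and Barr.get(n):

(a) if Aarr.get(m) is greater than Barr.get(n), jump to step (3);

(b) if Aarr.get(m) is less than Barr.get(n), jump to step (4);

(c) if Aarr.get(m) is equal to Barr.get(n), jump to step (5);

(3) Let n equal n+1, and determine whether the value of n has gone beyond the last element of the set Barr:

(a) if yes, jump to step (6);

(b) if not, jump to step (2);

(4) Let m equal m+1, and determine whether the value of m has gone beyond the last element of the set Aarr:

(a) if yes, jump to step (6);

(b) if not, jump to step (2);

(5) Add the value at index n in the set Barr (equivalently, the value at index m in the set Aarr) to the candidate item set Sarr, let n equal n+1 and m equal m+1, and determine whether the value of m or n has gone beyond the last element of its set:

(a) if yes, jump to step (6);

(b) if not, jump to step (2);

(6) End the operation. The size of the candidate item set Sarr is the support degree count of the candidate item set.

The storage sets are ordered according to the dictionary order, from which the following is known:

when Aarr.get(m) is smaller than Barr.get(n), every value after Barr.get(n) is, by the dictionary order, also larger than Aarr.get(m), so no value after Barr.get(n) is identical to Aarr.get(m). Aarr.get(m) therefore does not need to be compared with the values after Barr.get(n), and the comparison continues with Aarr.get(m+1) against Barr.get(n) and the values after it.

Following the principle of selecting candidate item sets by performing the connection operation on two storage sets, the specific steps for completing the connection operation based on the effective value storage structure in step S2 are as follows:

(one) Calculate the intersection of the sets Aarr and Barr, and judge whether the two sets share any position value:

(a) if the intersection of the set Aarr and the set Barr, namely the candidate item set Sarr, is empty, the two sets share no position value and cannot be combined into a new candidate item set;

(b) if the candidate item set Sarr is not empty, execute step (two);

(two) Calculate the size of the candidate item set Sarr, and divide it by the total number of transactions to obtain the support degree of the candidate item set Sarr;

(three) Judge whether the item set corresponding to the candidate item set Sarr is a frequent item set according to the support degree of Sarr and the minimum support degree:

(a) if yes, retain the candidate item set for use in the next connection process and jump to step (one), until no further frequent item set can be found;

(b) if not, delete the candidate item set Sarr to save storage space.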

Example 2:

Taking supermarket commodity data as an example, Table 1 shows the commodity purchase records. For convenience of use and explanation, let a represent orange juice, b represent coke, c represent milk, d represent window cleaner, and e represent bread. The simplified transaction database is shown in Table 2.

Table 1 Commodity purchase records

Customer	Items
1	Orange juice, coke, bread
2	Milk, orange juice, window cleaner
3	Orange juice, coke, bread
4	Milk, bread
5	Coke, window cleaner

Table 2 shows the simplified transaction database, in which Tid identifies the transaction and Itemset is the set of items contained in the transaction. Item a appears in transactions 1, 2 and 3, so only these index positions are stored in the set for item a. Item b appears in transactions 1, 3 and 5, and again only these index positions are stored in the set for item b. The remaining items, such as item c and item d, are handled in the same way; see Table 3 for details.

Table 2 Transaction database

Tid	Itemset
1	a b e
2	a c d
3	a b e
4	c e
5	b d

Table 3 Set storage data
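As a sketch, the storage sets implied by Table 2 can be derived as follows. The dictionary `table2` transcribes Table 2 with 1-based transaction identifiers; reading row 4 as the items c and e is an assumption based on Table 1 (milk, bread), and the variable names are illustrative.

```python
# Table 2 transcribed as Tid -> items (row 4 read as "c e", an assumption).
table2 = {1: "abe", 2: "acd", 3: "abe", 4: "ce", 5: "bd"}

storage = {}
for tid in sorted(table2):                 # visit transactions in ascending Tid order
    for item in table2[tid]:
        storage.setdefault(item, []).append(tid)

for item in sorted(storage):
    print(item, storage[item])
```

For items a and b this reproduces the positions stated in the text: 1, 2, 3 and 1, 3, 5.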

In this embodiment, the connection process of association rule mining based on the effective value structure is completed as follows:

the data in the transaction database is stored in the set-based storage mode of this embodiment, and the frequent items are connected one by one by taking intersections to obtain candidate two-item sets. Each candidate item set is judged against the set minimum support degree to determine whether it is a frequent item set; if it is, iterative connection continues until no further frequent item set can be found.

As shown in Table 4, the single-item sets {a} and {b} are connected by taking the intersection of their storage sets, resulting in the two-item set {ab}.

Table 4 Connection operation on set storage
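The join of {a} and {b} can be traced with the index positions given above for item a (1, 2, 3) and item b (1, 3, 5). The helper `join_with_trace` is a hypothetical name, and the sketch only records which comparison was made at each step. The source does not state the minimum support degree used, so the sketch stops at the support degree itself.

```python
def join_with_trace(a_arr, b_arr):
    """Two-pointer intersection that also records each comparison made."""
    m = n = 0
    s_arr, trace = [], []
    while m < len(a_arr) and n < len(b_arr):
        x, y = a_arr[m], b_arr[n]
        if x == y:
            s_arr.append(x)
            trace.append(f"{x} = {y}: keep {x}, advance both")
            m += 1
            n += 1
        elif x < y:
            trace.append(f"{x} < {y}: advance Aarr")
            m += 1
        else:
            trace.append(f"{x} > {y}: advance Barr")
            n += 1
    return s_arr, trace

s_arr, trace = join_with_trace([1, 2, 3], [1, 3, 5])
for line in trace:
    print(line)
print("storage set of {ab}:", s_arr, "support degree:", len(s_arr) / 5)
```

Three comparisons suffice here, and the resulting storage set for {ab} holds positions 1 and 3, a support degree of 2/5.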

Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the invention.

Claims

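1. An association rule mining method based on effective value storage and operation modes, characterized in that the method is used for finding interesting associations among data in large data sets such as supermarket transactions; the method comprises the following steps:

S1, storage based on a set of effective values: a storage set is set up, wherein the storage set only stores the index positions of the transactions in which a frequent single item occurs, namely the position values of the effective value 1; the specific steps are as follows:

S101, scanning the transaction database once, and setting up a storage set for each frequent single item in the transaction set;

S102, storing in the storage set the index positions of the transactions in which the frequent single item occurs, namely the position values of the effective value 1;

S2, connection operation based on the effective value storage structure: performing a connection operation on two storage sets to generate candidate item sets; the specific steps are as follows:

S201, searching for frequent single-item sets: after the storage sets have been generated from the database, finding the frequent single-item sets according to whether the support degree of each storage set reaches the minimum support degree;

S202, taking the intersection of two storage sets to obtain a new set that stores a two-item set: performing the connection operation on the frequent single-item sets, namely performing the intersection operation on the storage sets of the single-item sets, to obtain a new set that stores the two-item set;

S203, producing frequent item sets: connecting the storage sets generated in step S1 to obtain two-item sets, judging by the minimum support degree whether each two-item set is a frequent item set, and iteratively producing frequent item sets in turn:

(a) if yes, sequentially performing the iterative operation, namely repeatedly executing step S203 until no further frequent item set is found;

the principle of selecting candidate item sets by performing the connection operation on two storage sets is as follows: an intersection operation is executed on the set Aarr and the set Barr, both of which are arranged in order, and the specific algorithm is as follows:

(1) Setting the index value m for traversing the set Aarr to 0, and setting the index value n for traversing the set Barr to 0;

(2) Judging the size relation between Aarr.get(m) and Barr.get(n):

(a) if Aarr.get(m) is greater than Barr.get(n), jumping to step (3);

(b) if Aarr.get(m) is less than Barr.get(n), jumping to step (4);

(c) if Aarr.get(m) is equal to Barr.get(n), jumping to step (5);

(3) Letting n equal n+1, and determining whether the value of n has gone beyond the last element of the set Barr:

(a) if yes, jumping to step (6);

(b) if not, jumping to step (2);

(4) Letting m equal m+1, and determining whether the value of m has gone beyond the last element of the set Aarr:

(a) if yes, jumping to step (6);

(b) if not, jumping to step (2);

(5) Adding the value at index n in the set Barr (equivalently, the value at index m in the set Aarr) to the candidate item set Sarr, letting n equal n+1 and m equal m+1, and determining whether the value of m or n has gone beyond the last element of its set:

(a) if yes, jumping to step (6);

(b) if not, jumping to step (2);

(6) The calculation is finished, and the size of the candidate item set Sarr is obtained, which is the support degree count of the candidate item set;

the storage sets are ordered according to the dictionary order, from which the following is known:

when Aarr.get(m) is smaller than Barr.get(n), every value after Barr.get(n) is, by the dictionary order, also larger than Aarr.get(m), so that no value after Barr.get(n) is identical to Aarr.get(m); Aarr.get(m) does not need to be compared with the values after Barr.get(n), and the comparison continues with Aarr.get(m+1) against Barr.get(n) and the values after it;

following the principle of selecting candidate item sets by performing the connection operation on two storage sets, the specific steps for completing the connection operation based on the effective value storage structure in step S2 are as follows:

(one) calculating the intersection of the sets Aarr and Barr, and judging whether the two sets share any position value:

(a) if the intersection of the set Aarr and the set Barr, namely the candidate item set Sarr, is empty, the two sets share no position value and cannot be combined into a new candidate item set;

(b) if the candidate item set Sarr is not empty, executing step (two);

(two) calculating the size of the candidate item set Sarr, and dividing it by the total number of transactions to obtain the support degree of the candidate item set Sarr;

(three) judging whether the item set corresponding to the candidate item set Sarr is a frequent item set according to the support degree of Sarr and the minimum support degree:

(a) if yes, retaining the candidate item set for use in the next connection process and jumping to step (one), until no further frequent item set can be found;

(b) if not, deleting the candidate item set Sarr to save storage space.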