CN105260442A - Bit operation and inverted index based association rule mining algorithm

Info

Publication number: CN105260442A
Application number: CN201510644226.XA
Authority: CN
Inventors: 黄玉蕾; 郭运凯; 陈群英; 林青; 李艳
Original assignee: Xian Peihua University
Current assignee: Xian Peihua University
Priority date: 2015-10-08
Filing date: 2015-10-08
Publication date: 2016-01-20
Anticipated expiration: 2035-10-08
Also published as: CN105260442B

Abstract

The present invention discloses an association rule mining algorithm based on bit operations and an inverted index. Building on an analysis of the classical Apriori algorithm, it provides a frequent itemset mining algorithm, Apriori-BR, based on bit operations and an inverted index. The Apriori algorithm must scan the database repeatedly to calculate the support of candidate itemsets, and must repeatedly compare individual items when generating candidate itemsets, both of which reduce its efficiency. The improved Apriori-BR algorithm scans the database only twice during mining: once to obtain the frequent 1-itemsets and once to build the inverted index of the frequent items. Bit operations accelerate subset detection and thereby speed up the join step. The support of a candidate itemset is obtained by bit operations on the column vectors of its items, without scanning the database again, and an infrequent itemset can be detected early in the calculation so that the remaining operations are skipped, which further improves the efficiency of the algorithm.

Description

An association rule mining algorithm based on bit operations and an inverted index

Technical field

The present invention relates to database algorithms, and in particular to an association rule mining algorithm based on bit operations and an inverted index.

Background art

The Apriori algorithm is one of the most influential association rule mining algorithms. It finds, in a data set, the association rules that satisfy a user-specified minimum confidence and minimum support. Since Apriori was proposed, a large number of association rule algorithms have followed, and they fall into two broad classes. The first class consists of Apriori and its improved variants; these algorithms are simple to implement but must scan the database multiple times. The second class consists of FP-Growth, proposed by Han [2], and its improved variants; these algorithms generate no candidate itemsets and compress the database into an FP-Tree. Their advantage is high efficiency, and their drawback is a more complex implementation. To address the shortcomings of Apriori-like algorithms, researchers have proposed several optimizations: numeric encoding with pruning, bottom-up matrices with transaction deletion, reducing the number of joins, and deleting transactions. These optimizations improve the performance of the traditional Apriori algorithm, but they also increase its complexity.

The Apriori algorithm mines rules bottom-up, using the k-itemsets to generate the (k+1)-itemsets. To improve efficiency, Apriori exploits a key property: every subset of a frequent itemset is itself frequent, which greatly reduces the search space. The detailed steps are given in reference [1].

The traditional Apriori algorithm has the following weaknesses:

(1) When generating candidate itemsets, repeated comparisons are needed to decide whether the join condition is met.

(2) Testing the subsets of a candidate also requires repeated comparisons.

(3) The support of a candidate itemset can only be obtained by a full scan of the database, so every candidate requires one full scan.

Summary of the invention

The object of the present invention is to solve the above problems by providing an association rule mining algorithm based on bit operations and an inverted index.

The present invention achieves the above object through the following technical solution:

The present invention comprises bit operations, building an inverted index, and fast joining. The bit operations comprise the following steps:

(1) The column vector of each item I_j in the transaction database D is defined as R_j = (t_1j, t_2j, ..., t_nj), where n is the total number of transactions in D. If the transaction with Tid = m contains item I_j, then t_mj = 1; otherwise t_mj = 0. For the transaction database D, the column vector of item c is R_c = (111011)^T. Once this vector storage is introduced, bit operations can be used to compute support counts and to test subsets;

(2) Computing the support count of items I_j and I_k is equivalent to counting the 1 bits in the bitwise AND of the vectors R_j and R_k;

(3) Testing whether vector R_j is a subset of R_k is equivalent to testing R_j == (R_j & R_k), where "&" is the bitwise AND operation;
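Steps (1) to (3) can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the function names, the two Tid lists, and the choice to store each column vector as an integer with Tid 1 in the leftmost bit are all assumptions made for the example.

```python
def column_vector(tids, n):
    """Pack the Tids that contain an item into an n-bit integer (Tid 1 = leftmost bit)."""
    vec = 0
    for tid in tids:
        vec |= 1 << (n - tid)
    return vec


def support_count(r_j, r_k):
    """Step (2): count the 1 bits in the bitwise AND of two column vectors."""
    return bin(r_j & r_k).count("1")


def is_subset(r_j, r_k):
    """Step (3): r_j is a subset of r_k when ANDing with r_k leaves r_j unchanged."""
    return r_j == (r_j & r_k)


# Hypothetical Tid lists for two items in a database of n = 6 transactions.
r_a = column_vector([1, 2, 3, 5, 6], n=6)  # bit pattern 111011
r_b = column_vector([1, 3, 5], n=6)        # bit pattern 101010

print(format(r_a, "06b"), format(r_b, "06b"))
print(support_count(r_a, r_b))
print(is_subset(r_b, r_a), is_subset(r_a, r_b))
```

Storing each column as a single integer means that one AND instruction compares many transactions at once, which is the source of the speed-up the text describes.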

Building the inverted index: in the traditional approach, obtaining a support count requires a full scan of the database in which every transaction is compared one by one, which greatly increases the complexity of the algorithm. If the database D has N transactions with an average transaction length of m, then computing the support of one candidate itemset C_k costs O(k*m*N). For this reason an inverted index is introduced for each item: the Tids of the transactions containing the item are collected and grouped by the length of the transaction each Tid belongs to;

Fast joining: each frequent k-itemset is stored in two parts, named first and second, where first stores the first k-1 items as a single bit vector and second stores the k-th item as its bit weight. When joining, if the first parts of two itemsets are equal and the bit weights in their second parts differ, a new candidate is generated directly, and the smaller of the two bit weights is stored in its second part. For example, joining the frequent itemsets fdb and fde yields the new candidate fdbe, with first = 10110 and second = 1 (bit weight). With this bit-string representation, the leading parts are compared in a single operation when a candidate is generated, with no need to look up and compare each item separately, which improves the efficiency of the join.

The beneficial effects of the present invention are as follows:

The present invention is an association rule mining algorithm based on bit operations and an inverted index. Compared with the prior art, the present invention

Brief description of the drawings

Fig. 1 shows the inverted index corresponding to the transaction database of the present invention;

Fig. 2 compares the running times of the three algorithms on the accidents data set;

Fig. 3 compares the running times of the three algorithms on the mushroom data set;

Fig. 4 compares the running times of the three algorithms on the T70I20D100K data set.

Detailed description of the embodiments

The invention is further described below with reference to the accompanying drawings:

To illustrate the idea behind the improvement, a transaction database D is used as an example. With minsup = 50%, the minimum support count is 3.

Table 1: Transaction database D

The frequent 1-itemsets are then L1 = {f:5, c:5, d:4, b:3, e:3}.

The present invention comprises bit operations, building an inverted index, and fast joining. Bit operations are introduced to make subset detection more efficient; they comprise steps (1) to (3) set out in the summary above.

Building the inverted index: in the traditional approach, obtaining a support count requires a full scan of the database in which every transaction is compared one by one, which greatly increases the complexity of the algorithm. If the database D has N transactions with an average transaction length of m, then computing the support of one candidate itemset C_k costs O(k*m*N). For this reason an inverted index is introduced for each item: the Tids of the transactions containing the item are collected and grouped by the length of the transaction each Tid belongs to. The inverted index corresponding to the transaction database D is shown in Fig. 1.

The inverted index built here consists of three parts. The first part is the bit weight of each frequent item, which marks the item's position in the binary representations used below. The second part is the head node of the frequent item, which holds the item and its support count. The third part is the inverted index nodes, each consisting of a transaction length and the list of Tids of the transactions with that length. In other words, the Tids are stored in buckets by transaction length.
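The three-part structure can be sketched as below. This is only an illustration under stated assumptions: the four-transaction database `db` is hypothetical (it is not Table 1), the function name `build_inverted_index` is invented for the example, ties in support are broken alphabetically, and "transaction length" is taken to mean the number of frequent items in a transaction, which the text does not specify.

```python
from collections import defaultdict


def build_inverted_index(transactions, min_count):
    """Return (bit weights, support counts, length-bucketed Tid lists) for the frequent items."""
    # Scan 1: support count of every item.
    counts = defaultdict(int)
    for items in transactions.values():
        for item in set(items):
            counts[item] += 1
    # Frequent items in descending support order (alphabetical tie-break: an assumption).
    frequent = sorted((i for i in counts if counts[i] >= min_count),
                      key=lambda i: (-counts[i], i))
    # Part 1: bit weight of each frequent item, highest weight first.
    weights = {item: 1 << (len(frequent) - 1 - pos)
               for pos, item in enumerate(frequent)}
    # Part 2: head node content, i.e. the item and its support count.
    support = {item: counts[item] for item in frequent}
    # Part 3 (scan 2): Tids of each item, bucketed by transaction length.
    index = {item: {} for item in frequent}
    for tid, items in transactions.items():
        kept = [item for item in set(items) if item in index]
        for item in kept:
            index[item].setdefault(len(kept), []).append(tid)
    return weights, support, index


# Hypothetical toy database: Tid -> items (each character is one item).
db = {1: "abc", 2: "ac", 3: "ad", 4: "bc"}
weights, support, index = build_inverted_index(db, min_count=2)
print(weights)
print(support)
print(index)
```

Bucketing by length lets a later pass skip whole buckets of short transactions instead of testing each Tid individually.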

Fast joining: when the Apriori algorithm performs the join step to generate the (k+1)-item candidate itemsets, it must find two itemsets l_1 and l_2 in L_k. Let l_i[j] denote the j-th item of l_i. If the first k-1 items of l_1 and l_2 are identical and l_1[k] < l_2[k], then l_1 and l_2 are joinable, and joining them produces the itemset l_1[1] l_1[2] ... l_1[k] l_2[k]. As this process shows, generating one (k+1)-item candidate takes k comparisons even in the best case. In the worst case, a frequent k-itemset is compared k times against every itemset in L_k and never matches, which makes the algorithm very inefficient.
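For reference, the classical item-by-item join just described might look like the following sketch. The tuple representation, the function name `classic_join`, and the use of plain string ordering for the items are assumptions made for illustration.

```python
def classic_join(l1, l2):
    """Join two k-itemsets (tuples of items in a fixed order) the classical way.

    Returns the (k+1)-item candidate, or None when the itemsets are not joinable.
    """
    k = len(l1)
    # Compare the first k-1 items one at a time.
    for j in range(k - 1):
        if l1[j] != l2[j]:
            return None
    # One more comparison on the k-th item: require l1[k] < l2[k].
    if l1[k - 1] < l2[k - 1]:
        return l1 + (l2[k - 1],)
    return None


print(classic_join(("a", "b", "c"), ("a", "b", "d")))
print(classic_join(("a", "b", "c"), ("a", "x", "d")))
```

Each successful join costs k item comparisons here, which is the overhead the two-part bit representation is meant to remove.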

Therefore, each frequent k-itemset is stored in two parts, named first and second, where first stores the first k-1 items as a single bit vector and second stores the k-th item. For example, for the frequent itemset A = (fdb), first is (10100) and second is 2 (bit weight); for the frequent itemset B = (fde), first is (10100) and second is 1 (bit weight). When joining, if the first parts of two itemsets are equal and the bit weights in their second parts differ, a new candidate is generated directly, and the smaller of the two bit weights is stored in its second part. In this example the new candidate is fdbe, with first = 10110 and second = 1 (bit weight). With this bit-string representation, the leading parts are compared in a single operation when a candidate is generated, with no need to look up and compare each item separately, which improves the efficiency of the join.
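The two-part join can be sketched as follows, using the values from the example above (bit weights f = 16, c = 8, d = 4, b = 2, e = 1). Representing an itemset as a `(first, second)` tuple of integers and the function name `fast_join` are assumptions made for this illustration.

```python
def fast_join(a, b):
    """Join two itemsets stored as (first, second) pairs of integers.

    first  : bit mask of the first k-1 items
    second : bit weight of the k-th item
    Returns the (first, second) pair of the new candidate, or None.
    """
    first_a, second_a = a
    first_b, second_b = b
    # One integer comparison replaces the item-by-item prefix comparison.
    if first_a != first_b or second_a == second_b:
        return None
    high, low = max(second_a, second_b), min(second_a, second_b)
    # The larger weight moves into the prefix; the smaller one becomes second.
    return (first_a | high, low)


fdb = (0b10100, 2)  # first = f, d; second = b
fde = (0b10100, 1)  # first = f, d; second = e
first, second = fast_join(fdb, fde)
print(format(first, "05b"), second)
```

Because every item in the prefix has a higher bit weight than the last item, ORing the larger `second` into `first` keeps the items in weight order without any sorting.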

Relevant properties:

Property 1: If a transaction T in the transaction database D has length less than (k+1), then T need not be scanned for candidate itemsets of length greater than k.

Property 2: The support count of an itemset in the database equals the number of 1 bits in the bitwise AND of the column vectors of all items in the itemset.

By Property 2, computing the support count of a candidate itemset no longer requires repeated full scans of the database; it reduces to bitwise AND operations on vectors. That is, the support count equals the number of 1 bits obtained by ANDing the column vector of the first item with the column vectors of the other items in turn.

Optimization strategies:

Based on Property 1 and Property 2, the following optimization strategies are proposed:

Strategy 1: When generating column vectors for mining frequent k-itemsets, each item's column vector only needs to be built from the inverted index lists whose transaction length is not less than k.

Strategy 2: When computing the support count of a k-item candidate, if the number of 1 bits after the bitwise AND of the column vectors of any two of its items is less than the minimum support count, then the support count of the candidate is also below the minimum support count; it is not a frequent itemset and can be deleted directly.

Strategy 3: Use bit operations to speed up the computation.
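Strategies 1 and 2 might be sketched as follows. The function names, the bucket contents, and the test vectors are hypothetical. The early-exit threshold is assumed to be the minimum support count, which matches the worked example in which {fb} is deleted at a count of 2 < 3.

```python
def level_vector(buckets, k, n):
    """Strategy 1: build a column vector only from buckets with transaction length >= k.

    buckets maps a transaction length to the Tids of that length; Tid 1 is the leftmost bit.
    """
    vec = 0
    for length, tids in buckets.items():
        if length >= k:
            for tid in tids:
                vec |= 1 << (n - tid)
    return vec


def support_or_none(vectors, min_count):
    """Strategy 2: AND the column vectors in turn and stop once the candidate is infrequent.

    Returns the support count, or None when it is below min_count.
    """
    acc = vectors[0]
    for vec in vectors[1:]:
        acc &= vec
        if bin(acc).count("1") < min_count:
            return None  # infrequent: skip the remaining AND operations
    count = bin(acc).count("1")
    return count if count >= min_count else None


# Hypothetical buckets for one item in a 4-transaction database.
buckets = {3: [1], 2: [2, 4], 1: [3]}
print(format(level_vector(buckets, k=2, n=4), "04b"))
print(support_or_none([0b1101, 0b0111, 0b0101], min_count=2))
print(support_or_none([0b1100, 0b0011, 0b1111], min_count=1))
```

Dropping the short buckets is safe by Property 1, and the early exit is safe because further AND operations can only clear bits, never set them.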

Algorithm description:

The Apriori-BR algorithm proceeds as follows. The database is scanned in full once to obtain the frequent 1-itemsets, then scanned a second time to build the inverted index. The following steps are then performed: starting from k = 1, the set C_{k+1} of candidates of length k+1 is generated iteratively. The first step is the self-join of L_k: itemsets are selected from L_k and joined to produce candidates of length k+1, and the itemsets taking part in a join must share the same (k-1)-prefix. The second step is pruning: when a candidate itemset is about to be added to C_{k+1}, it is deleted if its subsets of length k are not all in L_k. When the support of a candidate of length k+1 is computed, Strategy 2 is used for further pruning. After pruning, all frequent (k+1)-itemsets found are added to L_{k+1}. The process terminates when C_{k+1} is empty.

Algorithm Apriori-BR. Input: transaction database D and minimum support count threshold minsup. Output: the frequent itemsets L.
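The flow described above can be put together in a compact sketch. It is not the patent's listing: the function names, the alphabetical tie-break when assigning bit weights, and the five-transaction test database are assumptions, and the length buckets of Strategy 1 are left out to keep the example short (column vectors are built directly in the second scan).

```python
from itertools import combinations


def popcount(x):
    return bin(x).count("1")


def set_bits(mask):
    """Yield each set bit of mask as a power of two, lowest first."""
    while mask:
        low = mask & -mask
        yield low
        mask ^= low


def apriori_br(transactions, min_count):
    """Return {frozenset(items): support count} for every frequent itemset."""
    n = len(transactions)
    # Scan 1: frequent 1-itemsets.
    counts = {}
    for items in transactions:
        for item in set(items):
            counts[item] = counts.get(item, 0) + 1
    frequent = sorted((i for i in counts if counts[i] >= min_count),
                      key=lambda i: (-counts[i], i))
    weight = {item: 1 << (len(frequent) - 1 - pos)
              for pos, item in enumerate(frequent)}
    name = {w: item for item, w in weight.items()}
    # Scan 2: one column vector per frequent item, keyed by bit weight.
    column = {w: 0 for w in name}
    for tid, items in enumerate(transactions, start=1):
        for item in set(items):
            if item in weight:
                column[weight[item]] |= 1 << (n - tid)

    result = {frozenset([i]): counts[i] for i in frequent}
    level = [(0, weight[i]) for i in frequent]  # L_k as (first, second) pairs
    while level:
        known = {first | second for first, second in level}
        next_level = []
        for (first_a, sec_a), (first_b, sec_b) in combinations(level, 2):
            if first_a != first_b:  # join condition: equal (k-1)-prefix
                continue
            mask = first_a | sec_a | sec_b
            # Prune: every k-subset of the candidate must be in L_k.
            if any((mask ^ bit) not in known for bit in set_bits(mask)):
                continue
            # Support by bitwise AND, stopping early once infrequent.
            vec = (1 << n) - 1
            for bit in set_bits(mask):
                vec &= column[bit]
                if popcount(vec) < min_count:
                    break
            else:
                next_level.append((first_a | max(sec_a, sec_b), min(sec_a, sec_b)))
                result[frozenset(name[b] for b in set_bits(mask))] = popcount(vec)
        level = next_level
    return result


# Hypothetical database; Tid m is the m-th entry.
demo = apriori_br(["abc", "ac", "ad", "bc", "abc"], min_count=3)
for itemset, count in demo.items():
    print(sorted(itemset), count)
```

After the two initial scans the loop touches only integers, so the cost of each candidate depends on the number of items in it rather than on the size of the database.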

Algorithm example:

For the database D of Table 1, with a minimum support threshold of 50%, the minimum support count is minsup = 3, and the mining process is as follows.

First, the database is scanned twice to build the inverted index shown in Fig. 1 and to obtain the frequent 1-itemsets L1 = {f:5, c:5, d:4, b:3, e:3}.

The join then begins, generating the 2-item candidate itemsets C2. Take items f and c as an example: f.first = (00000), f.second = 16, c.first = (00000), c.second = 8. Since f.first = c.first, the join produces the 2-item candidate {fc}, with {fc}.first = (10000) and {fc}.second = 8. All other candidates are generated in the same way, giving C2 = {fc, fd, fb, fe, cd, cb, ce, db, de, be}.

The computation of support is illustrated with {fc} and {fb}. For {fc}, the inverted index of item f is (4, 1, 3, 5, 2), with column vector R1 = (111110)^T, and the inverted index of item c is (1, 3, 5, 2), with column vector R2 = (111010)^T. Then R1 & R2 = (111010)^T and (R1 & R2).count = 4 > 3, so {fc} is a frequent itemset. For {fb}, the inverted index of item b is (4, 6, 2), with column vector R3 = (010101)^T. Then R1 & R3 = (010100)^T and (R1 & R3).count = 2 < 3, so {fb} is not a frequent itemset and is deleted.

Continuing in this way yields the frequent 2-itemsets L2 = {fc:4, fd:4, cd:4}. The frequent 3-itemsets L3 = {fcd:4} are mined in the same way, and the algorithm terminates.
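The arithmetic of this example can be checked with a few lines of Python. The Tid lists are the ones printed in the example; the helper name `to_vector` and the convention that Tid 1 is the leftmost of six bits are assumptions chosen so that the bit strings match the text.

```python
def to_vector(tids, n=6):
    """Turn a Tid list from the inverted index into an n-bit column vector."""
    vec = 0
    for tid in tids:
        vec |= 1 << (n - tid)
    return vec


MIN_COUNT = 3
r1 = to_vector([4, 1, 3, 5, 2])  # item f
r2 = to_vector([1, 3, 5, 2])     # item c, as printed in the example
r3 = to_vector([4, 6, 2])        # item b

for label, vec in (("fc", r1 & r2), ("fb", r1 & r3)):
    count = bin(vec).count("1")
    verdict = "frequent" if count >= MIN_COUNT else "deleted"
    print(label, format(vec, "06b"), count, verdict)
```

Running it reproduces the two results in the text: {fc} gives 111010 with a count of 4, and {fb} gives 010100 with a count of 2.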

Experimental results and analysis:

To test the performance of the optimized algorithm, experiments were run on a machine with an i5-4200U 2.5 GHz CPU and 2048 MB of DDR RAM, running Windows XP Professional SP2. The Apriori, Apriori-BR, and FP-Growth algorithms were implemented in VC++ 6.0. Two kinds of experimental data were used: real data (accidents, mushroom) and synthetic data (T70I20D100K). The real data comes from the FIMI repository (http://fimi.ua.ac.be/data/). The synthetic data was produced with the generator from the IBM Almaden laboratory.

The accidents data set contains traffic accident records from the Flanders region for 1991 to 2000. It is 34678 KB in size, with 340184 transactions and 572 distinct items, and each transaction contains 45 items on average. The mushroom data set is 557 KB in size, with 119 items and 8124 records, and each transaction contains 23 distinct items on average. T70I20D100K denotes a data set with 70 items in total, an average transaction length of 20, and 100000 records.

The test results on the accidents data set are shown in Fig. 2, those on the mushroom data set in Fig. 3, and those on the T70I20D100K data set in Fig. 4.

Table 2: Description of the test databases

As the experimental results show, in every test the running time of each algorithm grows as the minimum support decreases. Fig. 2 shows that the running time of the traditional Apriori algorithm varies far more than that of the other two algorithms. The reason is that, when the support is small, Apriori generates a large number of candidate itemsets and each candidate needs a full database scan to obtain its support, so a great deal of time is spent on joining candidates, testing subsets, and obtaining support. The FP-Growth algorithm performs better in the experiments: it uses an FP-Tree to compress the database substantially and does not need to generate candidate itemsets, which lets it perform well on dense databases. The Apriori-BR algorithm proposed here also performs well, mainly because it scans the database only twice, builds an inverted index, and uses bit operations to accelerate subset detection. The support count of a candidate itemset is obtained without scanning the database, and the remaining comparisons stop as soon as an infrequent itemset is detected, which lowers the cost of the algorithm.

Conclusion

By analyzing the traditional Apriori algorithm, the present invention proposes Apriori-BR, a frequent itemset mining algorithm based on bit operations and an inverted index. The Apriori algorithm must scan the database repeatedly to compute the support of candidate itemsets, and must repeatedly compare individual items when generating candidate itemsets, which reduces its efficiency. The improved Apriori-BR algorithm scans the database only twice during mining, to obtain the frequent 1-itemsets and to build the inverted index of the frequent items. Bit operations accelerate subset detection and speed up the join. The support of a candidate itemset is obtained without scanning the database again, using only bit operations on the column vectors of the candidate's items, and infrequent itemsets can be detected early in the computation so that the remaining operations are skipped, which further improves the efficiency of the algorithm.

The foregoing shows and describes the basic principles, main features, and advantages of the present invention. Those skilled in the art should understand that the present invention is not limited to the embodiments described above; the embodiments and the description merely illustrate its principles. Various changes and improvements may be made without departing from the spirit and scope of the invention, and all such changes and improvements fall within the claimed scope. The scope of protection claimed is defined by the appended claims and their equivalents.

Claims

1. An association rule mining algorithm based on bit operations and an inverted index, characterized in that it comprises bit operations, building an inverted index, and fast joining, wherein the bit operations comprise the following steps: