CN107798014A

CN107798014A - Frequent itemset data mining method that takes local samples into account

Info

Publication number: CN107798014A
Application number: CN201610802933.1A
Authority: CN
Inventors: 柴明亮; 高冰; 宋宝宇; 李连成; 刘宝权; 张岩; 宋君; 王靖震; 杨东晓; 费静
Original assignee: Angang Steel Co Ltd
Current assignee: Angang Steel Co Ltd
Priority date: 2016-09-06
Filing date: 2016-09-06
Publication date: 2018-03-13

Abstract

The present invention provides a frequent itemset data mining method that takes local samples into account. Under the competition principle, itemsets are ranked by support from high to low and accepted or rejected according to a cut-off count; under the total-count principle, the itemsets of each sample are accepted or rejected according to a percentage. The method carries out four steps in sequence: generation of the frequent 1-itemsets of the overall data sample, calculation of the average 1-itemset threshold of the local data samples, generation of the frequent k-itemsets of the overall data sample, and calculation of the average k-itemset threshold of the local data samples. The invention is LS-Apriori, a frequent itemset data mining algorithm based on the Apriori property. It applies the basic idea of the Apriori algorithm and, depending on how the average support of a local sample compares with the average support of the overall sample, uses either the competition principle or the total-count principle to find frequent itemsets. Local sample data are thereby taken into account within the Apriori algorithm, which effectively overcomes the defect that traditional Apriori cannot handle local optima well.

Description

A frequent itemset data mining method that takes local samples into account

Technical field

The invention belongs to the field of data mining methods, and more particularly relates to a frequent itemset data mining method that takes local samples into account.

Background art

The Apriori algorithm divides the discovery of association rules into two steps. The first step iteratively retrieves all frequent itemsets in the transaction database, that is, the itemsets whose support is not less than a user-set threshold. The second step uses the frequent itemsets to construct rules that satisfy the user's minimum confidence. Mining and identifying all frequent itemsets is the core of the algorithm and accounts for most of the total computation. Relying on the property that every subset of a frequent itemset must itself be frequent, Apriori builds larger itemsets from the known frequent itemsets, calls them candidate frequent itemsets, and then computes support only for these candidates. Because Apriori uses a manually set threshold chosen from human experience, the question arises whether that threshold matches the actual data being mined. Recent research has concentrated on making the manually set threshold match the actual data mining task, while very little work has addressed how Apriori can take local sample data into account. In practical applications, Apriori can find the global frequent itemsets but cannot reflect the frequent itemsets of local samples, and this situation is becoming increasingly common.
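The two-phase search described above (join known frequent itemsets into larger candidates, prune candidates that have an infrequent subset, then count support only for the survivors) can be sketched as follows. This is a minimal illustration of classical Apriori, not the method of the invention. The function name `apriori`, the parameter `min_count`, and the threshold value 3 are assumptions made for illustration; the transactions are those of sample 2 in the embodiment.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {itemset: support count} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]

    def frequent_among(candidates):
        # Count support only for the candidates, keep those at or above threshold.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_count}

    items = {item for t in transactions for item in t}
    current = frequent_among({frozenset([i]) for i in items})
    frequent = dict(current)
    k = 2
    while current:
        prev = set(current)
        # Join step: unions of frequent (k-1)-itemsets that contain k items.
        joined = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in joined
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        current = frequent_among(candidates)
        frequent.update(current)
        k += 1
    return frequent

# Sample 2 of the embodiment; min_count=3 is a hypothetical user-set threshold.
sample2 = [
    {"I2", "I5"}, {"I1", "I4"}, {"I1", "I3", "I5"},
    {"I1", "I2", "I5"}, {"I2", "I3", "I5"}, {"I1", "I3"},
    {"I2", "I4"}, {"I1", "I3", "I4"}, {"I1", "I2", "I4"},
]
frequent = apriori(sample2, min_count=3)
for itemset, n in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

With this single global threshold, an itemset is kept or dropped without regard to which local sample it comes from, which is the limitation the background describes.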

Summary of the invention

The present invention provides LS-Apriori, a frequent itemset data mining algorithm based on the Apriori property. Its purpose is to take local sample data fully into account and to overcome the defect that traditional Apriori cannot handle local optima well.

To this end, the technical solution adopted by the present invention is:

A frequent itemset data mining method that takes local samples into account, namely the LS-Apriori algorithm, a frequent itemset data mining algorithm based on the Apriori property. Competition principle: itemsets are ranked by support from high to low and accepted or rejected according to a cut-off count. Total-count principle: the itemsets of each sample are accepted or rejected according to a percentage. The specific method and steps are:

(1) Generation of the frequent 1-itemsets of the overall data sample: recombine the data samples; from the overall data sample, calculate the support of the candidate 1-itemsets C₁ and their average support ZS₁; determine the frequent 1-itemsets L₁, and record the number of itemsets in L₁ as M₁.

(2) Calculation of the average 1-itemset threshold of the local data samples: calculate the average 1-itemset support JS₁ from each local data sample. If JS₁ ≥ ZS₁, redetermine the frequent itemsets according to the competition principle. If JS₁ < ZS₁, the local sample mean is lower than the overall sample mean, which indicates that this local sample is a weakly supported sample; to take it into account, redetermine the frequent itemsets according to the total-count principle, with the total count taken as M₁/2.

(3) Generation of the frequent k-itemsets of the overall data sample: recombine the data samples; in step k, generate the candidate k-itemsets Cₖ with Apriori_gen from the frequent (k-1)-itemsets Lₖ₋₁ obtained in step k-1; from the overall data sample, calculate the support of the candidate k-itemsets Cₖ and their average support ZSₖ; determine the frequent k-itemsets Lₖ, and record the number of itemsets in Lₖ as Mₖ.

(4) Calculation of the average k-itemset threshold of the local data samples: calculate the average k-itemset support JSₖ from each local data sample. If JSₖ ≥ ZSₖ, redetermine the frequent k-itemsets according to the competition principle; if JSₖ < ZSₖ, redetermine the frequent k-itemsets according to the total-count principle, with the total count taken as Mₖ/2.
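The comparison at the heart of steps (2) and (4) can be sketched as follows: compute the average support ZS over the overall sample and JS over one local sample, then pick the competition principle when JS ≥ ZS and the total-count principle otherwise. This is a rough sketch under several assumptions, none of which are spelled out in the text: the average support is taken as the mean support over the candidates; the competition quota is a caller-supplied `cutoff`; the total-count quota is M/2 rounded up; and the overall sample is built from Tables 2 to 4 only. The helper names and the values `m=5` and `cutoff=3` are hypothetical.

```python
from fractions import Fraction
from math import ceil

def parse(rows):
    """'25 14' -> [frozenset({'I2', 'I5'}), frozenset({'I1', 'I4'})]."""
    return [frozenset("I" + d for d in row) for row in rows.split()]

def support_counts(transactions, candidates):
    """Number of transactions that contain each candidate itemset."""
    return {c: sum(1 for t in transactions if c <= t) for c in candidates}

def average_support(transactions, candidates):
    """Mean support of the candidates (support = count / transaction count)."""
    counts = support_counts(transactions, candidates)
    return Fraction(sum(counts.values()), len(transactions) * len(candidates))

def choose_principle(js, zs):
    """Competition if JS >= ZS, otherwise total-count."""
    return "competition" if js >= zs else "total-count"

def reselect(local, candidates, principle, m, cutoff):
    """Rank the local itemsets by support (high to low) and keep a quota.

    Assumption: the quota is `cutoff` under competition and ceil(m / 2)
    under total-count; ties are broken by item name.
    """
    counts = support_counts(local, candidates)
    ranked = sorted(counts, key=lambda c: (-counts[c], sorted(c)))
    quota = cutoff if principle == "competition" else ceil(m / 2)
    return ranked[:quota]

s2 = parse("25 14 135 125 235 13 24 134 124")    # Table 2
s3 = parse("15 25 235 134 125 45 23 1234 12")    # Table 3
s4 = parse("234 25 234 135 124 35 24 1235 15")   # Table 4
overall = s2 + s3 + s4   # assumption: sample 1 is left out of this sketch
c1 = [frozenset([f"I{i}"]) for i in range(1, 6)]  # candidate 1-itemsets

zs1 = average_support(overall, c1)
for name, sample in (("S2", s2), ("S4", s4)):
    js1 = average_support(sample, c1)
    principle = choose_principle(js1, zs1)
    kept = reselect(sample, c1, principle, m=5, cutoff=3)
    print(name, js1, principle, [sorted(c) for c in kept])
```

Because JS and ZS are divided by the same number of candidates, any constant scaling of the average leaves the choice of principle unchanged; only the comparison matters.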

The beneficial effects of the present invention are:

The present invention proposes LS-Apriori, a new frequent itemset data mining algorithm based on the Apriori property. The algorithm applies the basic idea of the Apriori algorithm and, depending on how the average support of a local sample compares with the average support of the overall sample, uses either the competition principle or the total-count principle to find frequent itemsets. Local sample data are thereby taken into account within the Apriori algorithm, which effectively overcomes the defect that traditional Apriori cannot handle local optima well.

Brief description of the drawings

Fig. 1 shows the process by which the LS-Apriori algorithm finds frequent itemsets;

Fig. 2 is a flow chart of the LS-Apriori algorithm.

Embodiment

To illustrate the effectiveness of the LS-Apriori algorithm, the present invention uses a classic example of finding frequent itemsets with the Apriori algorithm. The transaction databases are shown in Tables 1 to 4; each sample database contains 9 transactions.

Table 1: Itemsets of sample 1

Table 2: Itemsets of sample 2

TID	Item ID list
T100	I₂, I₅
T200	I₁, I₄
T300	I₁, I₃, I₅
T400	I₁, I₂, I₅
T500	I₂, I₃, I₅
T600	I₁, I₃
T700	I₂, I₄
T800	I₁, I₃, I₄
T900	I₁, I₂, I₄

Table 3: Itemsets of sample 3

TID	Item ID list
T100	I₁, I₅
T200	I₂, I₅
T300	I₂, I₃, I₅
T400	I₁, I₃, I₄
T500	I₁, I₂, I₅
T600	I₄, I₅
T700	I₂, I₃
T800	I₁, I₂, I₃, I₄
T900	I₁, I₂

Table 4: Itemsets of sample 4

TID	Item ID list
T100	I₂, I₃, I₄
T200	I₂, I₅
T300	I₂, I₃, I₄
T400	I₁, I₃, I₅
T500	I₁, I₂, I₄
T600	I₃, I₅
T700	I₂, I₄
T800	I₁, I₂, I₃, I₅
T900	I₁, I₅

The support count in Table 1 is the product of the support and the total number of transactions. The LS-Apriori algorithm is applied to the data in Tables 1 to 4 to find the frequent itemsets; the flow is shown in Fig. 2, and Fig. 1 shows the process by which LS-Apriori finds the frequent itemsets. Each sample has 5 candidate 1-itemsets, and the calculated average support counts of the local samples and the overall sample are given in Table 5.

Table 5: Average support counts of the local samples and the overall sample

Sample	S₁	S₂	S₃	S₄	S
Average support count	100/36	92/36	92/36	96/36	95/36
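The Table 5 entries for S₂, S₃ and S₄ can be checked against the transaction lists of samples 2 to 4. The sketch below assumes one possible reading of the table: each entry equals the sum of the support counts of the five candidate 1-itemsets (that is, the total number of item occurrences in the sample) divided by the 9 transactions of that sample. This reading is an assumption that happens to reproduce the listed values; the text does not state the formula. The helper `parse` and the compact digit encoding of the transactions are illustrative only.

```python
from fractions import Fraction

def parse(rows):
    """'25 14' -> [frozenset({'I2', 'I5'}), frozenset({'I1', 'I4'})]."""
    return [frozenset("I" + d for d in row) for row in rows.split()]

samples = {
    "S2": parse("25 14 135 125 235 13 24 134 124"),    # sample 2
    "S3": parse("15 25 235 134 125 45 23 1234 12"),    # sample 3
    "S4": parse("234 25 234 135 124 35 24 1235 15"),   # sample 4
}
table5 = {"S2": Fraction(92, 36), "S3": Fraction(92, 36), "S4": Fraction(96, 36)}

results = {}
for name, transactions in samples.items():
    # Summing the support counts of all 1-itemsets is the same as counting
    # every item occurrence in the sample.
    total_count = sum(len(t) for t in transactions)
    results[name] = Fraction(total_count, len(transactions))
    print(name, total_count, results[name], results[name] == table5[name])
```

Under the same reading, the overall entry 95/36 would correspond to 95 item occurrences over all 36 transactions, which would imply 25 occurrences in sample 1 (100/36 = 25/9). The data of sample 1 are not reproduced here, so that part is not checked.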

According to the properties of the LS-Apriori algorithm, the average support counts of samples S₁ and S₄ are greater than that of the overall sample S, so the competition principle is used as their selection principle. The average support counts of samples S₂ and S₃ are less than that of the overall sample S, so the total-count principle is used as their selection principle. The Apriori algorithm finds exactly 11 frequent itemsets; to obtain a round number, the final number of 1-itemsets is set to 12, of which 6 are found by competition and 6 by total count. Because S₂ is the only sample that is one itemset short under the Apriori algorithm, the additional itemset is assigned to sample S₂. There are 12 candidate 2-itemsets, and the Apriori algorithm with a mean threshold finds exactly 8 frequent itemsets. Samples S₂ to S₄ use the total-count selection principle and should have 4 frequent itemsets; S₁ uses the competition principle, but because S₁ has only 3 itemsets, samples S₂ to S₄ are finally determined to have 5 frequent itemsets.

Claims

1. A frequent itemset data mining method that takes local samples into account, being the LS-Apriori algorithm, a frequent itemset data mining algorithm based on the Apriori property, characterized in that, under the competition principle, itemsets are ranked by support from high to low and accepted or rejected according to a cut-off count; under the total-count principle, the itemsets of each sample are accepted or rejected according to a percentage; and the specific method and steps are:

(1) Generation of the frequent 1-itemsets of the overall data sample: recombine the data samples; from the overall data sample, calculate the support of the candidate 1-itemsets C₁ and their average support ZS₁; determine the frequent 1-itemsets L₁, and record the number of itemsets in L₁ as M₁;

(2) Calculation of the average 1-itemset threshold of the local data samples: calculate the average 1-itemset support JS₁ from each local data sample; if JS₁ ≥ ZS₁, redetermine the frequent itemsets according to the competition principle; if JS₁ < ZS₁, the local sample mean is lower than the overall sample mean, which indicates that this local sample is a weakly supported sample, and, to take it into account, the frequent itemsets are redetermined according to the total-count principle, with the total count taken as M₁/2;

(3) Generation of the frequent k-itemsets of the overall data sample: recombine the data samples; in step k, generate the candidate k-itemsets Cₖ with Apriori_gen from the frequent (k-1)-itemsets Lₖ₋₁ obtained in step k-1; from the overall data sample, calculate the support of the candidate k-itemsets Cₖ and their average support ZSₖ; determine the frequent k-itemsets Lₖ, and record the number of itemsets in Lₖ as Mₖ;

(4) Calculation of the average k-itemset threshold of the local data samples: calculate the average k-itemset support JSₖ from each local data sample; if JSₖ ≥ ZSₖ, redetermine the frequent k-itemsets according to the competition principle; if JSₖ < ZSₖ, redetermine the frequent k-itemsets according to the total-count principle, with the total count taken as Mₖ/2.