CN106777182A

CN106777182A - A data stream high-utility itemset mining algorithm that reduces candidate itemsets

Info

Publication number: CN106777182A
Application number: CN201611202991.7A
Authority: CN
Inventors: 陈涛
Original assignee: Shaanxi University of Technology
Current assignee: Shaanxi University of Technology
Priority date: 2016-12-23
Filing date: 2016-12-23
Publication date: 2017-05-31

Abstract

The invention discloses a data stream high-utility itemset mining algorithm that reduces candidate itemsets. First, a global tree is built by a single scan of the current window in the data stream, and the redundant utility values of the header table entries and nodes in the global tree are reduced. Then, candidate patterns are generated from the global tree, and the candidate itemset utilities of the local trees are reduced using a growth algorithm. Within the candidate itemset utilities, following the order of the transaction set, the transaction-weighted utility of item i_j in the k-th transaction is accumulated in turn as the total transaction-weighted utility of node i_j; the pre-large transaction-weighted utility itemsets are processed, and the pre-large items are added to the tree. Next, an upper transaction utility threshold and a lower transaction utility threshold are introduced, and PTUV^D stores the pre-large transaction-weighted utility itemsets of the data set. Finally, the actual utilities are calculated to determine the final high-utility itemsets. Experimental results on real data streams show that the time and space efficiency and the memory usage of the invention are better than those of other data stream high-utility pattern mining algorithms.

Description

A data stream high-utility itemset mining algorithm that reduces candidate itemsets

Technical field

The invention belongs to the technical field of data mining, and more specifically relates to a data stream high-utility itemset mining algorithm that reduces candidate itemsets.

Background art

With the rapid development of cloud computing, big data and the Internet, every aspect of daily life now depends on computer technology to store, mine and analyze data. The data we receive is no longer just small-scale data from within a single system, but a vast, interconnected body of information spanning many industries, and obtaining knowledge and information from the large-scale data being produced is a major challenge. The operations of traditional information systems, such as inserting, deleting, querying, updating and counting data, are becoming outdated in today's fast-changing society. What meets the needs of the times is research into techniques that mine and analyze massive stored data, discover the latent information in it quickly and effectively, and use the mined information to give managers or decision makers predictive knowledge and improve resource utilization. Research on knowledge discovery in databases and on its core technique, data mining (DM), therefore emerged and developed rapidly. Data mining is the process of extracting implicit, previously unknown but potentially useful information and knowledge from large amounts of incomplete, noisy, fuzzy and random real application data. DM techniques are now used in many fields, including manufacturing, retail, finance, health care, engineering and science, and are also widely applied in areas such as behavior recommendation and online public opinion monitoring systems.

Association rule mining, a very important research branch of data mining, has been studied extensively. It mainly mines the degree of association between itemsets, and its core is frequent itemset mining. In 1993 Agrawal et al. first proposed the concept of association rules after thoroughly studying Wal-Mart's supermarket basket data, and association rules have since been applied in many industries. On online shopping platforms (such as Tmall and Dangdang), for example, the mined association rules can predict customers' purchasing patterns and preferences, so that each customer can be offered a personalized shopping experience. However, association rule mining analyzes only the strength of association between goods and ignores other factors such as the quantity and profit of the items, so itemsets that occur rarely but have high utility are overlooked. To solve this problem, researchers proposed high utility itemset mining, which adds the quantity and profit of items to the association rule model: an itemset is called a high-utility itemset when its total utility is larger than a predefined utility threshold.
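
The definition above can be illustrated with a minimal sketch. The transactions, the profit table and the function names are hypothetical and chosen only for illustration; the comparison is written as `>=`, the common convention, which is an assumption, since the text says only that the total utility must be larger than the threshold.

```python
from typing import Dict, Iterable, List

# Hypothetical toy data: each transaction maps an item to its purchased quantity.
transactions: List[Dict[str, int]] = [
    {"a": 1, "b": 2},
    {"a": 2, "c": 1},
    {"b": 1, "c": 3},
    {"a": 1, "b": 1, "c": 1},
]
# Hypothetical external utility (unit profit) of each item.
profit: Dict[str, int] = {"a": 5, "b": 2, "c": 1}


def itemset_utility(itemset: Iterable[str],
                    transactions: List[Dict[str, int]],
                    profit: Dict[str, int]) -> int:
    """Sum quantity * profit over every transaction containing the whole itemset."""
    items = list(itemset)
    total = 0
    for t in transactions:
        if all(item in t for item in items):
            total += sum(t[item] * profit[item] for item in items)
    return total


def is_high_utility(itemset: Iterable[str],
                    transactions: List[Dict[str, int]],
                    profit: Dict[str, int],
                    min_util: int) -> bool:
    """Assumed convention: an itemset qualifies when its utility reaches min_util."""
    return itemset_utility(itemset, transactions, profit) >= min_util


if __name__ == "__main__":
    print(itemset_utility(["a", "b"], transactions, profit))
    print(is_high_utility(["c"], transactions, profit, 15))
```

In this toy data the itemset {a, b} appears in only two of four transactions yet carries more utility than the more frequent single item c, which is the kind of pattern that frequency-only mining would miss.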

However, with the rapid development of database and network technology, the sharp growth in stored data means that data is no longer static but accumulates and changes continuously. Examples include the sales data of online platforms, the call records of China Unicom and China Mobile, and real-time traffic monitoring data. Unlike in traditional association rule mining, the data in the transaction set changes over time, and the updated data matters more than the earlier data. Correctly accounting for these changes while quickly and effectively mining truly valuable knowledge and information places stricter requirements and challenges on association rule mining. Traditional batch frequent itemset mining algorithms can produce new association itemsets only by rescanning the updated database. The FUP algorithm was proposed in the prior art to address the need to scan the updated database frequently when the newly added transaction set is smaller than the original transaction set. The concept of pre-large itemsets was then combined with the FP-tree to design the prelarge-tree structure for efficient incremental mining, and the concepts of decremental mining and modification mining were proposed later. Subsequently, utility values were taken into account on top of incremental association rule mining: using the downward closure property of transaction-weighted utility (TWU), continual improvements were built on the FUP algorithm and the pre-large itemset concept. For example, Lin et al. proposed the FUP-HU algorithm, which performs incremental high-utility mining based on the FUP algorithm; however, when an itemset has low transaction-weighted utility in the original data set but high transaction-weighted utility in the updated data set, the updated database must still be rescanned. In view of this, the Pre-HU algorithm introduced the Two-Phase algorithm and the pre-large concept into utility mining, using the downward closure property of transactions to reduce the time spent scanning the database.
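
The downward closure property of TWU mentioned above can be sketched as follows. This is an illustrative sketch with hypothetical data and function names, not code from any of the cited algorithms: the TWU of an itemset is the sum of the utilities of all transactions that contain it, so it never falls below the itemset's actual utility and never grows when the itemset is extended.

```python
from typing import Dict, Iterable, List

Transaction = Dict[str, int]  # item -> quantity

# Hypothetical toy data, for illustration only.
transactions: List[Transaction] = [
    {"a": 1, "b": 2},
    {"a": 2, "c": 1},
    {"b": 1, "c": 3},
    {"a": 1, "b": 1, "c": 1},
]
profit: Dict[str, int] = {"a": 5, "b": 2, "c": 1}


def transaction_utility(t: Transaction, profit: Dict[str, int]) -> int:
    """Utility of a whole transaction: sum of quantity * profit over its items."""
    return sum(q * profit[i] for i, q in t.items())


def twu(itemset: Iterable[str], transactions: List[Transaction],
        profit: Dict[str, int]) -> int:
    """Transaction-weighted utility: total utility of the transactions containing the itemset."""
    items = list(itemset)
    return sum(transaction_utility(t, profit)
               for t in transactions if all(i in t for i in items))


def utility(itemset: Iterable[str], transactions: List[Transaction],
            profit: Dict[str, int]) -> int:
    """Actual utility: only the itemset's own items are counted in each transaction."""
    items = list(itemset)
    return sum(sum(t[i] * profit[i] for i in items)
               for t in transactions if all(i in t for i in items))


if __name__ == "__main__":
    # TWU over-estimates the actual utility and shrinks as the itemset grows.
    print(twu(["a"], transactions, profit), twu(["a", "b"], transactions, profit))
    print(utility(["a", "b"], transactions, profit))
```

Because TWU is an upper bound that is anti-monotone, an itemset whose TWU is below the threshold can be discarded together with all of its supersets, which is what lets two-phase methods prune candidates before computing actual utilities.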

Although these incremental high-utility algorithms improve update efficiency and effectively reduce the number of scans of the original database, they still generate a large number of useless candidate itemsets and are suitable only for handling insertions into the transaction database. When transactions in the original database change (for example, are deleted or modified), the updated database must still be rescanned. The present invention effectively reduces the number of candidate itemsets, handles both the insertion and the modification of transactions, and completes dynamic utility mining tasks efficiently, which meets the new requirements now placed on utility mining.

Summary of the invention

The purpose of the invention is to overcome the shortcomings of the prior art by proposing a data stream high-utility itemset mining algorithm that reduces candidate itemsets.

To achieve the above purpose, the invention provides the following technical solution:

A data stream high-utility itemset mining algorithm that reduces candidate itemsets, comprising the following steps:

S1. First, a global tree is built by a single scan of the current window in the data stream, and the redundant utility values of the header table entries and nodes in the global tree are reduced.

S2. Then, candidate patterns are generated from the global tree, and the candidate itemset utilities of the local trees are reduced using a growth algorithm.

S3. Within the candidate itemset utilities, following the order of the transaction set, the transaction-weighted utility of item i_j in the k-th transaction is accumulated in turn as the total transaction-weighted utility of node i_j; at the same time, the prefix of item i_j is added to the prefix itemset linked list of node i_j. The pre-large transaction-weighted utility itemsets are processed, and the pre-large items are added to the tree.

S4. Then, an upper transaction utility threshold and a lower transaction utility threshold are introduced to divide the range of transaction-weighted utility into three levels, which are processed correspondingly in the original transaction set and the newly added transaction set. HTWU^D stores the high transaction-weighted utility itemsets of the data set, and PTUV^D stores the pre-large transaction-weighted utility itemsets of the data set.

S5. Finally, the actual utilities are calculated to determine the final high-utility itemsets.
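
The three-level division in step S4 can be sketched as below. This is a minimal illustration under stated assumptions, not the patented procedure: the function names, the toy TWU values and the use of fixed thresholds across the update are all hypothetical, and the dictionaries `high` and `pre_large` merely play the roles that the text assigns to HTWU^D and PTUV^D.

```python
from typing import Dict, Tuple

TwuTable = Dict[str, int]  # item -> transaction-weighted utility


def partition_by_twu(twu: TwuTable, upper: int,
                     lower: int) -> Tuple[TwuTable, TwuTable, TwuTable]:
    """Split items into high, pre-large and low levels using two thresholds."""
    high = {i: v for i, v in twu.items() if v >= upper}
    pre_large = {i: v for i, v in twu.items() if lower <= v < upper}
    low = {i: v for i, v in twu.items() if v < lower}
    return high, pre_large, low


def merge_increment(stored: TwuTable, delta: TwuTable) -> TwuTable:
    """Add the TWU contributed by newly arrived transactions to the stored values."""
    merged = dict(stored)
    for item, value in delta.items():
        merged[item] = merged.get(item, 0) + value
    return merged


# Hypothetical TWU values for the original transaction set.
original_twu: TwuTable = {"a": 28, "b": 22, "c": 24, "d": 6}
# Hypothetical TWU contributed by the newly added transactions.
new_twu: TwuTable = {"b": 9, "d": 3}

# Assumption: thresholds are kept fixed here for simplicity.
UPPER, LOWER = 25, 20

high, pre_large, low = partition_by_twu(original_twu, UPPER, LOWER)
updated = merge_increment(original_twu, new_twu)
high2, pre_large2, low2 = partition_by_twu(updated, UPPER, LOWER)

if __name__ == "__main__":
    print(sorted(high), sorted(pre_large), sorted(low))
    print(sorted(high2), sorted(pre_large2), sorted(low2))
```

Keeping the pre-large level is what makes the update cheap in this sketch: item b is promoted from pre-large to high using only its stored TWU plus the increment, with no rescan of the original transactions.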

Preferably, the global tree is built as follows:

A. First, the change in transaction-weighted utility of each itemset in the changed transactions is calculated.

B. Then, the itemsets are divided into high, pre-large and low transaction-weighted utility according to their item frequency in the original database, and the PreHU-tree is constructed.

C. Finally, the frequent n-itemsets are determined directly by searching the transaction-weighted utility and the prefix itemset linked list of each node of the PreHU-tree.

D. The changed high-utility itemsets are mined by combining the itemset supports in the prefix itemset linked lists with the external utilities of the items.
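
The role of the per-node TWU and prefix itemset linked list in steps A to D can be sketched in a simplified form. The sketch below is an assumption-laden illustration, not the PreHU-tree itself: it keeps one flat record per item instead of a real tree, uses a Python list in place of a linked list, and all names (`ItemNode`, `build_item_nodes`, `twu_from_prefixes`) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class ItemNode:
    """Simplified stand-in for a tree node: accumulated TWU plus a prefix list."""
    item: str
    twu: int = 0
    # Each entry: (items preceding this item in the transaction, transaction utility).
    prefixes: List[Tuple[Tuple[str, ...], int]] = field(default_factory=list)


def build_item_nodes(transactions: List[Dict[str, int]],
                     profit: Dict[str, int],
                     order: List[str]) -> Dict[str, ItemNode]:
    """Single pass: accumulate each item's TWU and record its prefix in every transaction."""
    nodes: Dict[str, ItemNode] = {}
    for t in transactions:
        tu = sum(q * profit[i] for i, q in t.items())  # transaction utility
        items = [i for i in order if i in t]           # items in the fixed order
        for pos, item in enumerate(items):
            node = nodes.setdefault(item, ItemNode(item))
            node.twu += tu
            node.prefixes.append((tuple(items[:pos]), tu))
    return nodes


def twu_from_prefixes(node: ItemNode, prefix_items: Set[str]) -> int:
    """TWU of prefix_items plus node.item, read from the prefix list without a rescan."""
    return sum(tu for prefix, tu in node.prefixes if prefix_items <= set(prefix))


# Hypothetical toy data, for illustration only.
transactions = [
    {"a": 1, "b": 2},
    {"a": 2, "c": 1},
    {"b": 1, "c": 3},
    {"a": 1, "b": 1, "c": 1},
]
profit = {"a": 5, "b": 2, "c": 1}
nodes = build_item_nodes(transactions, profit, order=["a", "b", "c"])

if __name__ == "__main__":
    print(nodes["b"].twu, twu_from_prefixes(nodes["b"], {"a"}))
```

The design point this illustrates is that once prefixes are stored with each node, the TWU of a longer itemset such as {a, b} or {a, b, c} can be looked up from the node of its last item rather than recomputed from the transactions.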

Preferably, the redundant utility reduction algorithm is as follows:

A. A conditional pattern base is built for each item in the header table of the global HUS-tree. Each resulting partition of the search space excludes the information of the items below it in the header table, so candidate patterns generated from a conditional pattern base do not need to include the utility information of the items below.

B. Suppose S = {i_1 < i_2 < ... < i_m} is the current ordering, where i_1 and i_m are the top and bottom items of the global tree's header table respectively. Suppose item i_p is selected from the header table for mining and a conditional pattern base is built for it. The conditional pattern base contains only the items that precede i_p in the ordering, i_1, i_2, ..., i_{p-1}, so the utilities of the items that follow i_p do not need to be added.
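
One way to read steps A and B is sketched below: when the conditional pattern base of item i_p is built, each transaction contributes only the utilities of i_p and of the items that precede it in the ordering, so the resulting bound is tighter than the full transaction utility. This is an illustrative interpretation with hypothetical names and data; it works directly on transactions rather than on an actual HUS-tree.

```python
from typing import Dict, List, Tuple

PatternBase = List[Tuple[Tuple[str, ...], int]]  # (prefix items, reduced utility)


def conditional_pattern_base(transactions: List[Dict[str, int]],
                             profit: Dict[str, int],
                             order: List[str],
                             target: str) -> PatternBase:
    """Build the pattern base of `target`, dropping the utility of items after it."""
    p = order.index(target)
    keep = order[:p + 1]  # i_1 ... i_p; items i_{p+1} ... i_m are excluded
    base: PatternBase = []
    for t in transactions:
        if target not in t:
            continue
        kept = [i for i in keep if i in t]
        reduced = sum(t[i] * profit[i] for i in kept)  # redundant utility removed
        base.append((tuple(kept[:-1]), reduced))       # prefix excludes the target itself
    return base


# Hypothetical toy data, for illustration only.
transactions = [
    {"a": 1, "b": 2},
    {"a": 2, "c": 1},
    {"b": 1, "c": 3},
    {"a": 1, "b": 1, "c": 1},
]
profit = {"a": 5, "b": 2, "c": 1}
order = ["a", "b", "c"]  # assumed header table order, top to bottom

base_b = conditional_pattern_base(transactions, profit, order, "b")
reduced_bound_b = sum(u for _, u in base_b)
full_twu_b = sum(sum(q * profit[i] for i, q in t.items())
                 for t in transactions if "b" in t)

if __name__ == "__main__":
    print(base_b)
    print(reduced_bound_b, full_twu_b)
```

In this toy data the reduced bound for item b is smaller than its full TWU because the utility of item c, which lies below b in the assumed ordering, is no longer counted; a tighter bound means fewer itemsets survive as candidates.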

Technical effects and advantages of the invention: the invention provides a data stream high-utility itemset mining algorithm that reduces candidate itemsets. First, a global tree is built by a single scan of the current window in the data stream, and the redundant utility values of the header table entries and nodes in the global tree are reduced. Then, candidate patterns are generated from the global tree, and the candidate itemset utilities of the local trees are reduced using a growth algorithm. Finally, the high-utility patterns are selected from the candidate patterns. Experimental results on real data streams show that the time and space efficiency and the memory usage of the invention are better than those of other data stream high-utility pattern mining algorithms.

Detailed description of the embodiments

To make the purpose, technical solution and advantages of the invention clearer, the invention is described in further detail below with reference to specific embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the invention without creative effort fall within the scope of protection of the invention.

Specifically, the global tree is built as follows:

Specifically, the redundant utility reduction algorithm is as follows:

In summary: the invention provides a data stream high-utility itemset mining algorithm that reduces candidate itemsets. First, a global tree is built by a single scan of the current window in the data stream, and the redundant utility values of the header table entries and nodes in the global tree are reduced. Then, candidate patterns are generated from the global tree, and the candidate itemset utilities of the local trees are reduced using a growth algorithm. Finally, the high-utility patterns are selected from the candidate patterns. Experimental results on real data streams show that the time and space efficiency and the memory usage of the invention are better than those of other data stream high-utility pattern mining algorithms.

Finally, it should be noted that the above are only preferred embodiments of the invention and are not intended to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, a person skilled in the art may still modify the technical solutions described in those embodiments or make equivalent substitutions for some of their technical features. Any modification, equivalent substitution or improvement made within the spirit and principles of the invention shall fall within the scope of protection of the invention.

Claims

1. A data stream high-utility itemset mining algorithm that reduces candidate itemsets, characterized by comprising the following steps:

S3. Within the candidate itemset utilities, following the order of the transaction set, the transaction-weighted utility of item i_j in the k-th transaction is accumulated in turn as the total transaction-weighted utility of node i_j; at the same time, the prefix of item i_j is added to the prefix itemset linked list of node i_j. The pre-large transaction-weighted utility itemsets are processed, and the pre-large items are added to the tree.

2. The data stream high-utility itemset mining algorithm that reduces candidate itemsets according to claim 1, characterized in that the global tree is built as follows:

C. Finally, the frequent n-itemsets are determined directly by searching the transaction-weighted utility and the prefix itemset linked list of each node of the PreHU-tree.

3. The data stream high-utility itemset mining algorithm that reduces candidate itemsets according to claim 1, characterized in that the redundant utility reduction algorithm is as follows: