CN105930457A

CN105930457A - Distributed architecture-based data flow frequent item mining method

Info

Publication number: CN105930457A
Application number: CN201610254621.1A
Authority: CN
Inventors: 张玉; 徐敬东; 张建忠; 于博文; 陈正阳
Original assignee: Nankai University
Current assignee: Nankai University
Priority date: 2016-04-21
Filing date: 2016-04-21
Publication date: 2016-09-07

Abstract

The invention discloses a method for mining frequent items in data streams based on a distributed architecture. The method adopts a two-layer tree-shaped communication structure comprising m leaf nodes and 1 root node. Each leaf node processes the data items in its data stream and sends a frequency increment to the root node when the increment of a data item's frequency exceeds a threshold; the root node collects the updates sent by the leaf nodes. The method has low communication overhead and can respond in real time to frequent item queries initiated by users.

Description

Data stream frequent item mining method based on a distributed architecture

Technical field

The invention belongs to the technical field of data mining and relates to a frequent item mining method, in particular to a frequent item mining method suitable for a distributed architecture.

Background art

In data mining problems such as association rule mining, sequential pattern mining, correlation mining, and multi-level pattern mining, frequent item mining is both a basic step and a key step. Frequent item mining over a data stream on a single node has been studied extensively. These studies fall roughly into two classes: counter-based methods and sketch-based methods.

A counter-based frequent item mining method maintains a group of counters that record the frequencies of data items. Each counter holds two parameters: the data item name and the data item frequency. When a data item arrives in the data stream, if the maintained counters already include this data item, the frequency recorded by that counter is increased accordingly; otherwise, a new counter is added to store the new data item, or an existing counter is replaced. The cost of processing a single data item is usually low for counter-based methods, but all counters need to be sorted. Counter-based frequent item mining methods include Frequent, Space Saving, and Lossy Counting.
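As an illustration of the counter-based idea, the following is a minimal sketch in the style of Space Saving, extended to weighted data items. The function name `space_saving`, the `capacity` parameter, and the use of a plain dictionary instead of a sorted structure are assumptions made for brevity; they are not taken from the methods cited above.

```python
def space_saving(stream, capacity):
    """Track at most `capacity` counters over a stream of (name, weight) pairs."""
    counters = {}  # data item name -> estimated frequency
    for name, weight in stream:
        if name in counters:
            # The data item is already tracked: increase its counter.
            counters[name] += weight
        elif len(counters) < capacity:
            # Free space remains: add a new counter.
            counters[name] = weight
        else:
            # Replace the smallest counter; the newcomer inherits its count.
            victim = min(counters, key=counters.get)
            counters[name] = counters.pop(victim) + weight
    return counters


stream = [("a", 3), ("b", 1), ("a", 2), ("c", 4), ("d", 1)]
result = space_saving(stream, capacity=3)
```

Because a replaced counter passes its count to the new data item, an estimate may exceed the true frequency but never falls below it.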

Sketch-based methods use a hash table composed of a one-dimensional or two-dimensional array of counters to estimate the frequency of each data item in the data stream. Such methods generally use hashing to map each data item to several corresponding counters, and a single counter may be shared by many data items; that is, data items with the same hash value share the same counter. When a data item arrives, only the values of the corresponding counters need to be modified. When a user submits a query, the values of the corresponding counters are used to estimate the frequency, and the error can be kept within a certain range with high confidence. In general, hash-based methods need the support of additional data structures, such as a heap that tracks candidate frequent items. Sketch-based frequent item mining methods include CountSketch, CountMin Sketch, and hCount.
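The sketch-based idea can be illustrated with a small Count-Min style structure. This is a hedged sketch only: the class name, the `width` and `depth` parameters, and the choice of salted BLAKE2b from Python's `hashlib` as the hash family are assumptions for illustration.

```python
import hashlib


class CountMinSketch:
    """Two-dimensional counter array; each row uses a differently salted hash."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, name):
        # Salt the hash with the row number so that rows behave independently.
        salt = row.to_bytes(8, "little")
        digest = hashlib.blake2b(name.encode(), salt=salt).digest()
        return int.from_bytes(digest[:8], "little") % self.width

    def add(self, name, weight=1):
        # Only the counters the data item maps to are modified.
        for row in range(self.depth):
            self.table[row][self._index(row, name)] += weight

    def estimate(self, name):
        # Shared counters can only inflate a value, so take the minimum.
        return min(self.table[row][self._index(row, name)]
                   for row in range(self.depth))


cms = CountMinSketch(width=64, depth=4)
cms.add("x", 5)
```

Since counters are shared, `estimate` never underestimates a frequency; collisions can only push it upward.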

Traditional frequent item mining methods mostly consider only the processing of data streams on a single node. In many practical applications, however, the data to be mined is large-scale and distributed, for example in flow detection for network monitoring and in DDoS attack detection. Current research on frequent item mining in distributed stream environments mainly includes the Tributary-Delta method proposed by A. Manjhi, which is suitable for tree or multi-path graph topologies; an approximation method proposed by G. Cormode that continuously tracks frequent items in high-speed distributed streams; and the distributed top-k monitoring method proposed by B. Babcock and C. Olston. The main deficiencies of these methods are excessive communication overhead and lack of support for real-time queries.

Summary of the invention

The purpose of the present invention is to solve the problems of excessive communication overhead and lack of support for real-time queries in existing methods by providing a data stream frequent item mining method based on a distributed architecture, which can be used to improve the processing capability of traditional frequent item mining methods on a distributed architecture.

The invention provides a data stream frequent item mining method based on a distributed architecture; the method is an ε-approximation method for weighted distributed data stream frequent item mining. The method uses a 2-layer tree-shaped communication structure comprising m leaf nodes and 1 root node. Each leaf node is responsible for processing the data items in its data stream, storing the data item frequencies in the min-heap of the leaf node, and sending a data item frequency increment to the root node when that increment exceeds a threshold. The root node is responsible for computing the frequency estimate of each data item over the whole architecture and storing the data item frequency estimates in the min-heap of the root node. Each entry stored in the min-heap of a leaf node includes a data item name, a data item frequency, and a data item frequency increment; each entry stored in the min-heap of the root node includes a data item name and a data item frequency estimate.

Technical solution of the present invention:

Step 1): each leaf node i takes data items one by one from the data stream it receives; each data item includes a data item name v_t and a data item frequency c_{v,t};

Step 2): update the sum of data item frequencies of the leaf node, N_i = N_i + c_{v,t}, and the increment of the sum of data item frequencies, Δ_i = Δ_i + c_{v,t}, where the equal sign denotes assignment, here and below;

Step 3): according to the data item name v_t and the data item frequency c_{v,t} of the data item taken in step 1), find the appropriate entry in the min-heap H_i of the leaf node and assign the data item name, data item frequency, and data item frequency increment of that entry; this step includes:

Step 3-1): determine whether the min-heap of the leaf node contains an entry whose data item name is v_t; if no such entry exists, perform the next step; otherwise, perform step 3-5);

Step 3-2): determine whether the min-heap H_i of the leaf node is full; if it is full, perform the next step; otherwise, perform step 3-4);

Step 3-3): take the entry item_min with the smallest data item frequency from the min-heap H_i of the leaf node, reassign this entry, and then perform step 4); the assignment is:

Let v = v_t, c_v = c_v + c_{v,t}, Δ_v = c_{v,t};

where v denotes the data item name of the entry taken, v_t denotes the data item name of the data item taken, c_v denotes the data item frequency of the entry taken, c_{v,t} denotes the data item frequency of the data item taken, and Δ_v denotes the data item frequency increment of the entry taken;

Step 3-4): create a new entry, assign values to it, insert it into the min-heap H_i of the leaf node, and then perform step 4); the assignment is:

Let v = v_t, c_v = c_{v,t}, Δ_v = c_{v,t};

Step 3-5): take the existing entry whose data item name is v_t from the min-heap H_i of the leaf node, update it, and then perform step 4); the update is:

Let c_v = c_v + c_{v,t}, Δ_v = Δ_v + c_{v,t};

Step 4): determine whether the increment of the sum of data item frequencies and the data item frequency increment of the entry exceed the threshold; if so, send updates to the root node; this step includes:

Step 4-1): determine whether the increment Δ_i of the sum of data item frequencies satisfies Δ_i > β_i N_i; if so, perform the next step; otherwise, perform step 4-3); where

β_i denotes the user-defined update delay coefficient of the leaf node, and N_i denotes the sum of data item frequencies of the leaf node;

Step 4-2): the leaf node sends a 0-msg update to the root node and then sets Δ_i to 0; where

the content of the 0-msg update includes the increment Δ_i of the sum of data item frequencies;

Step 4-3): determine whether the data item frequency increment Δ_v of the entry satisfies Δ_v > β_i N_i; if so, perform the next step; otherwise, perform step 5);

Step 4-4): the leaf node sends a data item update to the root node and then sets Δ_v to 0; where

the content of the data item update includes the data item name of the entry and the data item frequency increment Δ_v of the entry;

Step 5): the root node takes the updates sent by the leaf nodes one by one and maintains the corresponding data according to each update taken; this step includes:

Step 5-1): determine the type of the update taken by the root node; if it is a 0-msg update, perform the next step; if it is a data item update, perform step 5-3);

Step 5-2): update the estimate of the sum of data item frequencies at the root node, N = N + Δ_i, where the equal sign denotes assignment, and then perform step 6); where

N denotes the estimate of the sum of data item frequencies at the root node, and Δ_i denotes the increment carried by the 0-msg update sent by the leaf node;

Step 5-3): update the min-heap H₀ of the root node; this step includes:

Step 5-3-1): take the data item name v_t and the data item frequency increment Δ_{v,t} from the update sent by the leaf node;

Step 5-3-2): determine whether the min-heap of the root node contains an entry item_v whose data item name is v_t; if it exists, perform the next step; otherwise, perform step 5-3-4);

Step 5-3-3): take the entry item_v, update it, and then perform step 6); the update is:

Let c_v = c_v + Δ_{v,t}, where the equal sign denotes assignment;

where v denotes the data item name of the entry taken, c_v denotes the data item frequency of the entry taken, and Δ_{v,t} denotes the data item frequency increment of the data item update taken;

Step 5-3-4): determine whether the min-heap H₀ of the root node is full; if it is full, perform the next step; otherwise, perform step 5-3-6);

Step 5-3-5): take the entry item_min with the smallest data item frequency from the min-heap H₀ of the root node, reassign this entry, and then perform step 6); the assignment is:

Let v = v_t, c_v = c_v + Δ_{v,t};

where v_t denotes the data item name of the data item update taken;

Step 5-3-6): create a new entry, assign values to it, insert it into the min-heap maintained by the root node, and then perform step 6); the assignment is:

Let v = v_t, c_v = Δ_{v,t};

Step 6): at the request of a user, the root node traverses the min-heap H₀ and returns all entries whose data item frequency meets the output threshold; these entries are the frequent items to be mined.

Between step 4) and step 5), the present invention further includes an operation of sorting the min-heap H_i of the leaf node according to data item frequency.

Likewise, between step 5) and step 6), the method further includes an operation of sorting the min-heap H₀ of the root node according to data item frequency.

Advantages and beneficial effects of the present invention:

For the frequent items output by the method provided by the present invention, the error between the estimated and the actual data item frequency is no more than εN, the maximum communication overhead on a single link is bounded, and real-time frequent item queries from users are supported.

Brief description of the drawings

Fig. 1 shows the communication structure of the data stream frequent item mining method based on a distributed architecture.

Fig. 2 shows the average relative error of the data stream frequent item mining method based on a distributed architecture.

Fig. 3 shows the single-link communication overhead of the data stream frequent item mining method based on a distributed architecture.

Fig. 4 shows the initialization time of the data stream frequent item mining method based on a distributed architecture.

Detailed description of the invention

To express the method of the present invention more clearly and intuitively, the details of the data stream frequent item mining method based on a distributed architecture are described below:

1. Determine parameters

The parameters to be determined for the distributed frequent item mining method include:

(1) the item support and the error degree ε;

(2) the number of leaf nodes m;

(3) the sizes of the min-heap H_i of each leaf node i and the min-heap H₀ of the root node, which are determined by the parameters α_i and α₀ respectively;

(4) the update delay coefficient β_i of each leaf node i (0 < β_i < ε).

In this embodiment, the number of leaf nodes is m = 8, and the min-heaps of the leaf nodes and the root node all have size 10000, i.e. α₀ = α_i = 0.0001; β_i ∈ [0.001, 0.005] and ε ∈ [0.0001, 0.0005].

2. Initialization

The initialization time of each leaf node i is determined, where B_i > 0 is the bandwidth of the link between leaf node i and the root node, in packets/second. When the method of the invention starts running, each leaf node i can process the data item updates it receives from data stream S_i, but it does not pass any message to the root node until the initialization state ends.

3. Leaf node processes the data items in the data stream

When a new data item (v, c_{v,t}) of data stream S_i arrives at leaf node i, the node first updates the sum of data item frequencies in S_i, N_i = N_i + c_{v,t} (the equal sign denotes assignment, here and below), and the increment of the sum of data item frequencies, Δ_i = Δ_i + c_{v,t}. It then updates the frequency of the corresponding data item: if v ∈ H_i, it increases the frequency estimate of the corresponding entry, c_v = c_v + c_{v,t}, and the frequency estimate increment of data item v, Δ_v = Δ_v + c_{v,t}; otherwise, it finds the entry item_min with the smallest frequency estimate in H_i, replaces the data item name of item_min with v, and updates its data item frequency, c_v = c_v + c_{v,t}, and its frequency estimate increment, Δ_v = c_{v,t}. Finally, it checks whether the conditions for sending updates to the root node are met: if Δ_i > β_i N_i, it sends the update (0, Δ_i) to the root node and resets Δ_i = 0; if Δ_v > β_i N_i, it sends the update (v, Δ_v) to the root node and resets Δ_v = 0.
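The leaf-node procedure can be sketched roughly as follows. This is an illustrative sketch, not the patented implementation: the class name `LeafNode`, the `send` callback, and the use of a dictionary scanned with `min` in place of a real min-heap are all assumptions. The case where H_i is not yet full follows step 3-4) of the technical solution.

```python
class LeafNode:
    def __init__(self, capacity, beta, send):
        self.capacity = capacity  # maximum number of entries in H_i
        self.beta = beta          # update delay coefficient beta_i
        self.send = send          # hypothetical callback: send(name, increment)
        self.total = 0            # N_i, sum of data item frequencies
        self.total_delta = 0      # Delta_i, unsent increment of N_i
        self.entries = {}         # name -> [c_v, Delta_v]; stands in for H_i

    def process(self, name, weight):
        self.total += weight
        self.total_delta += weight
        if name in self.entries:
            # Existing entry: raise both the frequency and its increment.
            entry = self.entries[name]
            entry[0] += weight
            entry[1] += weight
        elif len(self.entries) < self.capacity:
            # H_i is not full: create a new entry.
            entry = self.entries[name] = [weight, weight]
        else:
            # H_i is full: replace the entry with the smallest frequency.
            victim = min(self.entries, key=lambda k: self.entries[k][0])
            count, _ = self.entries.pop(victim)
            entry = self.entries[name] = [count + weight, weight]
        threshold = self.beta * self.total
        if self.total_delta > threshold:
            self.send(0, self.total_delta)  # 0-msg update (0, Delta_i)
            self.total_delta = 0
        if entry[1] > threshold:
            self.send(name, entry[1])       # data item update (v, Delta_v)
            entry[1] = 0


sent = []
leaf = LeafNode(capacity=2, beta=0.5, send=lambda n, d: sent.append((n, d)))
for item in [("a", 4), ("b", 1), ("c", 3)]:
    leaf.process(*item)
```

In this run only the first data item crosses the threshold β_i N_i, so the leaf sends two messages in total and buffers the later increments locally.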

4. Root node processes the updates sent by leaf nodes

When the root node receives a data item update (v, Δ_v) sent by a leaf node, if v ∈ H₀, it increases the frequency estimate of the corresponding entry, c_v = c_v + Δ_v; otherwise, it finds the entry item_min with the smallest frequency estimate in H₀, replaces the data item name of item_min with v, and updates its data item frequency, c_v = c_v + Δ_v. When the root node receives a 0-msg update (0, Δ_i) sent by a leaf node, it updates its estimate of the sum of data item frequencies, N₀ = N₀ + Δ_i.
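A matching sketch of the root-node side is given below, under the same assumptions as before: the class name `RootNode` is hypothetical, a dictionary stands in for the min-heap H₀, and the name 0 marks a 0-msg update as in the notation (0, Δ_i). The branch for a heap that is not yet full follows step 5-3-6) of the technical solution.

```python
class RootNode:
    def __init__(self, capacity):
        self.capacity = capacity  # maximum number of entries in H_0
        self.total = 0            # N_0, estimated sum of data item frequencies
        self.entries = {}         # name -> c_v; stands in for the min-heap H_0

    def receive(self, name, delta):
        if name == 0:
            # 0-msg update (0, Delta_i): only the global sum changes.
            self.total += delta
        elif name in self.entries:
            # Known data item: add the increment to its estimate.
            self.entries[name] += delta
        elif len(self.entries) < self.capacity:
            # H_0 is not full: create a new entry.
            self.entries[name] = delta
        else:
            # H_0 is full: replace the entry with the smallest estimate.
            victim = min(self.entries, key=self.entries.get)
            self.entries[name] = self.entries.pop(victim) + delta


root = RootNode(capacity=2)
for update in [(0, 4), ("a", 4), ("b", 1), ("c", 3)]:
    root.receive(*update)
```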

5. Root node processes frequent item queries initiated by users

When a user submits a frequent item query to the root node, the root node scans each entry (v, c_v) maintained in the min-heap H₀. If c_v ≥ ωN₀, v is considered a frequent item and is output, where ω is the output threshold; its definition involves the user-defined item support, α_max = max(α_i), and β_max = max(β_i).
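The query step amounts to a single scan with the test c_v ≥ ωN₀. In the sketch below, the function name `frequent_items` is hypothetical and ω is passed in as a plain argument, because its defining formula is not reproduced in this text.

```python
def frequent_items(entries, total, omega):
    """Return the names whose estimated frequency is at least omega * total."""
    threshold = omega * total
    return sorted(name for name, count in entries.items() if count >= threshold)


# entries plays the role of H_0 and total the role of N_0.
answer = frequent_items({"a": 50, "b": 5, "c": 25}, total=100, omega=0.2)
```

The scan touches only the entries already held at the root, so no message to the leaf nodes is needed at query time.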

The present invention is implemented using real data and computer simulation.

The present invention selects network traffic data sets collected in 3 real network environments as the data sources of the embodiment. The 3 data sets are: the CERNET data set, a bidirectional TCP data set collected on an OC-48 link of CERNET (China Education and Research Network); the CAIDA48 data set, an anonymized data set collected on an OC-48 west coast peering link; and the CAIDA192 data set, a unidirectional anonymized data set collected on an OC-192 link. The invention defines the five-tuple of each IP packet in the network traffic data sets (source IP address, destination IP address, source port, destination port, transport layer protocol) as the data item name and the length of the packet payload as the data item frequency.
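The mapping from a packet to a weighted data item can be written down directly. The `Packet` record and its field names below are assumptions for illustration; only the five-tuple and the payload length come from the description above.

```python
from typing import NamedTuple


class Packet(NamedTuple):
    # Hypothetical parsed packet record; field names are illustrative.
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str
    payload_len: int


def to_data_item(packet):
    """Map a packet to (data item name, data item frequency)."""
    name = (packet.src_ip, packet.dst_ip,
            packet.src_port, packet.dst_port, packet.protocol)
    return name, packet.payload_len


item = to_data_item(Packet("10.0.0.1", "10.0.0.2", 1234, 80, "TCP", 512))
```

With this mapping, a frequent item is a flow that carries a large share of the total payload bytes, not merely a flow with many packets.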

A relative value of the update delay coefficient β is defined. Fig. 2 shows the average relative error of the method of the invention when processing the 3 network data sets. It can be observed that, for ε ∈ [0.0001, 0.0005], the average relative error of the method is always less than the product of the current ε and N.

Fig. 3 shows the single-link communication overhead of the method of the invention when processing the 3 network data sets. Each subgraph of Fig. 3 has 4 curves; from top to bottom, they represent the theoretical maximum, the actual maximum, the actual mean, and the actual minimum of the single-link communication overhead for the current network data set. It can be observed that the actual maximum single-link communication overhead does not exceed the theoretical maximum.

Fig. 4 shows the initialization time of the method of the invention. It can be observed that the larger the relative value of the update delay coefficient β, the less initialization time the method requires.

Claims

1. A data stream frequent item mining method based on a distributed architecture, the method using a 2-layer tree-shaped communication structure comprising m leaf nodes and 1 root node; each leaf node is responsible for processing the data items in its data stream, storing the data item frequencies in the min-heap of the leaf node, and sending a data item frequency increment to the root node when that increment exceeds a threshold; the root node is responsible for computing the frequency estimate of each data item over the whole architecture and storing the data item frequency estimates in the min-heap of the root node; each entry stored in the min-heap of a leaf node includes a data item name, a data item frequency, and a data item frequency increment; each entry stored in the min-heap of the root node includes a data item name and a data item frequency estimate; the method includes:

Let v = v_t, c_v = c_v + c_{v,t}, Δ_v = c_{v,t};

Let v = v_t, c_v = c_{v,t}, Δ_v = c_{v,t};

Let c_v = c_v + c_{v,t}, Δ_v = Δ_v + c_{v,t};

Let c_v = c_v + Δ_{v,t}, where the equal sign denotes assignment;

Let v = v_t, c_v = c_v + Δ_{v,t};

where v_t denotes the data item name of the data item update taken;

Let v = v_t, c_v = Δ_{v,t};

2. The data stream frequent item mining method based on a distributed architecture according to claim 1, characterized in that, between step 4) and step 5), the method further includes an operation of sorting the min-heap H_i of the leaf node according to data item frequency.

3. The data stream frequent item mining method based on a distributed architecture according to claim 1, characterized in that, between step 5) and step 6), the method further includes an operation of sorting the min-heap H₀ of the root node according to data item frequency.