CN102073732A - Method for mining frequency episode from event sequence by using same node chains and Hash chains - Google Patents

Info

Publication number
CN102073732A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2011100201562A
Other languages
Chinese (zh)
Other versions
CN102073732B (en)
Inventor
林树宽
乔建忠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northeastern University China
Original Assignee
Northeastern University China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northeastern University China
Priority to CN201110020156.2A
Publication of CN102073732A
Application granted
Publication of CN102073732B
Legal status: Expired - Fee Related

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a method for mining minimal-occurrence frequent episodes from an event sequence, which extends low-order frequent episodes step by step to directly generate high-order frequent episodes. To find and count the minimal occurrences of episodes, the minimal occurrences of 2-episodes are found and counted by building an episode matrix and setting modification states on its elements, and the minimal occurrences of k-episodes are found and counted by extending frequent 2-episodes based on timestamp queues. By building an episode tree and using same node chains and hash chains, the method saves episode extension time and memory. The data need to be scanned only once during mining and no candidate episode set is generated, so mining efficiency is high and memory usage is low. The method also has the desirable property that mining time and cost do not change significantly with the frequency count threshold, and it can further be used to mine episodes from an event stream.

Description

Method for mining frequent episodes from an event sequence based on same node chains and hash chains
Technical field
The invention belongs to temporal data mining technology and specifically relates to a method and system for mining frequent episodes from an event sequence based on same node chains and hash chains.
Background
With the wide use of electronic data gathering equipment (EDGE), such as sensors and radio frequency identification (RFID) devices, in fields such as supply chain management, environmental monitoring, and the Internet of Things, large volumes of event-type data are being produced. Complex event processing (CEP) has therefore attracted growing attention and is gradually becoming a new research focus in the database field, following data streams. Frequent episode mining is an important research topic within CEP. Its methods apply to many areas, such as network intrusion detection, financial event and stock trend analysis, telecommunication network alarms, and the Internet of Things. By mining the frequent episodes in an event sequence, corresponding association rules can be established, revealing valuable information hidden in the event data. For example, in the various monitoring applications of the Internet of Things, sensors and RFID devices produce large amounts of monitoring data, which form an event sequence. The events in the sequence are not independent: an event at one time point may be necessarily related to events at other time points; that is, association rules exist in the event sequence. These rules can be produced by mining the frequent episodes in the event sequence, thereby capturing the event associations and occurrence patterns hidden in it.
An event in CEP is defined as a data pattern (eventtype, time, location, attr_1, attr_2, ..., attr_n), where eventtype is the event type name, time is the timestamp at which the event occurred, location is the place of occurrence, realized as the number of the sensor or RFID reader that detected the event, and attr_1, attr_2, ..., attr_n are event attributes. Events form an event sequence in ascending order of the time field. An episode is a partially ordered set of events occurring on the event sequence. From these definitions it follows that frequent episode mining on an event sequence differs from frequent pattern mining on a transaction database. The latter is really the mining of frequent itemsets and does not need to consider the order among the items inside a pattern, whereas events in an event sequence have a strong temporal character: the same set of events occurring in different orders forms different episodes, so episode mining must take into account the order in which the events inside an episode occur. Frequent episode mining also differs from sequential pattern mining: although both process event-type data, the former mines a single event sequence while the latter mines a set of event sequences. Existing frequent pattern and sequential pattern mining methods therefore cannot be used for frequent episode mining.
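The event definition above can be illustrated with a minimal sketch. The class name Event, the helper build_event_sequence, and the sample values are hypothetical and chosen only for illustration; the field names follow the definition in the text, and integer timestamps are an assumption.

```python
from typing import NamedTuple, Tuple


class Event(NamedTuple):
    """One event record: (eventtype, time, location, attr_1..attr_n)."""
    eventtype: str      # event type name
    time: int           # timestamp of the occurrence (assumed integer)
    location: str       # sensor or RFID reader number
    attrs: Tuple = ()   # remaining event attributes


def build_event_sequence(events):
    """Order events by ascending time field to form an event sequence."""
    return sorted(events, key=lambda ev: ev.time)


# Hypothetical sample data.
raw = [
    Event("temp_high", 30, "sensor-7"),
    Event("door_open", 10, "reader-2"),
    Event("temp_high", 20, "sensor-3"),
]
sequence = build_event_sequence(raw)
```

Because an episode depends on the order of events, the sort by time is the only preprocessing this sketch performs.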
Frequent episodes are usually divided into two classes according to how they are produced: window-based frequent episodes and minimal-occurrence frequent episodes. For building episode rules, mining window-based frequent episodes has the following problems: (1) multiple occurrences of an episode within one window are counted only once; (2) a single valid occurrence of an episode is counted several times because it is contained in different windows. These problems hinder the building of episode rules. Mining minimal-occurrence frequent episodes considers the actual number of occurrences of an episode rather than the number of windows containing it, which helps in building the corresponding episode rules. In addition, it focuses on occurrences that begin latest and end earliest (i.e., minimal occurrences), so rules built from minimal-occurrence frequent episodes help predict future events as early as possible. The present invention mines minimal-occurrence frequent episodes.
As for mining methods, existing episode mining mainly uses Apriori-like methods. The Apriori algorithm is the classic algorithm for mining frequent patterns on transaction databases or data streams. Apriori-like episode mining methods for event sequences borrow the mining idea of Apriori while taking the characteristics of event data into account, and they retain the characteristics of Apriori: a set of high-order candidate episodes is generated from the low-order frequent episodes, and the candidates are then evaluated to produce the high-order frequent episodes. Repeating this iterative process is very time-consuming, and the candidate episodes generated at each order occupy a large amount of memory, so both the time and the space performance of mining are poor. Moreover, because these methods must scan the data multiple times, they are limited to static data and cannot be extended to mining on event streams.
Summary of the invention
To address the problems of existing methods for mining minimal-occurrence frequent episodes on an event sequence, the invention provides a method for mining frequent episodes from an event sequence based on same node chains and hash chains.
The method proposed by the invention for mining minimal-occurrence frequent episodes on an event sequence generates high-order frequent episodes directly by extending low-order frequent episodes step by step using same node chains and hash chains. To find and count the minimal occurrences of episodes, the minimal occurrences of 2-episodes are found and counted by building an episode matrix and setting modification states on its elements, and the minimal occurrences of k-episodes (k>2) are found and counted by extending frequent 2-episodes based on timestamp queues.
The frequent episode mining method based on same node chains and hash chains provided by the invention comprises the following steps:
(1) Initialize the relevant data structures. Specifically: encode the event types contained in the event sequence in increasing natural-number order; initialize the structure array epi_1, which holds 1-episode information, and the episode matrix, which holds 2-episode information;
(2) Determine whether the event sequence has been completely scanned. If so, go to step (6); otherwise proceed to step (3);
(3) Read the next event (e, t) from the event sequence;
(4) Increment the occurrence count of event e by 1 and record the occurrence timestamp t in the corresponding element of array epi_1;
(5) Call the function GenMinOcc(e, t) to generate the minimal occurrence information of the 2-episodes related to event (e, t) and record it in the episode matrix; go to step (2);
(6) Select the frequent 1-episodes in array epi_1 and sort them in descending order of occurrence count to form the event queue queue;
(7) Set the queue index j to 1;
(8) Create a child node ce of the root in the tree, assign the relevant information of queue[j] to the data fields of ce, and assign ce to fe as the parent node for further extension;
(9) Set the child node number i to 1;
(10) If fe has been fully extended and all its child nodes have been created, i.e., i>n (n is the number of event types contained in the event sequence), go to step (18); otherwise proceed to step (11);
(11) Determine whether the 2-episode e_i-fe.name is frequent. If so, proceed to step (12); otherwise go to step (17);
(12) Create a child node ce of fe in the tree and assign the relevant information of the 2-episode e_i-fe.name to the data fields of ce;
(13) Encode the timestamp queue of ce;
If the timestamp queue is tq = (t_1, t_2, ..., t_q), its code tqcode is:
$$\mathrm{tqcode} = \mathrm{hash}(tq) = \Bigl(\sum_{i=1}^{q} t_i\Bigr) \bmod 100$$
(14) Call the function GenModLink(ce) to create or modify the same node chain and hash chain of ce;
(15) If the return value of GenModLink(ce) is 1, the same node chain of ce already exists and ce does not need to be extended further to generate its child nodes, so go to step (17); otherwise the same node chain of ce does not yet exist, so proceed to step (16);
(16) Extend ce further: assign ce to fe as the parent node and call the subprocess GenChild(fe) to generate the child nodes of fe;
(17) Increment the child node number i by 1 and go to step (10);
(18) Determine whether the end of the event queue queue has been reached. If so, go to step (20); otherwise proceed to step (19);
(19) Increment the queue index j by 1 and go to step (8);
(20) Output the episodes according to the episode semantics stored in the episode tree; the storage structure of the encoded frequent episode tree is shown in Fig. 6.
By extending low-order frequent episodes step by step using same node chains and hash chains, the invention generates all frequent episodes directly. The process scans the data only once and generates no candidate episode set, which markedly improves the time and space performance of episode mining and allows the method to be used for episode mining on event streams. Because same node chains and hash chains are built, the time cost of the method does not change significantly with the frequency count threshold.
Description of drawings
Fig. 1 is the overall flowchart of the inventive method;
Fig. 2 is the flowchart of subprocess GenMinOcc generating the minimal occurrence information of 2-episodes;
Fig. 3 is the flowchart of function GenModLink creating or modifying the same node chain and hash chain;
Fig. 4 is the flowchart of subprocess GenChild generating child nodes;
Fig. 5 is the flowchart of function EpiExtend generating the extension count and the extension timestamp queue;
Fig. 6 is the storage structure diagram of the encoded frequent episode tree.
Embodiment
The invention is described in further detail below with reference to the drawings and an example.
As shown in Fig. 1, the steps of the inventive method are:
(1) Initialize the relevant data structures, as follows:
1. Encode the event types contained in the event sequence in increasing natural-number order. If event type e is encoded as m, it is written e_m in the following description, where 1≤m≤n and n is the number of event types in the event sequence;
2. Initialize the count and time fields of the structure array epi_1, which holds 1-episode information, i.e., execute epi_1[m].count=0; epi_1[m].time=0, where 1≤m≤n;
epi_1 is a structure array of length n. Element epi_1[m] represents the 1-episode with code m and has 2 data fields:
count: the count of the 1-episode;
time: the timestamp at which the 1-episode occurred.
3. Initialize the episode matrix, which holds 2-episode information. In the program the episode matrix is a two-dimensional structure array epi_2 whose dimensions both have length n. Each array element epi_2[p][q] has 3 fields:
count: the count of the 2-episode e_p-e_q;
tq: the timestamp queue of the 2-episode e_p-e_q;
state: the modification state of matrix element [e_p, e_q]. It takes only 2 values: "0" means "modifiable" and "1" means "not modifiable".
Initializing the episode matrix means setting the initial values of these three fields, i.e., executing epi_2[p][q].count=0, epi_2[p][q].tq=NULL, epi_2[p][q].state=1, where 1≤p, q≤n;
(2) Determine whether the event sequence has been completely scanned. If so, go to step (6); otherwise proceed to step (3);
(3) Read the next event (e, t) and get the code m corresponding to event type e;
(4) Increment the count field of array element epi_1[m] by 1 and set its time field to t;
(5) Call subprocess GenMinOcc(e, t) and go to step (2);
The function of subprocess GenMinOcc(e, t) is to generate the minimal occurrence information of the 2-episodes related to event (e, t) and record it in the episode matrix.
The invention mines minimal-occurrence frequent episodes. Given an episode EP = e_1-e_2-...-e_n and an occurrence of it ep = (e_1, t_1)(e_2, t_2)...(e_n, t_n) (where t_1 < t_2 < ... < t_n satisfies the order relation), if there is no occurrence ep' = (e_1, t_1')(e_2, t_2')...(e_n, t_n') of EP such that t_1' ≥ t_1, t_n ≥ t_n', and t_n' - t_1' < t_n - t_1, then ep is called a minimal occurrence of episode EP.
The execution flow of subprocess GenMinOcc(e, t) is shown in Fig. 2 and is as follows:
(5.1) Initialize the row index r of the episode matrix to 1 and get the code m of event type e;
(5.2) Determine whether all matrix rows have been processed. If so, go to step (5.8); otherwise proceed to step (5.3);
(5.3) Determine whether the modification state of matrix element [e_r, e] is "modifiable", i.e., whether epi_2[r][m].state equals "0". If so, proceed to step (5.4); otherwise go to step (5.7);
(5.4) Increment the occurrence count of the 2-episode e_r-e by 1, i.e., execute epi_2[r][m].count++;
(5.5) Append timestamp t to the tail of the timestamp queue epi_2[r][m].tq;
(5.6) Set the state of matrix element [e_r, e] to "not modifiable", i.e., execute epi_2[r][m].state=1;
(5.7) Increment the row index r by 1 and go to step (5.2);
(5.8) Initialize the column index c of the episode matrix to 1;
(5.9) Determine whether all matrix columns have been processed. If so, return from GenMinOcc(e, t); otherwise proceed to step (5.10);
(5.10) Determine whether the modification state of matrix element [e, e_c] is "not modifiable", i.e., whether epi_2[m][c].state is "1". If so, proceed to step (5.11); otherwise go to step (5.12);
(5.11) Set the state of matrix element [e, e_c] to "modifiable", i.e., execute epi_2[m][c].state=0;
(5.12) Increment the column index c by 1 and go to step (5.9);
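Steps (5.1) to (5.12), together with the matrix initialization of step (1), can be sketched as follows. This is an illustrative reading of the steps, not the patent's source code: the names Cell, make_matrix, and gen_min_occ are hypothetical, event codes are 0-based here (the text uses 1..n), and the sample sequence is invented.

```python
from dataclasses import dataclass, field
from typing import List

MODIFIABLE, NOT_MODIFIABLE = 0, 1


@dataclass
class Cell:
    """One episode-matrix element epi_2[p][q] for the 2-episode e_p-e_q."""
    count: int = 0
    tq: List[int] = field(default_factory=list)
    state: int = NOT_MODIFIABLE


def make_matrix(n):
    """Initialize an n x n episode matrix as in step (1)."""
    return [[Cell() for _ in range(n)] for _ in range(n)]


def gen_min_occ(epi_2, m, t):
    """Record the 2-episode minimal occurrences ended by event (m, t)."""
    n = len(epi_2)
    # Steps (5.1)-(5.7): walk column m and count every modifiable e_r-e_m.
    for r in range(n):
        cell = epi_2[r][m]
        if cell.state == MODIFIABLE:
            cell.count += 1
            cell.tq.append(t)
            cell.state = NOT_MODIFIABLE
    # Steps (5.8)-(5.12): walk row m and reopen every e_m-e_c.
    for c in range(n):
        if epi_2[m][c].state == NOT_MODIFIABLE:
            epi_2[m][c].state = MODIFIABLE


# Hypothetical sequence with event types A=0 and B=1.
epi_2 = make_matrix(2)
for code, ts in [(0, 1), (1, 2), (0, 3), (0, 4), (1, 5), (1, 6)]:
    gen_min_occ(epi_2, code, ts)
```

On this sequence the 2-episode A-B is counted twice, at timestamps 2 and 5. The B at timestamp 6 is not counted again because no A occurred after the previous count, which is the role of the "not modifiable" state.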
(6) Select the frequent 1-episodes in array epi_1, i.e., the elements of epi_1 whose count field satisfies count≥min_sup (min_sup is the preset frequency count threshold), and sort the corresponding events in descending order of the count field to form the event queue queue;
The elements of array epi_1 with count≥min_sup constitute the frequent 1-episodes, so queue contains all frequent 1-episodes. In the program, queue is a structure array whose length is the number of frequent 1-episodes contained in the event sequence; each array element has 2 data fields:
name: the event type name
count: the count of the corresponding 1-episode
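Step (6) amounts to a filter followed by a descending sort. A minimal sketch, assuming the counts of epi_1 are held in a plain list indexed by a 0-based event code (the function name frequent_1_episodes and the sample counts are hypothetical):

```python
def frequent_1_episodes(epi_1_counts, min_sup):
    """Return (name, count) pairs with count >= min_sup, highest count first."""
    frequent = [
        (code, count)
        for code, count in enumerate(epi_1_counts)
        if count >= min_sup
    ]
    # The sort is stable, so equal counts keep their event-code order.
    frequent.sort(key=lambda item: item[1], reverse=True)
    return frequent


# Hypothetical counts for event codes 0..3 and a threshold of 5.
queue = frequent_1_episodes([5, 1, 9, 5], min_sup=5)
```

The text does not specify how ties in count are ordered; keeping the event-code order is an assumption of this sketch.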
(7) Set the queue index j to 1;
(8) Create a child node ce of the root in the episode tree, assign the relevant information of queue[j] to the name and count fields of ce, and assign ce to fe as the parent node for further extension, i.e., execute: ce.name=queue[j].name; ce.count=queue[j].count; fe=ce;
All frequent episodes on the event sequence are stored in the episode tree, so building the tree is the process of mining frequent episodes. A node at level k of the tree is called a k-episode node and represents a frequent episode of length k. The first event of this k-episode is the k-episode node itself, its tail event is the 1-episode node, and the nodes on the path from the k-episode node to the 1-episode node are the intermediate events of the episode, in order from higher level to lower level.
Nodes at different levels of the tree have different data fields as needed. A level-1 node has 3 fields: name, count, and next. A level-k node (k>1) has 5 fields: name, count, tq/tqcode, next, and samenext. The fields have the following meanings:
name: the event type name
count: the count of the episode represented by the node
tq: the timestamp queue (the node holds this field before encoding)
tqcode: the timestamp queue code (the node holds this field after encoding)
next: pointer to the child node
samenext: pointer to the next node in the same node chain
The timestamp queue tq is defined as the queue formed, in ascending order, by the timestamps of the first events of all minimal occurrences of an episode. The invention extends low-order episodes based on the timestamp queue tq to generate frequent episodes of arbitrary length.
queue[j] is a frequent 1-episode, so this step adds it to the episode tree as a level-1 node. After the child node ce is created, it is assigned to fe so that this 1-episode can be extended in later steps.
(9) Set the child node number i to 1;
The number i is the index of the child node being created when fe is extended horizontally.
(10) If fe has been fully extended and all its child nodes have been created, i.e., i>n, go to step (18); otherwise proceed to step (11);
(11) Determine whether the 2-episode e_i-fe.name is frequent: get the code m of event fe.name; if epi_2[i][m].count≥min_sup, the 2-episode e_i-fe.name is frequent, so proceed to step (12); otherwise go to step (17);
(12) Create a child node ce of fe in the tree, assign the relevant information of the 2-episode e_i-fe.name to the name, count, and tq fields of ce, and set its next and samenext fields to null;
In this step, child node ce is a level-2 node representing the 2-episode e_i-fe.name. Its fields are set as follows:
ce.name=e_i; ce.count=epi_2[i][m].count; ce.tq=epi_2[i][m].tq;
ce.next=NULL; ce.samenext=NULL;
(13) Encode the timestamp queue tq of ce.
If the timestamp queue is tq = (t_1, t_2, ..., t_q), its code tqcode is:
$$\mathrm{tqcode} = \mathrm{hash}(tq) = \Bigl(\sum_{i=1}^{q} t_i\Bigr) \bmod 100$$
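The coding of step (13) can be sketched in a few lines. This reads the formula as the sum of the timestamps taken modulo 100, which is an interpretation chosen to agree with the 100 hash chains (hash values 0 to 99) described for GenModLink; integer timestamps and the function name tq_code are assumptions.

```python
def tq_code(tq, buckets=100):
    """Hash a timestamp queue to a code in the range 0..buckets-1."""
    return sum(tq) % buckets


# Hypothetical timestamp queues.
code_a = tq_code([3, 7, 12])      # sum is 22
code_b = tq_code([150, 260])      # sum is 410
```

Two different queues can share a code (a collision), which is why the hash chain lookup still compares name, count, and timestamp queue before treating two nodes as same nodes.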
(14) Call the function GenModLink(ce) to create or modify the same node chain and hash chain of ce;
The invention sets up 100 hash chains in total according to the codes of the nodes in the tree; nodes with the same code are linked into the same hash chain. The head pointers of the hash chains are stored in the array hashlink[100]: hashlink[0] to hashlink[99] hold the head pointers of the chains with hash values 0 to 99, respectively. In the initial state no hash chain has been created, so hashlink[k]=NULL is set for 0≤k<100. Each node in a hash chain has 5 data fields:
name: the event type name
count: the episode count
tq: the timestamp queue
next: pointer to the next node in the hash chain
psame: pointer to the head node of the same node chain
In the invention, nodes in the episode tree whose event type, count, and timestamp queue are all identical are called same nodes; all same nodes of the same event type are linked together to form a same node chain. Depending on the event types, there may be several same node chains in the episode tree.
The execution flow of function GenModLink(ce) is shown in Fig. 3 and is as follows:
(14.1) Set the flag variable flag to 0;
The variable flag indicates whether the same node chain corresponding to ce has been created: flag=0 means the chain has not yet been created; flag=1 means it has.
Because no same node chain exists at the start, flag is set to 0 in this step.
(14.2) Determine whether the hash chain of ce exists, i.e., whether hashlink[ce.tqcode] is NULL. If it is NULL, the hash chain of ce has not yet been created, so proceed to step (14.3); otherwise go to step (14.4);
(14.3) Allocate a hash chain node node and set hashlink[ce.tqcode]=node to create the corresponding hash chain; go to step (14.6);
(14.4) Determine whether the information of ce is already in the corresponding hash chain. If it is, go to step (14.7); otherwise proceed to step (14.5);
(14.5) Append a new node node at the tail of the hash chain;
(14.6) Assign the fields of node: node.name=ce.name; node.count=ce.count; node.tq=ce.tq; node.next=NULL; node.psame=ce. GenModLink(ce) then returns with return value flag=0;
(14.7) Append ce to the tail of the corresponding same node chain;
(14.8) Set the flag variable flag to 1. GenModLink(ce) then returns with return value flag=1.
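Steps (14.1) to (14.8) can be sketched as below. This is an illustrative reading rather than the patent's code: the class names TreeNode and HashNode, the function name gen_mod_link, passing hashlink as a parameter, and comparing name, count, and tq to decide whether "the information of ce is in the hash chain" are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(eq=False)
class TreeNode:
    """Episode-tree node at level k > 1 (child pointer omitted here)."""
    name: int
    count: int
    tq: List[int]
    tqcode: int
    samenext: Optional["TreeNode"] = None


@dataclass(eq=False)
class HashNode:
    """Hash-chain node; psame points at the head of a same node chain."""
    name: int
    count: int
    tq: List[int]
    next: Optional["HashNode"] = None
    psame: Optional[TreeNode] = None


def gen_mod_link(hashlink, ce):
    """Return 1 if ce joined an existing same node chain, else 0."""
    node, last = hashlink[ce.tqcode], None
    while node is not None:                      # step (14.4)
        if (node.name, node.count, node.tq) == (ce.name, ce.count, ce.tq):
            tail = node.psame                    # step (14.7)
            while tail.samenext is not None:
                tail = tail.samenext
            tail.samenext = ce
            return 1                             # step (14.8)
        last, node = node, node.next
    new = HashNode(ce.name, ce.count, list(ce.tq), psame=ce)  # step (14.6)
    if last is None:
        hashlink[ce.tqcode] = new                # step (14.3)
    else:
        last.next = new                          # step (14.5)
    return 0


# Hypothetical nodes: a and b are same nodes, c only collides on the code.
hashlink = [None] * 100
a = TreeNode(1, 2, [3, 19], 22)
b = TreeNode(1, 2, [3, 19], 22)
c = TreeNode(2, 2, [10, 12], 22)
flags = [gen_mod_link(hashlink, x) for x in (a, b, c)]
```

A return value of 1 tells the caller that ce can share the children of the chain head, so its own extension can be skipped, as described in step (15).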
(15) If the return value of GenModLink(ce) is 1, the same node chain of ce already exists and ce does not need to be extended further to generate its child nodes, so go to step (17); otherwise the same node chain of ce does not yet exist, so proceed to step (16);
(16) Assign ce to fe as the parent node and call subprocess GenChild(fe) to generate the child nodes of fe;
Generating the child nodes of fe in subprocess GenChild(fe) is the process of repeatedly extending node fe to generate all high-order episode nodes that have fe as suffix. It comprises horizontal extension and vertical extension. Horizontal extension extends fe in width and is carried out on the basis of the episode represented by fe and the frequent 2-episodes whose tail event is fe. Vertical extension extends fe in depth and is realized by recursively calling GenChild(fe). The execution flow of subprocess GenChild(fe) is shown in Fig. 4 and is as follows:
(16.1) Initialize the horizontal extension number hi to 1;
(16.2) Determine whether the horizontal extension of fe is finished, i.e., whether hi>n, where n is the number of event types in the event sequence. If hi>n, go to step (16.12); otherwise the horizontal extension of fe is not yet complete, so proceed to step (16.3) to continue it;
(16.3) Determine whether the 2-episode e_hi-fe, whose tail event is fe, is frequent. If so, proceed to step (16.4); otherwise go to step (16.11);
(16.4) Get the event type number m of fe.name and call the function EpiExtend(fe.tq, epi_2[hi][m].tq) to extend fe horizontally;
The input parameters of function EpiExtend(tq1, tq2) are tq1=fe.tq and tq2=epi_2[hi][m].tq, i.e., the timestamp queue of the episode represented by fe and the timestamp queue of the 2-episode e_hi-fe, respectively. Comparing the two realizes the horizontal extension of fe and produces its extension episode; the extension count and the extension timestamp queue are produced as the function's output.
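Let the k-episode (k≥2) be:
[formula image: Figure BDA0000044167070000081]
and the 2-episode be:
[formula image: Figure BDA0000044167070000082]
If
[formula image: Figure BDA0000044167070000083]
then the extension episode of EP1 and EP2, i.e., the (k+1)-episode, is:
[formula image: Figure BDA0000044167070000084]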
If the extension count of EP satisfies ext_count≥min_sup, EP is frequent and a child node must be created for node fe.
The execution flow of function EpiExtend(tq1, tq2) is shown in Fig. 5 and is as follows:
(16.4.1) Initialize the relevant variables: the extension count ext_count is initialized to 0; the indices p and q of queues tq2 and tq1 are initialized to 0; the extension timestamp queue ext_tq is initialized to empty;
(16.4.2) Determine whether tq1 or tq2 has reached its tail. If neither has, proceed to step (16.4.3); otherwise return from EpiExtend(tq1, tq2);
(16.4.3) Compare tq2[p] and tq1[q]. If tq2[p]≥tq1[q], proceed to step (16.4.4); otherwise go to step (16.4.5);
(16.4.4) Increment the tq1 queue index q by 1 and repeat step (16.4.3);
(16.4.5) Increment the extension count ext_count by 1 and increment the tq2 queue index p by 1;
(16.4.6) Compare tq2[p] and tq1[q]. If tq2[p]<tq1[q], proceed to step (16.4.7); otherwise go to step (16.4.8);
(16.4.7) Increment the tq2 queue index p by 1 and repeat step (16.4.6);
(16.4.8) Append tq2[p-1] to the tail of the extension timestamp queue ext_tq and go to step (16.4.2).
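Steps (16.4.1) to (16.4.8) describe a two-pointer merge over the two timestamp queues, which can be sketched as follows. The function name epi_extend is hypothetical. Note one assumption: steps (16.4.4) and (16.4.7) repeat a comparison without re-checking the queue bounds, so this sketch adds explicit bounds checks to keep the indices valid.

```python
def epi_extend(tq1, tq2):
    """Join the queue of a k-episode (tq1) with that of a 2-episode (tq2).

    For each tq1 entry, the latest earlier tq2 entry (if any unused one
    exists) yields one extended occurrence.
    Returns (ext_count, ext_tq).
    """
    ext_count, ext_tq = 0, []
    p = q = 0                                   # indices into tq2 and tq1
    while p < len(tq2) and q < len(tq1):        # step (16.4.2)
        if tq2[p] >= tq1[q]:                    # step (16.4.3)
            q += 1                              # step (16.4.4)
            continue
        ext_count += 1                          # step (16.4.5)
        p += 1
        # Steps (16.4.6)-(16.4.7): skip to the latest tq2 entry before tq1[q].
        while p < len(tq2) and tq2[p] < tq1[q]:
            p += 1
        ext_tq.append(tq2[p - 1])               # step (16.4.8)
    return ext_count, ext_tq


# Hypothetical queues.
result = epi_extend([5, 9, 14], [1, 3, 7, 12, 20])
```

In this example, timestamp 1 is skipped in favor of 3 because 3 is the latest tq2 entry before 5; choosing the latest start is what keeps the extended occurrence minimal.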
(16.5) Determine whether the extension episode is frequent, i.e., whether the extension count satisfies ext_count≥min_sup. If so, proceed to step (16.6); otherwise go to step (16.11);
(16.6) Create a child node ce of fe in the tree and set ce.name=e_hi; ce.count=ext_count; ce.tq=ext_tq; ce.next=NULL; ce.samenext=NULL;
(16.7) Encode ce.tq according to the coding formula of step (13);
(16.8) Call the function GenModLink(ce) to create or modify the same node chain and hash chain of ce; for the detailed steps of GenModLink(ce), see (14.1)-(14.8);
(16.9) If the return value of GenModLink(ce) is 1, ce does not need vertical extension, so go to step (16.11); otherwise proceed to step (16.10);
(16.10) Assign ce to fe and recursively call GenChild(fe) to generate the child nodes of fe, realizing vertical extension;
(16.11) Increment the horizontal extension number hi by 1 and go to step (16.2);
(16.12) Return to the parent node of fe and assign it to fe; assign its horizontal extension number to the variable hi;
(16.13) Determine whether fe is a 1-episode node. If not, proceed to step (16.14). If fe is a 1-episode node, all extensions of fe are complete and all frequent episodes with fe as suffix have been generated, so subprocess GenChild(fe) ends.
(16.14) Recursively call subprocess GenChild(fe), increment the horizontal extension number hi by 1, and go to step (16.2).
(17) Increment the child node number i by 1 and go to step (10);
(18) Determine whether the end of the event queue queue has been reached. If so, the complete episode tree has been built and all frequent episodes on the event sequence have been mined and stored in the tree, so go to step (20); otherwise proceed to step (19);
(19) Increment the queue index j by 1 and go to step (8) to continue mining the frequent episodes with queue[j].name as suffix;
(20) Output the episodes according to the episode semantics stored in the episode tree.
According to the episode semantics represented by the tree, each node represents the frequent episode formed by the branch running from that node to its 1-episode node. The paths from every non-1-episode node to its 1-episode node therefore make up all the frequent episodes. For output, non-1-episode nodes fall into two cases, which are handled as follows:
1. The non-1-episode node is the head node of its same-node chain
In this case, every node on the path from the non-1-episode node to the 1-episode node was generated by extending the node in the layer above it, so the nodes on the path can be output directly: the high-level node is the first node of the episode, the nodes on the path follow as intermediate nodes in order from high level to low level, and the 1-episode node on the path is the tail node.
2. The non-1-episode node is not the head node of its same-node chain
In this case, the extension of the non-1-episode node into child nodes was skipped when the same-node chain was built, and the node shares the same child nodes with the other nodes on the chain. The nodes on the path from this node to the 1-episode node can therefore form only some of the frequent episodes; the remaining frequent episodes must be generated through the same-node chain. This requires locating the head of the same-node chain, as follows:
Let the non-1-episode node be node1. Search the hash chain hashlink[node1.tqcode] for node1.name and let the hash node found be node2; node2.psame then points to the head node of the same-node chain containing node1.
The episode output path consists of two parts: one part is formed by all child nodes of the chain-head node, in order from high level to low level, and the other part is formed by the path from the chain-head node to the 1-episode node. Every node on the complete path represents a corresponding frequent episode, and each can be output.
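The head-node case and the chain-head lookup described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the class names Node and HashNode, the parent field, and the use of a dict for hashlink are hypothetical; only name, tqcode, psame and next come from the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class Node:
    """Episode-tree node (assumed layout; parent is None only for the root)."""
    name: str
    tqcode: int = 0
    parent: Optional["Node"] = None


@dataclass(eq=False)
class HashNode:
    """Hash-chain node; psame points to the head of a same-node chain."""
    name: str
    psame: Node
    next: Optional["HashNode"] = None


def episode_of(node: Node) -> list:
    """Case 1: list the events from the node down to its 1-episode node.

    The node itself is the first event and the 1-episode node (a child
    of the root) is the tail event.
    """
    events = []
    while node.parent is not None:  # stop when the root is reached
        events.append(node.name)
        node = node.parent
    return events


def chain_head(node1: Node, hashlink: dict) -> Optional[Node]:
    """Case 2: walk hashlink[node1.tqcode] looking for node1.name."""
    node2 = hashlink.get(node1.tqcode)
    while node2 is not None and node2.name != node1.name:
        node2 = node2.next
    return node2.psame if node2 is not None else None
```

Here episode_of covers only the direct output of a path; combining a non-head node with the children of its chain head follows the two-part rule stated above and is not shown.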
The method of the invention was tested on a large amount of data, which confirmed its advantages in time and space performance over the traditional Apriori-like method. One example follows.
Hardware environment: a PC with a 2.66 GHz Intel CPU and 3.5 GB of memory.
Software environment: the operating system is Windows XP Professional 2002 and the development language is Visual C++ 6.0.
Test data: the data (http://db.csail.mit.edu/labdata/labdata.html) produced between February 28, 2004 and April 5, 2004 by 54 wireless sensors deployed in the Intel Berkeley laboratory. Each event carries attributes such as sensor number, timestamp and temperature. For the test, 27 sensors were selected and the temperature values were divided into 4 intervals, giving 27*4=108 event types; the event sequence contains at most 100,000 events.
Test scheme and results: the proposed method was tested in the following two respects.
1. For an event sequence of fixed length, different frequency-count thresholds are set, and the mining time of the method of the invention is measured and compared with that of the traditional Apriori-like method;
As the steps described above show, the frequency-count threshold min-sup is used throughout the mining process to measure how frequent an episode is. Because the event sequence is long, the values of min-sup are large. For convenience, the results below are expressed in terms of the frequency threshold min_fre instead, converted as follows:
min_fre=min-sup/len_seq
where len_seq is the length of the event sequence.
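As a worked illustration of this conversion, the following Python sketch converts between the two thresholds. The function names are hypothetical, and the sequence length of 100,000 is only the upper bound stated for the test data, used here as an assumed example value.

```python
def to_min_fre(min_sup: int, len_seq: int) -> float:
    """Frequency threshold as a fraction of the sequence length."""
    return min_sup / len_seq


def to_min_sup(min_fre: float, len_seq: int) -> int:
    """Inverse conversion; round() guards against floating-point noise."""
    return round(min_fre * len_seq)


len_seq = 100_000                      # assumed example length
print(to_min_fre(2_500, len_seq))      # fraction; multiply by 100 for percent
print(to_min_sup(0.025, len_seq))      # occurrence count for min_fre = 2.5%
```

Under that assumed length, a threshold of min_fre = 2.5% corresponds to an occurrence count of 2,500.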
The following table shows how the mining time varies with the frequency threshold min_fre:
min_fre (%)                   2.5       2.9       3.3       3.7       4.1       4.5
Method of the invention (s)   30.219    19.362    14.063    9.827     5.578     4.803
Apriori-like method (s)       221.861   112.513   45.643    20.189    5.718     5.159
As the table shows, when min_fre is very large the mining times of the method of the invention and the Apriori-like method are close. With a very large min_fre there are hardly any frequent episodes, so the advantage of the method of the invention is not apparent. In practical mining, an excessively large min_fre is not meaningful. When min_fre is not excessively large, the mining time of the method of the invention is markedly shorter than that of the Apriori-like method, and the method has the desirable property that its mining time does not change significantly with the frequency threshold.
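To put numbers on this comparison, the ratios can be computed directly from the timing table. The sketch below uses only the figures reported in the table; the variable and function names are illustrative.

```python
# Timing figures copied from the table above (seconds).
min_fre_pct = [2.5, 2.9, 3.3, 3.7, 4.1, 4.5]
invention_s = [30.219, 19.362, 14.063, 9.827, 5.578, 4.803]
apriori_s = [221.861, 112.513, 45.643, 20.189, 5.718, 5.159]


def ratios(numer, denom):
    """Element-wise ratio of two equally long lists."""
    return [a / b for a, b in zip(numer, denom)]


# How many times slower the Apriori-like method is at each threshold.
speedup = ratios(apriori_s, invention_s)
for f, s in zip(min_fre_pct, speedup):
    print(f"min_fre = {f}%: Apriori-like / invention = {s:.2f}")

# How much each method's own time varies across the tested thresholds.
invention_spread = max(invention_s) / min(invention_s)
apriori_spread = max(apriori_s) / min(apriori_s)
print(f"invention spread: {invention_spread:.1f}x")
print(f"Apriori-like spread: {apriori_spread:.1f}x")
```

On these figures the ratio falls from about 7.3 at min_fre = 2.5% to close to 1 at 4.1% and above, and the mining time of the method of the invention spans a factor of about 6.3 across the table, against about 43 for the Apriori-like method.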
2. For an event sequence of fixed length, different frequency-count thresholds are set, and the memory occupied by the method of the invention is measured and compared with the memory required by the traditional Apriori-like method.
The following table shows how memory usage varies with the frequency threshold min_fre:
min_fre (%)                   2.5       2.9       3.3       3.7       4.1       4.5
Method of the invention (KB)  899.797   698.423   453.367   376.103   296.141   286.572
Apriori-like method (KB)      5403.780  4212.039  3433.19   2497.941  1638.850  1103.417
As the table shows, the method of the invention has a marked advantage in memory usage, and the smaller the frequency threshold min_fre, the greater the advantage: the Apriori-like method needs about 6.0 times as much memory at min_fre = 2.5%, against about 3.9 times at 4.5%.

Claims (5)

1. A method for mining frequent episodes from an event sequence based on same-node chains and hash chains, comprising the following steps:
(1) Initialize the related data structures, including:
1. encoding the event types contained in the event sequence in increasing natural-number order;
2. initializing the count and time fields of the structure array epi_1, which holds the 1-episode information;
3. initializing the episode matrix, which holds the 2-episode information;
(2) Determine whether the event sequence has been fully scanned. If so, go to step (6); otherwise proceed to step (3);
(3) Read the next event (e, t) from the event sequence;
(4) Increment the occurrence count of event type e by 1 and record the occurrence timestamp t in the corresponding element of the array epi_1;
(5) Call the function GenMinOcc(e, t) to generate the minimal-occurrence information of the 2-episodes related to event (e, t) and record it in the episode matrix; go to step (2);
(6) Select the frequent 1-episodes from the array epi_1 and sort them in descending order of occurrence count to form the event queue queue;
(7) Set the index j of the queue to 1;
(8) Create a child node ce of the root in the episode tree, assign the information of queue[j] to the corresponding data fields of ce, and assign ce to fe as the parent node for further extension;
(9) Set the child-node number i to 1;
(10) If fe has been fully extended and all of its child nodes have been created, that is, i > n (n is the number of event types contained in the event sequence), go to step (18); otherwise proceed to step (11);
(11) Determine whether the 2-episode e_i-fe.name is frequent. If so, proceed to step (12); otherwise go to step (17);
(12) Create a child node ce of fe in the tree and assign the information of the 2-episode e_i-fe.name to the corresponding data fields of ce;
(13) Encode the timestamp queue of ce;
If the timestamp queue is tq = (t_1, t_2, ..., t_q), its code tqcode is:
[Equation image not reproduced: formula giving tqcode in terms of t_1, t_2, ..., t_q]
(14) Call the function GenModLink(ce) to create or modify the same-node chain and hash chain of ce;
(15) If GenModLink(ce) returns 1, the same-node chain of ce already exists and ce needs no further extension to generate its child nodes; go to step (17). Otherwise the same-node chain of ce does not yet exist; proceed to step (16);
(16) Extend ce further: assign ce to fe as the parent node and call the subprocess GenChild(fe) to generate the child nodes of fe;
(17) Increment the child-node number i by 1 and go to step (10);
(18) Determine whether the event queue queue has reached its end. If so, go to step (20); otherwise proceed to step (19);
(19) Increment the index j of the queue by 1 and go to step (8);
(20) Output the episodes according to the episode semantics stored in the episode tree.
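Steps (2) to (7) of claim 1 (the single scan that counts 1-episodes, then the selection and sorting that produce the event queue) can be sketched in Python as follows. This is an illustration only: the function name, the dict-based stand-in for the array epi_1, and the test count >= min_sup for "frequent" are assumptions; the call to GenMinOcc in step (5) is omitted here.

```python
def build_event_queue(events, min_sup):
    """Scan (e, t) pairs once and return the frequent 1-episodes,
    sorted by descending occurrence count, as (event type, info) pairs."""
    epi_1 = {}  # stand-in for the structure array epi_1
    for e, t in events:                       # steps (2)-(4)
        info = epi_1.setdefault(e, {"count": 0, "tq": []})
        info["count"] += 1
        info["tq"].append(t)
    # Step (6): keep frequent 1-episodes (assumed test: count >= min_sup).
    queue = [(e, info) for e, info in epi_1.items()
             if info["count"] >= min_sup]
    # list.sort is stable, so ties keep their first-seen order.
    queue.sort(key=lambda item: item[1]["count"], reverse=True)
    return queue


events = [("A", 1), ("B", 2), ("A", 3), ("C", 4), ("B", 5), ("A", 6)]
queue = build_event_queue(events, min_sup=2)
print([e for e, _ in queue])
```

Each entry of the returned queue would then seed one child of the root in step (8).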
2. The method according to claim 1, characterized in that the subprocess GenMinOcc(e, t) called in step (5) executes as follows:
(5.1) Initialize the row index r of the episode matrix to 1 and obtain the code m of event type e;
(5.2) Determine whether all rows of the matrix have been processed. If so, go to step (5.8); otherwise proceed to step (5.3);
(5.3) Determine whether the modification state of the matrix element [e_r, e] is "modifiable", that is, whether epi_2[r][m].state equals 0. If so, proceed to step (5.4); otherwise go to step (5.7);
(5.4) Increment the occurrence count of the 2-episode e_r-e by 1, that is, execute epi_2[r][m].count++;
(5.5) Append the timestamp t to the tail of the timestamp queue epi_2[r][m].tq;
(5.6) Set the state of the matrix element [e_r, e] to "not modifiable", that is, execute epi_2[r][m].state=1;
(5.7) Increment the row index r by 1 and go to step (5.2);
(5.8) Initialize the column index c of the episode matrix to 1;
(5.9) Determine whether all columns of the matrix have been processed. If so, return from GenMinOcc(e, t); otherwise proceed to step (5.10);
(5.10) Determine whether the modification state of the matrix element [e, e_c] is "not modifiable", that is, whether epi_2[m][c].state is 1. If so, proceed to step (5.11); otherwise go to step (5.12);
(5.11) Set the state of the matrix element [e, e_c] to "modifiable", that is, execute epi_2[m][c].state=0;
(5.12) Increment the column index c by 1 and go to step (5.9).
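Steps (5.1) to (5.12) can be sketched in Python as follows. Two points are assumptions rather than statements from the claim: every matrix element is taken to start in the "not modifiable" state (1), so that e_r-e is counted only after e_r has actually occurred, and indices are 0-based where the claim counts from 1. The dict-based cell layout and the function names are illustrative.

```python
def make_matrix(n):
    """n x n episode matrix; cell [r][c] describes the 2-episode e_r-e_c.
    Assumed initial state: 1 ("not modifiable")."""
    return [[{"count": 0, "tq": [], "state": 1} for _ in range(n)]
            for _ in range(n)]


def gen_min_occ(epi_2, m, t):
    """Process one event whose type has code m and timestamp t."""
    n = len(epi_2)
    # Steps (5.1)-(5.7): column m holds the 2-episodes that END with e.
    for r in range(n):
        cell = epi_2[r][m]
        if cell["state"] == 0:        # "modifiable": e_r seen, e pending
            cell["count"] += 1
            cell["tq"].append(t)
            cell["state"] = 1         # closed until e_r occurs again
    # Steps (5.8)-(5.12): row m holds the 2-episodes that START with e.
    for c in range(n):
        if epi_2[m][c]["state"] == 1:
            epi_2[m][c]["state"] = 0  # now waiting for e_c


# Event types coded A=0, B=1; sequence A@1, B@2, B@3, A@4, A@5, B@6.
epi_2 = make_matrix(2)
for code, ts in [(0, 1), (1, 2), (1, 3), (0, 4), (0, 5), (1, 6)]:
    gen_min_occ(epi_2, code, ts)
print(epi_2[0][1]["count"], epi_2[0][1]["tq"])  # occurrences of A-B
```

Because the row pass runs before the column pass, the diagonal element e-e is counted for the current event before it is reopened, so repeated events are handled by the same two loops.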
3. The method according to claim 1, characterized in that the function GenModLink(ce) called in step (14) executes as follows:
(14.1) Set the flag variable flag to 0;
(14.2) Determine whether a hash chain for ce exists, that is, whether hashlink[ce.tqcode] is NULL. If it is NULL, the hash chain of ce has not yet been created; proceed to step (14.3). Otherwise go to step (14.4);
(14.3) Allocate a hash chain node node and set hashlink[ce.tqcode]=node to create the corresponding hash chain; go to step (14.6);
(14.4) Determine whether the information of ce is already in the corresponding hash chain. If so, go to step (14.7); otherwise proceed to step (14.5);
(14.5) Append a new node node to the tail of the hash chain;
(14.6) Assign the fields of node as follows: node.name=ce.name; node.count=ce.count; node.tq=ce.tq; node.next=NULL; node.psame=ce; GenModLink(ce) then returns, with return value flag=0;
(14.7) Append ce to the tail of the corresponding same-node chain;
(14.8) Set the flag variable flag to 1; GenModLink(ce) then returns, with return value flag=1.
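Steps (14.1) to (14.8) can be sketched in Python as follows. Several details are assumptions: the coding formula of step (13) is not reproduced in this text, so tq_code below is only a placeholder hash; "the information of ce is already in the hash chain" is read as a hash node with the same name and the same timestamp queue; and hashlink is modelled as a dict. The class names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class TreeNode:
    name: str
    count: int
    tq: list
    tqcode: int = 0
    samenext: Optional["TreeNode"] = None  # next node on the same-node chain


@dataclass(eq=False)
class HashNode:
    name: str
    count: int
    tq: list
    psame: TreeNode                        # head of the same-node chain
    next: Optional["HashNode"] = None


def tq_code(tq, buckets=1024):
    """Placeholder for the coding formula of step (13), NOT the patent's."""
    return sum(tq) % buckets


def gen_mod_link(ce, hashlink):
    """Return 1 if ce joined an existing same-node chain, otherwise 0."""
    node, last = hashlink.get(ce.tqcode), None
    while node is not None:                              # step (14.4)
        if node.name == ce.name and node.tq == ce.tq:
            tail = node.psame                            # step (14.7)
            while tail.samenext is not None:
                tail = tail.samenext
            tail.samenext = ce
            return 1                                     # step (14.8)
        node, last = node.next, node
    new = HashNode(ce.name, ce.count, ce.tq, psame=ce)   # step (14.6)
    if last is None:
        hashlink[ce.tqcode] = new                        # step (14.3)
    else:
        last.next = new                                  # step (14.5)
    return 0
```

A return value of 1 is what lets step (15) skip extending ce: a node with the same event type and the same timestamp queue has already been extended, and ce shares its children.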
4. The method according to claim 1, characterized in that the subprocess GenChild(fe) called in step (16) executes as follows:
(16.1) Initialize the horizontal extension number hi to 1;
(16.2) Determine whether the horizontal extension of fe is finished, that is, whether hi > n, where n is the number of event types in the event sequence. If hi > n, go to step (16.12); otherwise proceed to step (16.3) to continue the horizontal extension;
(16.3) Determine whether the 2-episode e_hi-fe, whose tail event is fe, is frequent. If so, proceed to step (16.4); otherwise go to step (16.11);
(16.4) Obtain the event-type number m of fe.name and call the function EpiExtend(fe.tq, epi_2[hi][m].tq) to extend fe horizontally;
(16.5) Determine whether the extended episode is frequent, that is, whether the extension count ext_count >= min_sup. If so, proceed to step (16.6); otherwise go to step (16.11);
(16.6) Create a child node ce of fe in the tree and set ce.name=e_hi; ce.count=ext_count; ce.tq=ext_tq; ce.next=NULL; ce.samenext=NULL;
(16.7) Encode ce.tq using the coding formula of step (13);
(16.8) Call the function GenModLink(ce) to create or modify the same-node chain and hash chain of ce; GenModLink(ce) executes according to steps (14.1)-(14.8);
(16.9) If GenModLink(ce) returns 1, ce needs no vertical extension; go to step (16.11). Otherwise proceed to step (16.10);
(16.10) Assign ce to fe and recursively call the process GenChild(fe) to generate the child nodes of fe, realizing the vertical extension;
(16.11) Increment the horizontal extension number hi by 1 and go to step (16.2);
(16.12) Return to the parent node of fe and assign it to fe; assign its horizontal extension number to the variable hi;
(16.13) Determine whether fe is a 1-episode node. If not, proceed to step (16.14); if fe is a 1-episode node, the subprocess GenChild(fe) ends;
(16.14) Recursively call the subprocess GenChild(fe), increment the horizontal extension number hi by 1, and go to step (16.2).
5. The method according to claim 4, characterized in that the function EpiExtend(tq1, tq2) called in step (16.4) executes as follows:
(16.4.1) Initialize the related variables: set the extension count ext_count to 0; set the indices p and q of the queues tq2 and tq1, respectively, to 0; set the extension timestamp queue ext_tq to empty;
(16.4.2) Determine whether tq1 or tq2 has reached its tail. If neither has, proceed to step (16.4.3); otherwise return from the function EpiExtend(tq1, tq2);
(16.4.3) Compare tq2[p] with tq1[q]. If tq2[p] >= tq1[q], proceed to step (16.4.4); otherwise go to step (16.4.5);
(16.4.4) Increment the index q of tq1 by 1 and repeat step (16.4.3);
(16.4.5) Increment the extension count ext_count by 1 and increment the index p of tq2 by 1;
(16.4.6) Compare tq2[p] with tq1[q]. If tq2[p] < tq1[q], proceed to step (16.4.7); otherwise go to step (16.4.8);
(16.4.7) Increment the index p of tq2 by 1 and repeat step (16.4.6);
(16.4.8) Append tq2[p-1] to the tail of the extension timestamp queue ext_tq and go to step (16.4.2).
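Steps (16.4.1) to (16.4.8) can be sketched in Python as follows. The claim does not say what happens when an index runs past the end of a queue inside the repeated steps (16.4.4) and (16.4.7); the bounds checks below are an added assumption, as are the function name and the choice to return the pair (ext_count, ext_tq).

```python
def epi_extend(tq1, tq2):
    """Merge two ascending timestamp queues as in steps (16.4.1)-(16.4.8).

    tq1 is the queue of the node being extended (fe.tq); tq2 is the queue
    of the 2-episode used for the extension. Returns (ext_count, ext_tq).
    """
    ext_count, ext_tq = 0, []
    p = q = 0                       # p indexes tq2, q indexes tq1
    while p < len(tq2) and q < len(tq1):            # step (16.4.2)
        # Steps (16.4.3)-(16.4.4): skip tq1 entries not later than tq2[p].
        while q < len(tq1) and tq2[p] >= tq1[q]:
            q += 1
        if q == len(tq1):           # added bounds check: nothing left to match
            break
        ext_count += 1                              # step (16.4.5)
        p += 1
        # Steps (16.4.6)-(16.4.7): advance to the last tq2 entry before tq1[q].
        while p < len(tq2) and tq2[p] < tq1[q]:
            p += 1
        ext_tq.append(tq2[p - 1])                   # step (16.4.8)
    return ext_count, ext_tq


print(epi_extend([5, 9, 14], [2, 4, 7, 12, 20]))
```

In this example the timestamps 4, 7 and 12 in tq2 are each the latest entry preceding 5, 9 and 14 in tq1, so three occurrences are counted; the final entry 20 has no later partner in tq1 and is dropped.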
CN201110020156.2A 2011-01-18 2011-01-18 Method for mining frequency episode from event sequence by using same node chains and Hash chains Expired - Fee Related CN102073732B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110020156.2A CN102073732B (en) 2011-01-18 2011-01-18 Method for mining frequency episode from event sequence by using same node chains and Hash chains

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201110020156.2A CN102073732B (en) 2011-01-18 2011-01-18 Method for mining frequency episode from event sequence by using same node chains and Hash chains

Publications (2)

Publication Number Publication Date
CN102073732A true CN102073732A (en) 2011-05-25
CN102073732B CN102073732B (en) 2014-04-30

Family

ID=44032271

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110020156.2A Expired - Fee Related CN102073732B (en) 2011-01-18 2011-01-18 Method for mining frequency episode from event sequence by using same node chains and Hash chains

Country Status (1)

Country Link
CN (1) CN102073732B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1938702A (en) * 2004-04-27 2007-03-28 诺基亚公司 Processing data in a computerised system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
林树宽等: "《An Efficient Frequent Pattern Mining Algorithm for Data Stream》", 《INTELLIGENT COMPUTATION TECHNOLOGY AND AUTOMATION (ICICTA), 2008 INTERNATIONAL CONFERENCE》, 22 October 2008 (2008-10-22), pages 751 - 761 *
林树宽等: "基于2-情节矩阵和频繁情节树的串行情节挖掘", 《2010年中国计算机大会论文集》, 11 October 2010 (2010-10-11) *

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103744904B (en) * 2013-12-25 2018-02-16 北京京东尚科信息技术有限公司 A kind of method and device that information is provided
CN104008185A (en) * 2014-06-11 2014-08-27 西北工业大学 Frequent close scenario mining method based on same node table and scenario tree
CN104182528A (en) * 2014-08-27 2014-12-03 广西教育学院 Partial-sequence pattern based educational information course association pattern discovery method and system
CN104182528B (en) * 2014-08-27 2017-07-07 广西教育学院 IT application in education sector course association mode based on partial order pattern finds method and system
CN106055672A (en) * 2016-06-03 2016-10-26 西安电子科技大学 Method for mining frequent episodes of signal sequence with time constraint
CN106055672B (en) * 2016-06-03 2019-05-03 西安电子科技大学 A kind of signal sequence Frequent Episodes Mining with time-constrain
CN106203631B (en) * 2016-07-05 2019-04-30 中国科学院计算技术研究所 The parallel Frequent Episodes Mining and system of description type various dimensions sequence of events
CN106203631A (en) * 2016-07-05 2016-12-07 中国科学院计算技术研究所 The parallel Frequent Episodes Mining of description type various dimensions sequence of events and system
CN106294824A (en) * 2016-08-17 2017-01-04 广东工业大学 Manufacture Internet of Things towards the complex events detecting methods of uncertain data stream and system
CN106294824B (en) * 2016-08-17 2019-06-11 广东工业大学 Manufacture complex events detecting methods and system of the Internet of Things towards uncertain data stream
CN107562865A (en) * 2017-08-30 2018-01-09 哈尔滨工业大学深圳研究生院 Multivariate time series association rule mining method based on Eclat
WO2019041628A1 (en) * 2017-08-30 2019-03-07 哈尔滨工业大学深圳研究生院 Method for mining multivariate time series association rule based on eclat
CN107590231A (en) * 2017-09-06 2018-01-16 北京大有中城科技有限公司 A kind of implementation method for solving to be actually needed by platform things chain
CN108563757A (en) * 2018-04-16 2018-09-21 泰州学院 Pervasive sequence of events Frequent Episodes Mining
CN108563757B (en) * 2018-04-16 2021-05-28 泰州学院 Universal event sequence frequent plot mining method
CN109327311A (en) * 2018-08-03 2019-02-12 克洛斯比尔有限公司 A kind of Hash timestamp creation method, equipment and readable storage medium storing program for executing
CN111858925A (en) * 2020-06-04 2020-10-30 国家计算机网络与信息安全管理中心 Script extraction method and device for telecommunication network fraud event
CN111858925B (en) * 2020-06-04 2023-08-18 国家计算机网络与信息安全管理中心 Script extraction method and device of telecommunication phishing event

Also Published As

Publication number Publication date
CN102073732B (en) 2014-04-30

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20140430

Termination date: 20160118

EXPY Termination of patent right or utility model