CN109101530B - High-utility event sequence pattern mining method - Google Patents

Info

Publication number
CN109101530B
Authority
CN
China
Prior art keywords
utility
mode
mining
events
pattern
Prior art date
Legal status
Active
Application number
CN201810650504.6A
Other languages
Chinese (zh)
Other versions
CN109101530A (en
Inventor
张春慨
Current Assignee
Shenzhen Graduate School Harbin Institute of Technology
Original Assignee
Shenzhen Graduate School Harbin Institute of Technology
Application filed by Shenzhen Graduate School Harbin Institute of Technology filed Critical Shenzhen Graduate School Harbin Institute of Technology
Priority to CN201810650504.6A priority Critical patent/CN109101530B/en
Publication of CN109101530A publication Critical patent/CN109101530A/en
Application granted granted Critical
Publication of CN109101530B publication Critical patent/CN109101530B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a high-utility event sequence pattern mining method comprising the following steps: S1, defining a security event; S2, partitioning a transaction database; S3, incremental mining of high-utility security event sequences; and S4, parallelized incremental mining of high-utility security event sequences. The beneficial effects of the invention are that parallelization shortens the mining time and makes better use of hardware resources, high-utility event sequence patterns can be mined, and data mining is accelerated.

Description

High-utility event sequence pattern mining method
Technical Field
The invention relates to data mining, in particular to a high-utility event sequence pattern mining method.
Background
Current correlation analysis techniques for network security events mainly comprise analysis methods based on the probabilistic similarity between security events, correlation analysis methods based on the causal relationship between the results and prerequisites of behaviors, attack-graph-based methods, and methods based on data mining and machine learning. Of these, the methods based on data mining and machine learning are the most basic and the most effective. Association rule mining, a typical data mining method, is widely applied in correlation analysis models for network security events. With the arrival of the big data era, however, the range of applications of traditional association rule mining has become narrower and narrower, and many scholars have therefore proposed improved association rule mining algorithms.
At present, improvements to traditional association rule mining algorithms mainly target the purpose of the algorithm: they go beyond traditional rule mining, which handles only sets of transaction commodity data, and apply rule mining to applications with more complex conditions.
In research on association rule algorithms for time series, Povinelli et al. proposed a framework called Time Series Data Mining (TSDM). Zenghaiquan proposed a time series mining and similarity search technique based on a mutually associated successor tree model. Lushan proposed a nonlinear-dynamics financial time series prediction technique based on phase space reconstruction of nonlinear time series. Surveying current research, Montmann states that Time Series Data Mining (TSDM) can be defined more generally as extracting the internal rules of a time series from the series itself, for analysis and prediction of its values, periods and trends. Gasp et al. proposed a method for discovering rules from a time series. The method first uses the sliding window method proposed by Baltzersen to perform standardized preprocessing on the time series data, converting the time series into time series samples and completing the discretization and symbolization of the data; second, it clusters the standardized set of time series samples; third, it reconstructs the original time series data using the resulting classes; and finally, it mines rules from the reconstructed time series data set. However, the method merely applies data mining techniques to time series analysis mechanically: it considers neither the temporal characteristics of the series nor the background knowledge, and it provides no reasonable theoretical explanation. Han et al. used data mining techniques to study periodic and partially periodic segments of the time series in a time series database in order to discover periodic patterns (patterns that recur at fixed time intervals).
Currently, association rule mining is based on an existing data set, i.e., a given set of transactions. In mining based on event sequences, however, the event sequence must first be converted into a transaction set containing events. At present most conversions use a sliding window, but existing methods fix the window size as a number of events. This is clearly unreasonable for event sequences: even if the time interval between two events is large, they are placed in the same transaction, although two events far apart in time are only weakly associated or not associated at all. Such a partitioning method ignores this fact and forcibly introduces transactions containing both events. The partitioning method therefore needs to be improved.
In addition, because events are generated continuously, the transaction set produced by partitioning an event sequence also changes dynamically. When a new transaction set is added to the original transaction set, the traditional approach is to combine the two into one large transaction set and then mine the large set again with the previous method. This has a drawback: as the transaction set keeps growing, the mining time keeps growing too, until it becomes enormous or the mining cannot be completed at all. This approach does not make use of the patterns that have already been mined and instead mines from scratch each time, which is clearly unreasonable in practical applications.
For the mining of sequence patterns, most algorithms are serial, even though in most algorithms the sequence patterns mined earlier and later are not fundamentally related, i.e., the mining of some sequence patterns does not depend on other patterns. Therefore, how to use parallelization to shorten the mining time and make better use of hardware resources is a technical problem that those skilled in the art urgently need to solve.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a high-utility event sequence pattern mining method.
The invention provides a high-utility event sequence pattern mining method, which comprises the following steps:
S1, defining a security event;
S2, partitioning a transaction database;
S3, incremental mining of high-utility security event sequences;
and S4, parallelized incremental mining of high-utility security event sequences.
As a further improvement of the present invention, in step S1, different events are labeled with their attack type. To take into account the influence of the remaining attributes on an event, a utility value is calculated for each attribute value, and the utility values of the attributes are then accumulated to give the final utility value of the event. The utility value corresponding to each attribute value is assigned manually, so that events at different locations and with different IPs can be given different degrees of importance by changing the utility values.
As a further improvement of the present invention, in step S2, the event set is divided into transaction sets by means of a sliding window. A window represents the set of events from time $t_s$ to time $t_e$, and every window has the same time span, i.e., $(t_e - t_s)$ is the same for all windows.
As a further improvement of the present invention, in step S2, the time-ordered original events are divided using a sliding window with a fixed time interval. Events in the same window form a transaction sequence, and each time the window slides to the time point of the next event. If several events are close together in time, they are treated as simultaneous events, i.e., their order is not considered. When the merged events form the first item of the current window, the original utility values are multiplied by the number of merged events to obtain new utility values, which compensates for the effect of merging.
As a further improvement of the present invention, in step S3, let the original data set be D1 and the newly added data set be D2. The set of high-utility security event sequence patterns in the original data set D1 is HUSEP1, and the set of high-utility security event sequence patterns in the newly added data set D2 is HUSEP2. By definition, the minimum utility value of HUSEP1 is $\delta \times u(D_1)$ and the minimum utility value of HUSEP2 is $\delta \times u(D_2)$. The database formed by merging D1 with D2 is denoted D3, and its set of high-utility security event sequence patterns is HUSEP3. Clearly, the minimum utility value of HUSEP3 is greater than or equal to $\delta \times u(D_3) = \delta \times u(D_1) + \delta \times u(D_2)$.
As a further improvement of the present invention, in step S3, HUSEP3 is a subset of HUSEP1 ∪ HUSEP2, i.e., every pattern in HUSEP3 appears in at least one of HUSEP1 and HUSEP2. Suppose a pattern in HUSEP3 appeared in neither HUSEP1 nor HUSEP2, and let its utility values in D1, D2 and D3 be $u_1$, $u_2$ and $u_3$, respectively. By definition, $u_1 < \delta \times u(D_1)$ and $u_2 < \delta \times u(D_2)$, which gives $u_3 = u_1 + u_2 < \delta \times u(D_1) + \delta \times u(D_2) = \delta \times u(D_3)$. The pattern therefore should not be in HUSEP3, which is a contradiction. Hence HUSEP3 is a subset of HUSEP1 ∪ HUSEP2.
As a further improvement of the present invention, in step S3,
for the patterns in HUSEP1 and HUSEP2, there are 4 cases:
1) the pattern is a high-utility pattern in neither D1 nor D2;
2) the pattern is a high-utility pattern in both D1 and D2;
3) the pattern is a high-utility pattern in D1 but not in D2;
4) the pattern is not a high-utility pattern in D1 but is one in D2;
for case 1), the pattern is not a high-utility pattern in D3;
for case 2), the pattern is a high-utility pattern in D3:
by analogy with case 1), $u_1 \ge \delta \times u(D_1)$ and $u_2 \ge \delta \times u(D_2)$, therefore $u_3 = u_1 + u_2 \ge \delta \times u(D_1) + \delta \times u(D_2) = \delta \times u(D_3)$;
for cases 3) and 4), it cannot be deduced directly whether the pattern is a high-utility pattern in D3, and the utility value of the pattern in D3 must be calculated to decide;
for case 3), since the pattern is already a high-utility pattern in D1, it is only necessary to calculate the utility value of the pattern in D2 to decide;
for case 4), since the pattern is already a high-utility pattern in D2, it is only necessary to calculate the utility value of the pattern in D1 to decide.
As a further improvement of the present invention, in step S4, when mining with the HUSP-Miner algorithm, the candidate 1-itemsets whose utility upper bound is greater than the threshold are found first; on that basis, (k+1)-itemsets are generated from k-itemsets by sequence growth, and a pruning strategy is used to reduce the search space.
As a further improvement of the present invention, the pruning strategy reduces the search space by continuously shrinking the database: the database is first read into memory, and as a pattern grows, the set of transactions containing the pattern keeps shrinking, i.e., the corresponding projected database keeps getting smaller. Since the database is not modified during mining, the mining of each pattern can be regarded as independent once the database is given.
As a further improvement of the present invention, in step S4, parallel mining is performed in a multi-thread manner:
1) when thread I finishes its mining task, thread I enters a waiting state;
2) if thread J has not finished mining at that moment, the pattern that thread J currently needs to process is handed over to thread I, and thread J goes on to its next pending pattern.
The invention has the beneficial effects that: by the scheme, the mining time can be shortened by adopting parallelization, hardware resources are better utilized, the mining of a high-utility event sequence mode is realized, and the data mining speed is increased.
Drawings
Fig. 1 is a schematic diagram of dividing security events based on time in the high-utility event sequence pattern mining method of the present invention.
FIG. 2 is a graph showing the results of experiment one.
FIG. 3 is a graph showing the results of experiment two.
FIG. 4 is a graph showing the results of experiment three.
Detailed Description
The invention is further described with reference to the following description and embodiments in conjunction with the accompanying drawings.
A high-utility event sequence pattern mining method comprises the following steps:
One, definition of a security event
Before studying pattern mining of network security events, the definition of a network security event, i.e., its attributes, is given first. Based on past experience, the present invention extracts the following typical attributes to define a network security event. Table 1 gives the definition of a network security event, and Table 2 lists common types of network attack.
Table 1 Security event attributes (table image not reproduced)
Table 2 Common attack types (table image not reproduced)
Different events are labeled with their attack type. To take into account the influence of the remaining attributes on an event, a utility value is calculated for each attribute value, and the utility values of the attributes are then accumulated to give the final utility value of the event. The utility value corresponding to each attribute value is assigned manually, so that events at different locations and with different IPs can be given different degrees of importance by changing the utility values.
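The accumulation of attribute utilities described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the attribute names, the utility tables and every numeric value are hypothetical, since the text only states that the values are assigned manually.

```python
# Hypothetical, manually assigned utility tables (all values are illustrative
# assumptions; the patent does not give concrete numbers).
ATTRIBUTE_UTILITY = {
    "location": {"l1": 3.0, "l2": 1.0},
    "source_ip": {"s1": 2.0, "s2": 5.0},
    "destination_ip": {"d1": 4.0, "d2": 1.5},
}


def event_utility(event, tables=ATTRIBUTE_UTILITY, default=0.0):
    """Sum the utilities of an event's attribute values.

    Attribute values without an assigned utility contribute `default`.
    """
    return sum(
        table.get(event.get(attribute), default)
        for attribute, table in tables.items()
    )


# The attack type only labels the event; the other attributes set its utility.
e1 = {"attack_type": "e1", "location": "l1",
      "source_ip": "s1", "destination_ip": "d1"}
print(event_utility(e1))  # 3.0 + 2.0 + 4.0 = 9.0
```

Raising the value stored for a particular location or IP in the tables is then enough to give the corresponding events more weight in later mining.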
Two, partitioning of the transaction database
Because what is obtained is an event set, a traditional pattern mining algorithm cannot be applied to it directly; the set must first be converted into a transaction set suitable for pattern mining. Note that the representation of a transaction in conventional high-utility sequence pattern mining differs slightly from that in high-utility sequence pattern mining based on security events. In conventional high-utility pattern mining, the utility of an item is determined by both an internal and an external utility. Internal utility generally refers to quantity; external utility generally refers to the profit of the item and is the same for every occurrence of that item. Because each event is affected by its other attributes, the utility value may differ between events with different IDs. As a result, transactions in security-event-based high-utility sequence mining differ in form from those in conventional high-utility sequence mining. For example, consider the transaction <[(e1:ue1)],[(e2:ue2)]>. In security event mining, ue1 and ue2 are the utility values of the corresponding events, whereas in conventional high-utility sequence pattern mining ue1 and ue2 are the numbers of times the items e1 and e2 appear in the corresponding itemsets. Although the forms differ, conventional high-utility sequence pattern mining algorithms can still be applied to the transaction database generated from security events: the internal and external utilities in conventional mining are used only to calculate the utility value of an item and otherwise have no substantial effect on the mining process.
The event set is divided into transaction sets by means of a sliding window. A window represents the set of events from time $t_s$ to time $t_e$, and every window has the same time span, i.e., $(t_e - t_s)$ is the same for all windows. The detailed steps of event partitioning are illustrated below using the set of security events D1 shown in Table 3.
Attack ID   Time(s)   Location   Source IP   Destination IP
e1          t1        l1         s1          d1
e2          t2        l2         s2          d2
e3          t3        l1         s7          d2
e4          t4        l1         s2          d1
e5          t5        l2         s5          d1
e6          t6        l2         s5          d1
Table 3 Set of security events D1
The events sorted by time are shown in Fig. 1, and the original events are divided using a sliding window with a fixed time interval. Events in the same window form a transaction sequence, and each time the window slides to the time point of the next event. If several events are close together in time, they are treated as simultaneous events, i.e., their order is not considered. When the merged events form the first item of the current window, the original utility values are multiplied by the number of merged events to obtain new utility values, which compensates for the effect of merging. Take e4 and e5 in Fig. 1 as an example: because they are close together, they can be regarded as simultaneous. The transaction produced by the sliding window is <[(e4:ue4)(e5:ue5)],[(e6:ue6)]>. Taking the merging effect into account, the original utility values are multiplied by the number of merged events, so the processed transaction is <[(e4:2*ue4)(e5:2*ue5)],[(e6:2*ue6)]>.
Table 4 Partitioned transaction sets (table image not reproduced)
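The time-based partitioning can be sketched roughly as below. Several points are assumptions rather than statements from the patent: events are `(event_id, time, utility)` tuples, "close together" is modeled by a hypothetical `merge_gap` parameter, a window starts at each group of simultaneous events, and the merge factor is applied to every utility in the transaction because that is what the e4/e5/e6 example shows.

```python
def group_simultaneous(events, merge_gap):
    """Group (event_id, time, utility) tuples into itemsets of simultaneous events.

    An event joins the current group when it is at most `merge_gap` after
    the previous event (an assumed reading of "close together").
    """
    groups = []
    for event in sorted(events, key=lambda e: e[1]):
        if groups and event[1] - groups[-1][-1][1] <= merge_gap:
            groups[-1].append(event)
        else:
            groups.append([event])
    return groups


def partition(events, window_span, merge_gap):
    """Slide a fixed-time-span window over the events and emit one transaction per start."""
    groups = group_simultaneous(events, merge_gap)
    transactions = []
    for i, first in enumerate(groups):
        start = first[0][1]
        # Keep every group that begins within window_span of the window start.
        window = [g for g in groups[i:] if g[0][1] - start <= window_span]
        # If the first itemset merges n events, scale the utilities by n.
        factor = len(first)
        transactions.append(
            [[(event_id, factor * utility) for event_id, _, utility in g]
             for g in window]
        )
    return transactions


events = [("e4", 10.0, 3), ("e5", 10.5, 4), ("e6", 13.0, 5)]
for transaction in partition(events, window_span=5.0, merge_gap=1.0):
    print(transaction)
# [[('e4', 6), ('e5', 8)], [('e6', 10)]]
# [[('e6', 5)]]
```

Unlike a window defined by a fixed number of events, two events that are far apart in time never end up in the same transaction here, however few events lie between them.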
After the security events have been partitioned into a transaction database, the database can be mined with an existing high-utility sequence pattern mining algorithm. The HUSP-Miner algorithm is used here and is not described in further detail.
Three, incremental high-utility security event sequence mining
In practical applications, security events are generated in real time, so the partitioned database also grows dynamically. If such a dynamically growing database were mined again from scratch every time its contents were updated, a great deal of resources would be consumed. Moreover, as the database keeps growing, the original mining algorithm may even fail to produce results because the database is too large. It is therefore necessary to find the relationship between the original data set and the new data set in order to simplify the mining process.
Let the original data set be D1 and the newly added data set be D2. The set of high-utility security event sequence patterns in the original data set D1 is HUSEP1, and the set of high-utility security event sequence patterns in the newly added data set D2 is HUSEP2. By definition, the minimum utility value of HUSEP1 is $\delta \times u(D_1)$ and the minimum utility value of HUSEP2 is $\delta \times u(D_2)$. The database formed by merging D1 with D2 is denoted D3, and its set of high-utility security event sequence patterns is HUSEP3. Clearly, the minimum utility value of HUSEP3 is greater than or equal to $\delta \times u(D_3) = \delta \times u(D_1) + \delta \times u(D_2)$.
It can further be derived that HUSEP3 of D3 is a subset of HUSEP1 ∪ HUSEP2, i.e., every pattern in HUSEP3 appears in at least one of HUSEP1 and HUSEP2. Suppose a pattern in HUSEP3 appeared in neither HUSEP1 nor HUSEP2, and let its utility values in the databases D1, D2 and D3 be $u_1$, $u_2$ and $u_3$, respectively. By definition, $u_1 < \delta \times u(D_1)$ and $u_2 < \delta \times u(D_2)$, which gives
$u_3 = u_1 + u_2 < \delta \times u(D_1) + \delta \times u(D_2) = \delta \times u(D_3)$. The pattern therefore should not be in HUSEP3, which is a contradiction. Hence HUSEP3 is a subset of HUSEP1 ∪ HUSEP2.
For the pattern in HUSEP1 and the pattern in HUSEP2, there are 4 cases:
1) The pattern is a high-utility pattern in neither D1 nor D2
2) The pattern is a high-utility pattern in both D1 and D2
3) The pattern is a high-utility pattern in D1 but not in D2
4) The pattern is not a high-utility pattern in D1 but is one in D2
For case 1), the pattern is certainly not a high-utility pattern in D3. The proof is the same as the proof that HUSEP3 is a subset of HUSEP1 ∪ HUSEP2 and is not repeated here.
For case 2), the pattern must be a high-utility pattern in D3. By analogy with case 1), $u_1 \ge \delta \times u(D_1)$ and $u_2 \ge \delta \times u(D_2)$, therefore
$u_3 = u_1 + u_2 \ge \delta \times u(D_1) + \delta \times u(D_2) = \delta \times u(D_3)$.
For cases 3) and 4), it cannot be directly deduced whether the pattern is a high utility pattern in D3, and the utility value of the pattern in D3 needs to be calculated for judgment.
For case 3), since the pattern is already a high utility pattern in D1, it is only necessary to calculate the utility value of the pattern in D2 for judgment.
Similarly, for case 4), since the pattern is already a high utility pattern in D2, it is only necessary to calculate the utility value of the pattern in D1 for judgment.
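The four cases can be turned into a merge step along the following lines. This is a sketch under stated assumptions: patterns are hashable values, `husep1` and `husep2` map each already-mined pattern to its utility in D1 and D2, and `utility_in_d1` / `utility_in_d2` are hypothetical helper callables that compute a pattern's utility in one data set when it was not high-utility there.

```python
def merge_incremental(husep1, husep2, utility_in_d1, utility_in_d2,
                      total_u1, total_u2, delta):
    """Return the high-utility patterns of D3 = D1 + D2 with their utilities.

    Only patterns in HUSEP1 or HUSEP2 are examined, because HUSEP3 is a
    subset of their union (case 1 can never qualify).
    """
    threshold3 = delta * (total_u1 + total_u2)  # delta*u(D3) = delta*u(D1) + delta*u(D2)
    husep3 = {}
    for pattern in husep1.keys() | husep2.keys():
        # Cases 2-4: reuse the known utility; compute only the missing one.
        u1 = husep1[pattern] if pattern in husep1 else utility_in_d1(pattern)
        u2 = husep2[pattern] if pattern in husep2 else utility_in_d2(pattern)
        if u1 + u2 >= threshold3:
            husep3[pattern] = u1 + u2
    return husep3


# Illustrative numbers only: u(D1) = 1000, u(D2) = 500, delta = 0.1.
husep1 = {("a",): 120, ("a", "b"): 100}   # utilities >= 100 in D1
husep2 = {("a",): 60, ("c",): 55}         # utilities >= 50 in D2
other_d1 = {("c",): 90}                   # below the D1 threshold
other_d2 = {("a", "b"): 40}               # below the D2 threshold

husep3 = merge_incremental(
    husep1, husep2,
    lambda p: other_d1.get(p, 0), lambda p: other_d2.get(p, 0),
    total_u1=1000, total_u2=500, delta=0.1,
)
print(husep3)  # {('a',): 180}
```

In this example `("a",)` falls under case 2 and needs no extra database scan, while `("a", "b")` (case 3) and `("c",)` (case 4) each require one utility computation in the other data set and both end up below the merged threshold of 150.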
Definition 1: the utility value of item $i_j$ in the q-itemset $v$ is defined as
$u(i_j, v) = q(i_j, v) \times pr(i_j)$
where $q(i_j, v)$ is the quantity of $i_j$ in $v$ and $pr(i_j)$ is the profit of $i_j$.
Definition 2: the utility value of the q-itemset $v$ is defined as
(formula image not reproduced)
Definition 3: the utility value of the q-sequence $s$ is defined as
(formula image not reproduced)
Definition 4: given the q-sequence $s = \langle v_1, v_2, \ldots, v_d \rangle$ and the sequence $t = \langle w_1, w_2, \ldots, w_r \rangle$, if $d \le r$ and $v_k$ and $w_k$ contain identical items for every $1 \le k \le d$, then $s$ is a match of $t$, denoted $s \sim t$.
Definition 5: the utility value of the sequence $t$ in the q-sequence $s$ is defined as
(formula image not reproduced)
where $t \sim s_k$ denotes that $s_k$ is a match of $t$.
Definition 6: the utility value of the sequence $t$ in the quantitative database $D$ is
(formula image not reproduced)
Definition 7: the utility value of the quantitative database $D$ is defined as
(formula image not reproduced)
Definition 8: if the utility value of the sequence $t$ in the quantitative database $D$ is not lower than the user-defined minimum threshold $\delta \times u(D)$, then $t$ is a high-utility sequence pattern (HUSP), denoted
$\mathrm{HUSP} \leftarrow \{\, t \mid u(t) \ge \delta \times u(D) \,\}$
Based on the above definitions, high utility sequence pattern mining can be defined as: given the quantitative sequence database D and the minimum utility threshold delta (decimal between 0 and 1), finding out all sequence patterns with utility values not lower than delta x u (D).
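The formulas for Definitions 2, 3, 5, 6 and 7 were supplied as images and are not available in this text. The block below is only a plausible reconstruction, assuming the conventions commonly used in the high-utility sequential pattern mining literature (in particular, taking the maximum over all matches in Definition 5); it should not be read as the patent's verified wording.

```latex
% Assumed reconstruction, not taken from the patent text.
\begin{align*}
u(v)    &= \sum_{i_j \in v} u(i_j, v)                                   && \text{(Definition 2)} \\
u(s)    &= \sum_{v \in s} u(v)                                          && \text{(Definition 3)} \\
u(t, s) &= \max \{\, u(s_k) \mid s_k \subseteq s,\ t \sim s_k \,\}      && \text{(Definition 5)} \\
u(t)    &= \sum_{s \in D} u(t, s)                                       && \text{(Definition 6)} \\
u(D)    &= \sum_{s \in D} u(s)                                          && \text{(Definition 7)}
\end{align*}
```

Under this reading, $u(D)$ is additive over disjoint sets of q-sequences, which is consistent with the identity $\delta \times u(D_3) = \delta \times u(D_1) + \delta \times u(D_2)$ used in the incremental mining section.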
Four, parallelized incremental high-utility security event sequence mining
When mining with the HUSP-Miner algorithm, the candidate 1-itemsets whose utility upper bound is greater than the threshold are found first; on that basis, (k+1)-itemsets are generated from k-itemsets by sequence growth (there are two growth modes), and a suitable pruning strategy is used to reduce the search space. One pruning strategy reduces the search space by continuously shrinking the database: the algorithm first reads the database into memory, and as a pattern grows, the set of transactions containing the pattern keeps shrinking, i.e., the corresponding projected database keeps getting smaller. Since the database is not modified during mining, the mining of each pattern can be regarded as independent once the database is given.
Therefore, once all candidate 1-itemsets have been found, they can be divided among threads and mined in parallel. Note that because each 1-itemset can generate a different number of high-utility patterns, some threads may finish very early while others run much longer. The total running time may then be far from the expected running time because of the differences in execution time between threads. To address this, the following improvements are made:
1) When thread I finishes its mining task, thread I enters a waiting state.
2) If thread J has not finished mining at that moment, the pattern that thread J currently needs to process is handed over to thread I, and thread J goes on to its next pending pattern.
Through the strategy, the loads among the threads can be relatively balanced, and the mining time is effectively reduced.
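One simple way to obtain this kind of load balancing is a shared work queue, sketched below. This is a substitute technique, not the explicit thread-to-thread hand-off the text describes: every idle thread simply pulls the next pending candidate 1-itemset, which has a similar balancing effect. The function `mine_seed` is a hypothetical stand-in for mining all patterns grown from one candidate.

```python
import queue
import threading


def mine_in_parallel(seeds, mine_seed, num_threads=4):
    """Mine each candidate 1-itemset independently using a shared work queue.

    A thread that finishes early keeps taking pending seeds instead of
    idling, so no thread is left with a long private backlog.
    """
    pending = queue.Queue()
    for seed in seeds:
        pending.put(seed)

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                seed = pending.get_nowait()
            except queue.Empty:
                return  # nothing left to mine
            found = mine_seed(seed)  # read-only access to the shared database
            with lock:
                results[seed] = found

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


def fake_mine(seed):
    """Placeholder miner: pretends each seed yields two patterns."""
    return [seed, seed + "x"]


print(sorted(mine_in_parallel(["a", "b", "c"], fake_mine).items()))
# [('a', ['a', 'ax']), ('b', ['b', 'bx']), ('c', ['c', 'cx'])]
```

The sketch relies on the observation above that the database is never modified during mining, so the only shared mutable state is the result dictionary. In CPython, threads do not speed up CPU-bound work because of the global interpreter lock, so this illustrates the scheduling idea only.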
Results and analysis of the experiments
The experimental data set is derived from a randomly generated set of events, partitioned according to the method given above. The partitioned sequence set contains 9752 transactions and 1000 distinct types of event. The experiments are divided into three parts. As shown in Fig. 2, experiment one varies the size of the data set with δ held constant; as shown in Fig. 3, experiment two varies δ with the data set unchanged; as shown in Fig. 4, experiment three compares the multi-threaded mining algorithm with the single-threaded one.
Experiments one and two test the incremental database. Increment 1 means that the original data set and the newly added data set are mined separately and the two results are merged using the method introduced above. Increment 2 means that the result already obtained from mining the original database is merged with the result of mining the newly added data set. In experiment one, the newly added data set is fixed: it is the data set generated earlier for the experiment. The original data set is spliced together from the generated data (i.e., the original transaction set is copied several times and the copies are concatenated; the original transaction set itself is not spliced or repeated), so that the size of the original data set is 1, 2, 3 and 4 times that of the generated data set, respectively.
In experiment two, when δ is 0.0005 the Increment 1 method is slower than the original method, because with a small δ the number of high-utility patterns obtained by mining is large. A fairly ordinary merging method is used here, so merging may take a certain amount of time. The results of experiments one and two show that when the original mining result is known, applying the incremental mining algorithm to the newly added data set is much faster than merging the old and new data sets and mining them again.
Experiment three is a comparison of a mining algorithm using multiple threads and a mining algorithm using a single thread, where four threads are used. The experimental data set was unchanged and the two methods were compared by varying δ. As can be seen from the figure, the smaller the δ, the more distinct the difference between the two. This is because the smaller δ, the more modes that need to be considered, the more computation is performed, and the advantages of multithreading can be better displayed. Therefore, when the data volume is large, it can be considered to adopt a multi-thread mode to accelerate the mining speed.
The following table shows part of the mining results. The data set is formed by splicing together four original event sets, δ is 0.0008, and the corresponding threshold is 7273. The result of the mining is the set of high-utility sequence patterns whose utility values are greater than the specified threshold, where each entry corresponds to an event ID. Taking the third entry in the table as an example, the mined pattern is [ (132) ], [ (577) ], [ (936) ], [ (825) ], [ (531) ], [ (646) ], [ (24) ], [ (505) (644) ], [ (710) ], which indicates that these events may be related to one another. Note that the two events with IDs 505 and 644 are concurrent: they may not be related to each other, but they may be related to the subsequent events. The high-utility event sequence patterns obtained by mining can be studied further to discover potential associations between the events.
Table 5 Partial mining results (table image not reproduced)
The foregoing is a more detailed description of the invention in connection with specific preferred embodiments and it is not intended that the invention be limited to these specific details. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all shall be considered as belonging to the protection scope of the invention.

Claims (4)

1. A high-utility event sequence pattern mining method is characterized by comprising the following steps:
S1, defining a security event;
S2, partitioning a transaction database;
S3, incremental mining of high-utility security event sequences;
S4, parallelized incremental mining of high-utility security event sequences;
in step S1, different events are labeled with their attack type; to take into account the influence of the remaining attributes on an event, a utility value is calculated for each attribute value, and the utility values of the attributes are then accumulated to give the final utility value of the event; the utility value corresponding to each attribute value is assigned manually, so that events at different locations and with different IPs can be given different degrees of importance by changing the utility values;
in step S2, the event set is divided into transaction sets by means of a sliding window, where a window represents the set of events from time $t_s$ to time $t_e$ and every window has the same time span, i.e., $(t_e - t_s)$ is the same for all windows;
in step S2, the time-ordered original events are divided using a sliding window with a fixed time interval; events in the same window form a transaction sequence, and each time the window slides to the time point of the next event; if several events are close together in time, they are treated as simultaneous events, i.e., their order is not considered; when the merged events form the first item of the current window, the original utility values are multiplied by the number of merged events to obtain new utility values, which compensates for the effect of merging;
in step S3, let the original data set be D1The newly added data set is D2The high-utility security event sequence pattern set in the original data set D1 is HUSEP1, and the new data set D2The high and medium utility safety time sequence mode is HUSEP 2; by definition: the minimum utility value of HUSEP1 is equal to or greater than a user-defined minimum threshold value δ × u (D)1) HUSEP2 has a minimum utility value equal to or greater than a user-defined minimum threshold value δ × u (D)2) (ii) a The database formed by merging the original data set D1 with the new added data set D2 is denoted as D3, the high-utility security event sequence pattern set of the database D3 formed by merging the original data set D1 with the new added data set D2 is HUSEP3, and obviously, the minimum utility value of HUSEP3 is more than or equal to the minimum threshold value delta x u (D3) defined by the user3)=δ×u(D1)+δ×u(D2);
in step S3, HUSEP3 is a subset of HUSEP1 ∪ HUSEP2, i.e. every pattern in HUSEP3 appears in at least one of HUSEP1 and HUSEP2; suppose instead that a pattern in HUSEP3 appears in neither HUSEP1 nor HUSEP2, and let its utility values in D1, D2 and D3 be u1, u2 and u3 respectively; by definition, u1 < δ × u(D1) and u2 < δ × u(D2), from which it follows that u3 = u1 + u2 < δ × u(D1) + δ × u(D2) = δ × u(D3), so the pattern cannot be in HUSEP3, which is a contradiction; therefore HUSEP3 is a subset of HUSEP1 ∪ HUSEP2;
in step S3, a pattern falls into one of 4 cases with respect to HUSEP1 and HUSEP2:
1) the pattern is not a high-utility pattern in either D1 or D2;
2) the pattern is a high-utility pattern in both D1 and D2;
3) the pattern is a high-utility pattern in D1 and is not a high-utility pattern in D2;
4) the pattern is not a high-utility pattern in D1 and is a high-utility pattern in D2;
for case 1), the pattern is not a high-utility pattern in D3, since u1 < δ × u(D1) and u2 < δ × u(D2) give u3 = u1 + u2 < δ × u(D3);
for case 2), the pattern is a high-utility pattern in D3;
by analogy with case 1), u1 ≥ δ × u(D1) and u2 ≥ δ × u(D2), therefore u3 = u1 + u2 ≥ δ × u(D1) + δ × u(D2) = δ × u(D3);
for cases 3) and 4), it cannot be deduced directly whether the pattern is a high-utility pattern in D3, and the utility value of the pattern in D3 must be calculated to decide;
for case 3), since the utility value of the pattern in D1 is already known, only its utility value in D2 needs to be calculated to decide;
for case 4), since the utility value of the pattern in D2 is already known, only its utility value in D1 needs to be calculated to decide.
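The four-case analysis of step S3 can be sketched as a merge routine. This is an illustrative sketch, not the patented implementation: the function name, the dictionary representation of HUSEP1 and HUSEP2 (pattern mapped to its utility), and the two callbacks `utility_in_d1` and `utility_in_d2` that rescan a data set for one pattern are all hypothetical. It relies on the relation u3 = u1 + u2 stated above.

```python
def merge_high_utility(husep1, husep2, utility_in_d1, utility_in_d2,
                       total_u1, total_u2, delta):
    """Derive HUSEP3 for D3 = D1 + D2 from HUSEP1 and HUSEP2."""
    threshold3 = delta * (total_u1 + total_u2)  # delta * u(D3)
    husep3 = {}
    # Case 1 patterns are in neither set and are never examined.
    for pattern in set(husep1) | set(husep2):
        if pattern in husep1 and pattern in husep2:
            # Case 2: high utility in both, so high utility in D3 without a rescan.
            husep3[pattern] = husep1[pattern] + husep2[pattern]
            continue
        if pattern in husep1:
            # Case 3: only the utility in D2 has to be computed.
            u3 = husep1[pattern] + utility_in_d2(pattern)
        else:
            # Case 4: only the utility in D1 has to be computed.
            u3 = utility_in_d1(pattern) + husep2[pattern]
        if u3 >= threshold3:
            husep3[pattern] = u3
    return husep3


# Toy utilities of five patterns in D1 and D2 (illustrative numbers only).
d1 = {"ab": 25, "bc": 30, "cd": 21, "de": 5, "ef": 19}
d2 = {"ab": 12, "bc": 3, "cd": 2, "de": 11, "ef": 15}
delta, total_u1, total_u2 = 0.5, 40, 20
husep1 = {p: u for p, u in d1.items() if u >= delta * total_u1}
husep2 = {p: u for p, u in d2.items() if u >= delta * total_u2}
husep3 = merge_high_utility(husep1, husep2, d1.get, d2.get,
                            total_u1, total_u2, delta)
print(sorted(husep3.items()))  # [('ab', 37), ('bc', 33), ('ef', 34)]
```

In the toy data, "cd" and "de" are each high utility in one data set but fall below δ × u(D3) = 30 after merging, which is why cases 3) and 4) require the extra utility calculation.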
2. The high-utility event sequence pattern mining method according to claim 1, characterized in that: in step S4, in the mining process using the HUSP-Miner algorithm, the candidate 1-itemsets whose utility upper bound is greater than the threshold are found first; on this basis, (k+1)-itemsets are generated from k-itemsets by sequence growth, and a pruning strategy is used to reduce the search space.
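The grow-and-prune idea can be illustrated with a generic sketch. This is not the HUSP-Miner algorithm, whose details are not given here; it swaps in the sequence-weighted utility (SWU), a commonly used upper bound, and assumes each sequence is a list of single `(item, utility)` events, that a pattern's utility in a sequence is the maximum over its occurrences, and that the threshold is positive. All names are hypothetical.

```python
def pattern_utility(sequence, pattern):
    """Maximum utility of `pattern` as a subsequence of `sequence`, or None."""
    best = [0] + [None] * len(pattern)  # best[j]: best utility matching pattern[:j]
    for item, utility in sequence:
        # Go backwards so one event is not used twice within the same match.
        for j in range(len(pattern), 0, -1):
            if pattern[j - 1] == item and best[j - 1] is not None:
                candidate = best[j - 1] + utility
                if best[j] is None or candidate > best[j]:
                    best[j] = candidate
    return best[len(pattern)]


def mine(database, min_utility):
    """Depth-first sequence growth with SWU pruning (min_utility must be > 0)."""
    items = sorted({item for sequence in database for item, _ in sequence})
    totals = [sum(u for _, u in sequence) for sequence in database]
    results = {}

    def grow(prefix):
        for item in items:
            candidate = prefix + (item,)
            utilities = [pattern_utility(s, candidate) for s in database]
            # SWU: total utility of every sequence that contains the candidate.
            swu = sum(t for t, u in zip(totals, utilities) if u is not None)
            if swu < min_utility:
                continue  # no extension of this candidate can reach the threshold
            utility = sum(u for u in utilities if u is not None)
            if utility >= min_utility:
                results[candidate] = utility
            grow(candidate)  # extend the k-pattern to (k+1)-patterns

    grow(())
    return results


database = [
    [("a", 2), ("b", 3), ("a", 4)],
    [("a", 1), ("b", 5)],
    [("b", 2), ("c", 6)],
]
print(mine(database, min_utility=8))
```

Because the SWU of an extended pattern can never exceed that of its prefix, a candidate whose SWU is below the threshold is discarded together with all of its extensions.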
3. The high-utility event sequence pattern mining method according to claim 2, characterized in that: the pruning strategy reduces the search space by continually shrinking the database: the database is first read into memory, and as a pattern grows, the set of transactions containing the pattern keeps shrinking, i.e. the corresponding projected database keeps shrinking; since the database is not modified during mining, the mining process of each pattern can be regarded as independent once the database is given.
4. The high-utility event sequence pattern mining method according to claim 3, characterized in that: in step S4, parallel mining is performed with multiple threads:
1) when thread I finishes its mining task, thread I enters a waiting state;
2) if thread J has not finished mining at that time, the pattern that thread J currently needs to process is handed over to thread I for processing, and thread J proceeds to its next pending pattern.
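The multithreaded scheme can be approximated with a shared work queue, sketched below. This is a simplification under stated assumptions: instead of thread J handing a pattern directly to the waiting thread I, every pending pattern sits in one queue and any idle thread takes the next one, which has the same effect of keeping finished threads busy. `mine_prefix` is a hypothetical callback that mines all patterns growing from one prefix and returns a dictionary of results.

```python
import queue
import threading


def parallel_mine(prefixes, mine_prefix, num_threads=4):
    """Mine each prefix independently, letting idle threads pick up pending work."""
    tasks = queue.Queue()
    for prefix in prefixes:
        tasks.put(prefix)
    results = {}
    results_lock = threading.Lock()

    def worker():
        while True:
            try:
                prefix = tasks.get_nowait()  # an idle thread takes the next pattern
            except queue.Empty:
                return  # nothing left to process
            found = mine_prefix(prefix)  # the database is only read, never modified
            with results_lock:
                results.update(found)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


def toy_mine_prefix(prefix):
    """Stand-in for a real miner: reports the prefix with a dummy utility."""
    return {(prefix,): len(prefix)}


print(sorted(parallel_mine(["a", "bb", "ccc"], toy_mine_prefix, num_threads=2).items()))
```

The sketch depends on the independence noted in claim 3: because the database is read-only during mining, the only shared mutable state is the result dictionary, which is guarded by a lock. In CPython, threads do not speed up CPU-bound pure-Python work, so a process pool may be the better fit there.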
CN201810650504.6A 2018-06-22 2018-06-22 High-utility event sequence pattern mining method Active CN109101530B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810650504.6A CN109101530B (en) 2018-06-22 2018-06-22 High-utility event sequence pattern mining method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810650504.6A CN109101530B (en) 2018-06-22 2018-06-22 High-utility event sequence pattern mining method

Publications (2)

Publication Number Publication Date
CN109101530A CN109101530A (en) 2018-12-28
CN109101530B true CN109101530B (en) 2021-09-21

Family

ID=64844854

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810650504.6A Active CN109101530B (en) 2018-06-22 2018-06-22 High-utility event sequence pattern mining method

Country Status (1)

Country Link
CN (1) CN109101530B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109857758A (en) * 2018-12-29 2019-06-07 天津南大通用数据技术股份有限公司 A kind of association analysis method and system based on neighbours' window
CN113886396B (en) * 2021-10-20 2022-03-29 电子科技大学 Power system fault detection method and system based on high-utility frequent pattern mining
CN115964415B (en) * 2023-03-16 2023-05-26 山东科技大学 Pre-HUSPM-based database sequence insertion processing method

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106777182A (en) * 2016-12-23 2017-05-31 陕西理工学院 A kind of data flow effective item set mining algorithm for reducing candidate

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Distributed and Parallel High Utility Sequential Pattern Mining;Morteza Zihayat等;《2016 IEEE International Conference on Big Data (Big Data)》;20161231;853-862 *
Efficiently updating the discovered high average-utility itemsets with transaction insertion;Jerry Chun-Wei Lin等;《Engineering Applications of Artificial Intelligence》;20180630;140-143 *
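Research on Network Fault Alarm Correlation Based on Data Mining; Xu Qianfang; 《China Doctoral Dissertations Full-text Database, Information Science and Technology (Monthly)》; 20071016 (No. 5); I136-13 *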

Also Published As

Publication number Publication date
CN109101530A (en) 2018-12-28

Similar Documents

Publication Publication Date Title
Duong et al. Efficient high utility itemset mining using buffered utility-lists
Lin et al. Mining high utility itemsets in big data
Lin et al. Efficient closed high-utility pattern fusion model in large-scale databases
CN109101530B (en) High-utility event sequence pattern mining method
Xu et al. Distributed formal concept analysis algorithms based on an iterative MapReduce framework
Duong et al. An efficient method for mining frequent itemsets with double constraints
Han et al. Efficient top-k high utility itemset mining on massive data
Sumalatha et al. Distributed mining of high utility time interval sequential patterns using mapreduce approach
Lin et al. Efficient chain structure for high-utility sequential pattern mining
Wang et al. An efficient algorithm of frequent itemsets mining based on mapreduce
Liu et al. Incremental mining of high utility patterns in one phase by absence and legacy-based pruning
Demri et al. Two-variable separation logic and its inner circle
Abbasghorbani et al. Survey on sequential pattern mining algorithms
Lin et al. Mining high-utility sequential patterns from big datasets
Song et al. Parallel incremental association rule mining framework for public opinion analysis
Outrata A Lattice-Free Concept Lattice Update Algorithm based on* CbO.
Davoodabadi et al. A new method for discovering subgoals and constructing options in reinforcement learning.
Vu et al. FTKHUIM: A Fast and Efficient Method for Mining Top-K High-Utility Itemsets
Lin et al. Mining of high average-utility patterns with item-level thresholds
Grahne et al. Computing NFA Intersections in Map-Reduce.
Filou et al. Towards proved distributed algorithms through refinement, composition and local computations
Wang et al. Using a projection-based approach to mine frequent inter-transaction patterns
Xiao et al. PSON: a parallelized SON algorithm with MapReduce for mining frequent sets
CN111026862B (en) Incremental entity abstract method based on formal concept analysis technology
Soulet et al. Exact and approximate minimal pattern mining

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant