Disclosure of Invention
The first object of the present invention is to overcome the drawbacks of existing event log sampling methods, which either cannot process a large-scale event log or process it with low efficiency, and to provide a business process event log sampling method. The method takes a large-scale event log as input and produces a sufficiently representative sample log that is much smaller than the original log and can therefore be processed more efficiently.
A second object of the present invention is to provide a business process event log sampling system.
A third object of the present invention is to provide a storage medium.
A fourth object of the present invention is to provide a computing device.
The first object of the invention is achieved by the following technical scheme: a business process event log sampling method, comprising the following steps:
1) Acquiring three sets from the event log, namely the log's directly-follows activity relation set, the start point set and the end point set;
2) Judging, according to the three sets obtained in step 1), whether the intersection of the trace start point and the start point set, the intersection of the trace end point and the end point set, and the intersection of the trace's directly-follows activity relation set and the log's directly-follows activity relation set are empty sets; if the judgment results are all empty sets, finishing the trace traversal of the event log and outputting the sample log; if the judgment results are not all empty sets, selecting any one of four event log sampling methods, namely a full traversal sampling method, a set coverage sampling method, a sampling method based on trace length and a sampling method based on trace frequency;
3) Selecting traces according to the event log sampling method selected in step 2) to form a new log, the new log being the sample log.
Further, in step 1), the event log is composed of cases, each case is composed of events, and the events of a case are represented in the form of a trace; an event has a plurality of attributes, and here an event is represented by its activity; the sets are defined as follows:
a. a directly-follows relation means that, in one trace of the event log, activity b immediately follows activity a, denoted <a, b>; the log's directly-follows activity relation set is the union of the directly-follows activity relation sets of all traces in the log and is denoted dfrSetLog;
b. the start points of all traces form the start point set of the log, denoted StartSet;
c. the end points of all traces form the end point set of the log, denoted EndSet.
Further, in step 3), if the full traversal sampling method is selected, the traces of the event log are traversed sequentially starting from the first trace; a trace is added to the sample log when at least one of the intersection of the trace start point and the start point set, the intersection of the trace end point and the end point set, and the intersection of the trace's directly-follows activity relation set and the log's directly-follows activity relation set is not an empty set, and these intersections are then deleted from the start point set, the end point set and the log's directly-follows activity relation set, respectively; trace traversal stops when the log's directly-follows activity relation set, the start point set and the end point set are all empty sets;
if the set coverage sampling method is selected, all traces in the log are traversed and the trace whose directly-follows activity relation set has the largest intersection with the log's directly-follows activity relation set is selected; the trace is added to the sample log provided that at least one of the intersection of the trace start point and the start point set, the intersection of the trace end point and the end point set, and the intersection of the trace's directly-follows activity relation set and the log's directly-follows activity relation set is not an empty set, and these intersections are then deleted from the start point set, the end point set and the log's directly-follows activity relation set, respectively; the selection is repeated, and trace traversal stops when the log's directly-follows activity relation set, the start point set and the end point set are all empty sets;
if the sampling method based on trace length is selected, where the trace length is the number of activities contained in a trace, the lengths of all traces in the event log are first counted and sorted in descending order; the traces are then traversed sequentially starting from the longest trace; a trace is added to the sample log when at least one of the intersection of the trace start point and the start point set, the intersection of the trace end point and the end point set, and the intersection of the trace's directly-follows activity relation set and the log's directly-follows activity relation set is not an empty set, and these intersections are then deleted from the start point set, the end point set and the log's directly-follows activity relation set, respectively, until the log's directly-follows activity relation set, the start point set and the end point set are all empty sets;
if the sampling method based on trace frequency is selected, where the trace frequency is the number of times a trace occurs in the event log, the trace frequencies of the event log are first counted and a deduplication operation is performed, the deduplication operation meaning that only one trace is kept for each group of identical traces, together with its frequency; the traces are then sorted in descending order of trace frequency and traversed sequentially starting from the most frequent trace; a trace is added to the sample log when at least one of the intersection of the trace start point and the start point set, the intersection of the trace end point and the end point set, and the intersection of the trace's directly-follows activity relation set and the log's directly-follows activity relation set is not an empty set, and these intersections are then deleted from the start point set, the end point set and the log's directly-follows activity relation set, respectively, until the log's directly-follows activity relation set, the start point set and the end point set are all empty sets.
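The add-and-delete step shared by the four methods can be illustrated with a minimal Python sketch of the full traversal variant. The function names `df_pairs` and `full_traversal_sample`, and the representation of a trace as a tuple of activity labels, are assumptions made for illustration only and are not taken from the invention's actual implementation.

```python
from typing import List, Sequence, Set, Tuple

Trace = Tuple[str, ...]
Pair = Tuple[str, str]


def df_pairs(trace: Sequence[str]) -> Set[Pair]:
    """Directly-follows pairs <a, b> of a single trace."""
    return set(zip(trace, trace[1:]))


def full_traversal_sample(log: List[Trace]) -> List[Trace]:
    """Hypothetical sketch: scan traces in log order, keep those that still contribute."""
    # Step 1: the three sets of the log (dfrSetLog, StartSet, EndSet).
    dfr_set_log: Set[Pair] = set().union(*(df_pairs(t) for t in log))
    start_set = {t[0] for t in log if t}
    end_set = {t[-1] for t in log if t}

    sample: List[Trace] = []
    for trace in log:
        # Stop as soon as nothing is left to cover.
        if not (dfr_set_log or start_set or end_set):
            break
        if not trace:
            continue  # assumption: empty traces are ignored
        # Step 2: the three intersections of this trace with the remaining sets.
        new_pairs = df_pairs(trace) & dfr_set_log
        new_start = {trace[0]} & start_set
        new_end = {trace[-1]} & end_set
        # Step 3: keep the trace if at least one intersection is non-empty,
        # then delete the intersections from the remaining sets.
        if new_pairs or new_start or new_end:
            sample.append(trace)
            dfr_set_log -= new_pairs
            start_set -= new_start
            end_set -= new_end
    return sample
```

Under these assumptions, the other three methods differ only in the order in which candidate traces are presented to this loop, or, for set coverage, in how the next candidate is chosen.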
The second object of the invention is achieved by the following technical scheme: a business process event log sampling system, comprising an event log data acquisition module, a trace set intersection judgment module, an event log sampling selection module and a sample log trace selection module;
the event log data acquisition module is used for acquiring the log's directly-follows activity relation set, the start point set and the end point set;
the trace set intersection judgment module is used for judging whether the intersection of the trace start point and the start point set, the intersection of the trace end point and the end point set, and the intersection of the trace's directly-follows activity relation set and the log's directly-follows activity relation set are empty sets;
the event log sampling selection module is used for selecting one of four event log sampling methods, namely a full traversal sampling method, a set coverage sampling method, a sampling method based on trace length and a sampling method based on trace frequency, or for directly finishing the trace traversal of the event log and outputting the sample log;
the sample log trace selection module is used for selecting traces to form a new log, the new log being the sample log.
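One possible way to wire the four modules together is sketched below in Python. All names (`LogSets`, `acquire_sets`, `judge`, `Method`, `ORDERINGS`, `select_traces`) are hypothetical, and only two of the four sampling methods are included to keep the sketch short; it is an illustration of the module boundaries, not the system's actual code.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List, Set, Tuple

Trace = Tuple[str, ...]


@dataclass
class LogSets:
    dfr: Set[Tuple[str, str]]  # dfrSetLog
    start: Set[str]            # StartSet
    end: Set[str]              # EndSet

    def all_empty(self) -> bool:
        return not (self.dfr or self.start or self.end)


def acquire_sets(log: List[Trace]) -> LogSets:
    """Event log data acquisition module (sketch)."""
    dfr = {pair for t in log for pair in zip(t, t[1:])}
    return LogSets(dfr, {t[0] for t in log if t}, {t[-1] for t in log if t})


def judge(trace: Trace, sets: LogSets) -> LogSets:
    """Trace set intersection judgment module: returns the three intersections."""
    if not trace:
        return LogSets(set(), set(), set())
    return LogSets(set(zip(trace, trace[1:])) & sets.dfr,
                   {trace[0]} & sets.start,
                   {trace[-1]} & sets.end)


class Method(Enum):
    FULL_TRAVERSAL = "full traversal"
    TRACE_LENGTH = "trace length"


# Event log sampling selection module: maps a method to a traversal order.
ORDERINGS: Dict[Method, Callable[[List[Trace]], List[Trace]]] = {
    Method.FULL_TRAVERSAL: list,
    Method.TRACE_LENGTH: lambda log: sorted(log, key=len, reverse=True),
}


def select_traces(log: List[Trace], method: Method) -> List[Trace]:
    """Sample log trace selection module (sketch)."""
    sets, sample = acquire_sets(log), []
    for trace in ORDERINGS[method](log):
        if sets.all_empty():      # all judgment results would be empty: finish
            break
        hit = judge(trace, sets)
        if not hit.all_empty():   # the trace still contributes something
            sample.append(trace)
            sets.dfr -= hit.dfr
            sets.start -= hit.start
            sets.end -= hit.end
    return sample
```

Keeping the traversal order behind a lookup table is one design choice among several; it lets the judgment and selection logic stay identical across the order-based methods.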
Further, the event log data acquisition module performs the following operations:
acquiring the start point set, the end point set and the log's directly-follows activity relation set of the event log, wherein the event log is composed of cases, each case is composed of events, and the events of a case are represented in the form of a trace; an event has a plurality of attributes, and here an event is represented by its activity; the sets are defined as follows:
a. a directly-follows relation means that, in one trace of the event log, activity b immediately follows activity a, denoted <a, b>; the log's directly-follows activity relation set is the union of the directly-follows activity relation sets of all traces in the log and is denoted dfrSetLog;
b. the start points of all traces form the start point set of the log, denoted StartSet;
c. the end points of all traces form the end point set of the log, denoted EndSet.
Further, the trace set intersection judgment module performs the following operations:
judging, according to the log's directly-follows activity relation set, the start point set and the end point set obtained by the event log data acquisition module, whether the intersection of the trace start point and the start point set, the intersection of the trace end point and the end point set, and the intersection of the trace's directly-follows activity relation set and the log's directly-follows activity relation set are empty sets.
Further, the event log sampling selection module performs the following operations according to the judgment results obtained by the trace set intersection judgment module:
a. if the judgment results are all empty sets, finishing the trace traversal of the event log and outputting the sample log;
b. if the judgment results are not all empty sets, selecting one of four event log sampling methods, namely a full traversal sampling method, a set coverage sampling method, a sampling method based on trace length and a sampling method based on trace frequency.
Further, the sample log trace selection module performs the following operations:
a. if the full traversal sampling method is selected, the traces of the event log are traversed sequentially starting from the first trace; a trace is added to the sample log when at least one of the intersection of the trace start point and the start point set, the intersection of the trace end point and the end point set, and the intersection of the trace's directly-follows activity relation set and the log's directly-follows activity relation set is not an empty set, and these intersections are then deleted from the start point set, the end point set and the log's directly-follows activity relation set, respectively; trace traversal stops when the log's directly-follows activity relation set, the start point set and the end point set are all empty sets;
b. if the set coverage sampling method is selected, all traces in the log are traversed and the trace whose directly-follows activity relation set has the largest intersection with the log's directly-follows activity relation set is selected; the trace is added to the sample log provided that at least one of the intersection of the trace start point and the start point set, the intersection of the trace end point and the end point set, and the intersection of the trace's directly-follows activity relation set and the log's directly-follows activity relation set is not an empty set, and these intersections are then deleted from the start point set, the end point set and the log's directly-follows activity relation set, respectively; the selection is repeated, and trace traversal stops when the log's directly-follows activity relation set, the start point set and the end point set are all empty sets;
c. if the sampling method based on trace length is selected, where the trace length is the number of activities contained in a trace, the lengths of all traces in the event log are first counted and sorted in descending order; the traces are then traversed sequentially starting from the longest trace; a trace is added to the sample log when at least one of the intersection of the trace start point and the start point set, the intersection of the trace end point and the end point set, and the intersection of the trace's directly-follows activity relation set and the log's directly-follows activity relation set is not an empty set, and these intersections are then deleted from the start point set, the end point set and the log's directly-follows activity relation set, respectively, until the log's directly-follows activity relation set, the start point set and the end point set are all empty sets;
d. if the sampling method based on trace frequency is selected, where the trace frequency is the number of times a trace occurs in the event log, the trace frequencies of the event log are first counted and a deduplication operation is performed, the deduplication operation meaning that only one trace is kept for each group of identical traces, together with its frequency; the traces are then sorted in descending order of trace frequency and traversed sequentially starting from the most frequent trace; a trace is added to the sample log when at least one of the intersection of the trace start point and the start point set, the intersection of the trace end point and the end point set, and the intersection of the trace's directly-follows activity relation set and the log's directly-follows activity relation set is not an empty set, and these intersections are then deleted from the start point set, the end point set and the log's directly-follows activity relation set, respectively, until the log's directly-follows activity relation set, the start point set and the end point set are all empty sets.
The third object of the invention is achieved by the following technical scheme: a storage medium storing a program which, when executed by a processor, implements the business process event log sampling method described above.
The fourth object of the invention is achieved by the following technical scheme: a computing device comprising a processor and a memory for storing a program executable by the processor, the processor implementing the business process event log sampling method described above when executing the program stored by the memory.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. for a large-scale event log, the invention samples the log with a more efficient business process event log sampling method, and the resulting sample log preserves the completeness of the log;
2. by sampling with the more efficient business process event log sampling method, the invention greatly improves the sampling efficiency of event logs while preserving the quality of the mined model, and provides four new sampling methods for the process mining field;
3. the method can be deployed on a distributed system in combination with big data technology, so that ultra-large-scale event logs can be processed more efficiently;
4. the method is broadly applicable to process discovery on large-scale logs, is highly practical, and has broad prospects in process discovery, conformance checking and other areas of process mining.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but embodiments of the present invention are not limited thereto.
Example 1
This embodiment discloses a business process event log sampling method, which is implemented as a plug-in in the ProM tool, as shown in fig. 2. As shown in fig. 1, an original event log is input, the log's directly-follows activity relation set, start point set and end point set are obtained, one of four event log sampling methods is selected, and sampling is performed according to the selected sampling strategy to obtain a sample event log. The method specifically includes the following steps:
1) The directly-follows activity relation set, the start point set and the end point set of the event log are obtained. The event log is composed of cases, each case is composed of events, and the events of a case are represented in the form of a trace; an event has a plurality of attributes, and here an event is represented by its activity. The sets are defined as follows: a directly-follows relation means that, in one trace of the event log, activity b immediately follows activity a, denoted <a, b>; the start points of all traces form the start point set, and the end points of all traces form the end point set. For the example event log below, the three sets are determined as follows:
The example event log L contains 9 traces over a total of 6 activities, where σ1 = <a, d, e>, σ2 = <a, b, c, e>, σ3 = <b, c, e, f>, σ4 = <b, d, f>, σ5 = <c, d>, σ6 = <a, c, d>, σ7 = <b, c, d>, σ8 = <a, d, e> and σ9 = <b, c, e, f>, so that L = [<a, d, e>^2, <a, b, c, e>, <b, c, e, f>^2, <b, d, f>, <c, d>, <a, c, d>, <b, c, d>], the superscript giving the number of occurrences of a trace. Fig. 4 shows an original event log given as input to the invention, and fig. 5 shows the sample log finally obtained from it by an event log sampling method.
a. the directly-follows activity relation set of the log is denoted dfrSetLog, dfrSetLog = [<a, d>, <d, e>, <a, b>, <b, c>, <c, e>, <e, f>, <b, d>, <d, f>, <c, d>, <a, c>];
b. the start point set of the log is denoted StartSet, StartSet = [a, b, c];
c. the end point set of the log is denoted EndSet, EndSet = [e, f, d];
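The three sets listed above can be reproduced for the example event log with a few lines of Python. This is a minimal sketch: the helper name `log_sets` and the encoding of each trace as a tuple of single-letter activity labels are assumptions made for illustration.

```python
from typing import List, Set, Tuple

Trace = Tuple[str, ...]

# The nine traces σ1..σ9 of the example event log L, one letter per activity.
LOG: List[Trace] = [tuple(s) for s in
                    ["ade", "abce", "bcef", "bdf", "cd", "acd", "bcd", "ade", "bcef"]]


def log_sets(log: List[Trace]) -> Tuple[Set[Tuple[str, str]], Set[str], Set[str]]:
    """Return (dfrSetLog, StartSet, EndSet) for a log of non-empty traces."""
    # Every pair of adjacent activities <a, b> in any trace.
    dfr_set_log = {pair for trace in log for pair in zip(trace, trace[1:])}
    # First and last activity of every trace.
    start_set = {trace[0] for trace in log if trace}
    end_set = {trace[-1] for trace in log if trace}
    return dfr_set_log, start_set, end_set


dfr_set_log, start_set, end_set = log_sets(LOG)
print(sorted(dfr_set_log))  # the ten directly-follows pairs of item a
print(sorted(start_set))    # ['a', 'b', 'c']
print(sorted(end_set))      # ['d', 'e', 'f']
```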
2) Judging whether the intersection of the trace start point and the start point set, the intersection of the trace end point and the end point set, and the intersection of the trace's directly-follows activity relation set and the log's directly-follows activity relation set are empty sets; if the judgment results are all empty sets, finishing the trace traversal of the event log and outputting the sample log; if the judgment results are not all empty sets, first selecting the business process event log sampling plug-in (named Business Process Event Log Sampling Plugin) in the ProM 6 platform, and then selecting one of four event log sampling methods, namely: (1) the full traversal sampling method (Brute Force Sampling); (2) the set coverage sampling method (Set Coverage Sampling); (3) the sampling method based on trace length (Trace Length-based Sampling); (4) the sampling method based on trace frequency (Trace Frequency-based Sampling); fig. 3 shows the selection interface of the sampling methods;
3) According to the event log sampling method selected in step 2), traces are selected to form a new log, the new log being the sample log, specifically as follows:
a. If the full traversal sampling method is selected, the traces of the event log are traversed sequentially starting from the first trace; a trace is added to the sample log when at least one of the intersection of the trace start point and the start point set, the intersection of the trace end point and the end point set, and the intersection of the trace's directly-follows activity relation set and the log's directly-follows activity relation set is not an empty set, and these intersections are then deleted from the start point set, the end point set and the log's directly-follows activity relation set, respectively; trace traversal stops when the log's directly-follows activity relation set, the start point set and the end point set are all empty sets. The resulting sample log L' of the example event log is thus L' = [<a, d, e>, <a, b, c, e>, <b, c, e, f>, <b, d, f>, <c, d>, <a, c, d>].
b. If the set coverage sampling method is selected, all traces in the log are traversed and the trace whose directly-follows activity relation set has the largest intersection with the log's directly-follows activity relation set is selected; the trace is added to the sample log provided that at least one of the intersection of the trace start point and the start point set, the intersection of the trace end point and the end point set, and the intersection of the trace's directly-follows activity relation set and the log's directly-follows activity relation set is not an empty set, and these intersections are then deleted from the start point set, the end point set and the log's directly-follows activity relation set, respectively; the selection is repeated, and trace traversal stops when the log's directly-follows activity relation set, the start point set and the end point set are all empty sets. The resulting sample log L' of the example event log is thus L' = [<a, d, e>, <a, b, c, e>, <b, c, e, f>, <b, d, f>, <c, d>, <a, c, d>].
c. If the sampling method based on trace length is selected, where the trace length is the number of activities contained in a trace, the lengths of all traces in the event log are first counted and sorted in descending order; the traces are then traversed sequentially starting from the longest trace; a trace is added to the sample log when at least one of the intersection of the trace start point and the start point set, the intersection of the trace end point and the end point set, and the intersection of the trace's directly-follows activity relation set and the log's directly-follows activity relation set is not an empty set, and these intersections are then deleted from the start point set, the end point set and the log's directly-follows activity relation set, respectively, until the log's directly-follows activity relation set, the start point set and the end point set are all empty sets. The resulting sample log L' of the example event log is thus L' = [<a, d, e>, <a, b, c, e>, <b, c, e, f>, <b, d, f>, <c, d>, <a, c, d>].
d. If the sampling method based on trace frequency is selected, where the trace frequency is the number of times a trace occurs in the event log, the trace frequencies of the event log are first counted and a deduplication operation is performed, the deduplication operation meaning that only one trace is kept for each group of identical traces, together with its frequency; the traces are then sorted in descending order of trace frequency and traversed sequentially starting from the most frequent trace; a trace is added to the sample log when at least one of the intersection of the trace start point and the start point set, the intersection of the trace end point and the end point set, and the intersection of the trace's directly-follows activity relation set and the log's directly-follows activity relation set is not an empty set, and these intersections are then deleted from the start point set, the end point set and the log's directly-follows activity relation set, respectively, until the log's directly-follows activity relation set, the start point set and the end point set are all empty sets. The resulting sample log L' of the example event log is thus L' = [<a, d, e>, <a, b, c, e>, <b, c, e, f>, <b, d, f>, <c, d>, <a, c, d>].
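The four strategies can be compared on the example event log with the following Python sketch. It is an illustration under stated assumptions rather than the plug-in's implementation: the class `Remaining` and all function names are hypothetical, traces are assumed non-empty, ties are broken by position in the log (Python's stable sort and first-maximum rule), and the set coverage variant additionally breaks ties by the total number of newly covered elements, which the description does not specify.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple

Trace = Tuple[str, ...]
LOG: List[Trace] = [tuple(s) for s in
                    ["ade", "abce", "bcef", "bdf", "cd", "acd", "bcd", "ade", "bcef"]]


def pairs(trace: Trace) -> Set[Tuple[str, str]]:
    return set(zip(trace, trace[1:]))


class Remaining:
    """Elements of dfrSetLog, StartSet and EndSet not yet covered by the sample."""

    def __init__(self, log: List[Trace]) -> None:
        self.dfr = {p for t in log for p in pairs(t)}
        self.start = {t[0] for t in log}
        self.end = {t[-1] for t in log}

    def empty(self) -> bool:
        return not (self.dfr or self.start or self.end)

    def gain(self, trace: Trace) -> int:
        """Number of still-uncovered elements this trace would cover."""
        return (len(pairs(trace) & self.dfr)
                + (trace[0] in self.start) + (trace[-1] in self.end))

    def cover(self, trace: Trace) -> None:
        self.dfr -= pairs(trace)
        self.start.discard(trace[0])
        self.end.discard(trace[-1])


def sample_in_order(log: List[Trace], ordered: Iterable[Trace]) -> List[Trace]:
    """Shared loop of the three order-based methods."""
    rest, sample = Remaining(log), []
    for trace in ordered:
        if rest.empty():
            break
        if rest.gain(trace) > 0:
            sample.append(trace)
            rest.cover(trace)
    return sample


def full_traversal(log: List[Trace]) -> List[Trace]:
    return sample_in_order(log, log)


def by_length(log: List[Trace]) -> List[Trace]:
    return sample_in_order(log, sorted(log, key=len, reverse=True))


def by_frequency(log: List[Trace]) -> List[Trace]:
    freq = Counter(log)  # one entry per distinct trace, with its frequency
    return sample_in_order(log, sorted(freq, key=freq.get, reverse=True))


def set_coverage(log: List[Trace]) -> List[Trace]:
    rest, sample = Remaining(log), []
    candidates = list(dict.fromkeys(log))  # distinct traces in log order
    while not rest.empty():
        # Greedy choice: largest overlap with the remaining directly-follows
        # pairs; the second key (total gain) is an assumed tie-breaker.
        best = max(candidates, key=lambda t: (len(pairs(t) & rest.dfr), rest.gain(t)))
        sample.append(best)
        rest.cover(best)
    return sample


for method in (full_traversal, set_coverage, by_length, by_frequency):
    print(method.__name__, ["".join(t) for t in method(LOG)])
```

With these tie-breaking assumptions, every method selects the same six traces as the sample log L' above; only the order in which the traces are added differs between methods.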
Example 2
This embodiment discloses a business process event log sampling system which, as shown in fig. 6, comprises an event log data acquisition module, a trace set intersection judgment module, an event log sampling selection module and a sample log trace selection module;
the event log data acquisition module is used for acquiring the log's directly-follows activity relation set, the start point set and the end point set;
the trace set intersection judgment module is used for judging whether the intersection of the trace start point and the start point set, the intersection of the trace end point and the end point set, and the intersection of the trace's directly-follows activity relation set and the log's directly-follows activity relation set are empty sets;
the event log sampling selection module is used for selecting one of four event log sampling methods, namely a full traversal sampling method, a set coverage sampling method, a sampling method based on trace length and a sampling method based on trace frequency, or for directly finishing the trace traversal of the event log and outputting the sample log;
the sample log trace selection module is used for selecting traces to form a new log, the new log being the sample log.
The event log data acquisition module performs the following operations:
acquiring the start point set, the end point set and the log's directly-follows activity relation set of the event log, wherein the event log is composed of cases, each case is composed of events, and the events of a case are represented in the form of a trace. An event has a plurality of attributes, and in the present invention an event is represented by its activity. The three sets are determined as follows: the example event log L contains 9 traces over a total of 6 activities, where σ1 = <a, d, e>, σ2 = <a, b, c, e>, σ3 = <b, c, e, f>, σ4 = <b, d, f>, σ5 = <c, d>, σ6 = <a, c, d>, σ7 = <b, c, d>, σ8 = <a, d, e> and σ9 = <b, c, e, f>, so that L = [<a, d, e>^2, <a, b, c, e>, <b, c, e, f>^2, <b, d, f>, <c, d>, <a, c, d>, <b, c, d>], the superscript giving the number of occurrences of a trace.
a. the directly-follows activity relation set of the log is denoted dfrSetLog, dfrSetLog = [<a, d>, <d, e>, <a, b>, <b, c>, <c, e>, <e, f>, <b, d>, <d, f>, <c, d>, <a, c>];
b. the start point set of the log is denoted StartSet, StartSet = [a, b, c];
c. the end point set of the log is denoted EndSet, EndSet = [e, f, d].
The trace set intersection judgment module performs the following operations:
judging, according to the log's directly-follows activity relation set, the start point set and the end point set obtained by the event log data acquisition module, whether the intersection of the trace start point and the start point set, the intersection of the trace end point and the end point set, and the intersection of the trace's directly-follows activity relation set and the log's directly-follows activity relation set are empty sets.
The event log sampling selection module performs the following operations:
a. if the judgment results are all empty sets, finishing the trace traversal of the event log and outputting the sample log;
b. if the judgment results are not all empty sets, first selecting the business process event log sampling plug-in (named Business Process Event Log Sampling Plugin) in the ProM 6 platform, and then selecting one of four event log sampling methods, namely: (1) the full traversal sampling method (Brute Force Sampling); (2) the set coverage sampling method (Set Coverage Sampling); (3) the sampling method based on trace length (Trace Length-based Sampling); (4) the sampling method based on trace frequency (Trace Frequency-based Sampling).
The sample log trace selection module performs the following operations:
a. If the full traversal sampling method is selected, the traces of the event log are traversed sequentially starting from the first trace; a trace is added to the sample log when at least one of the intersection of the trace start point and the start point set, the intersection of the trace end point and the end point set, and the intersection of the trace's directly-follows activity relation set and the log's directly-follows activity relation set is not an empty set, and these intersections are then deleted from the start point set, the end point set and the log's directly-follows activity relation set, respectively; trace traversal stops when the log's directly-follows activity relation set, the start point set and the end point set are all empty sets. The resulting sample log L' of the example event log is thus L' = [<a, d, e>, <a, b, c, e>, <b, c, e, f>, <b, d, f>, <c, d>, <a, c, d>].
b. If the set coverage sampling method is selected, all traces in the log are traversed and the trace whose directly-follows activity relation set has the largest intersection with the log's directly-follows activity relation set is selected; the trace is added to the sample log provided that at least one of the intersection of the trace start point and the start point set, the intersection of the trace end point and the end point set, and the intersection of the trace's directly-follows activity relation set and the log's directly-follows activity relation set is not an empty set, and these intersections are then deleted from the start point set, the end point set and the log's directly-follows activity relation set, respectively; the selection is repeated, and trace traversal stops when the log's directly-follows activity relation set, the start point set and the end point set are all empty sets. The resulting sample log L' of the example event log is thus L' = [<a, d, e>, <a, b, c, e>, <b, c, e, f>, <b, d, f>, <c, d>, <a, c, d>].
c. If the sampling method based on trace length is selected, where the trace length is the number of activities contained in a trace, the lengths of all traces in the event log are first counted and sorted in descending order; the traces are then traversed sequentially starting from the longest trace; a trace is added to the sample log when at least one of the intersection of the trace start point and the start point set, the intersection of the trace end point and the end point set, and the intersection of the trace's directly-follows activity relation set and the log's directly-follows activity relation set is not an empty set, and these intersections are then deleted from the start point set, the end point set and the log's directly-follows activity relation set, respectively, until the log's directly-follows activity relation set, the start point set and the end point set are all empty sets. The resulting sample log L' of the example event log is thus L' = [<a, d, e>, <a, b, c, e>, <b, c, e, f>, <b, d, f>, <c, d>, <a, c, d>].
d. If the sampling method based on trace frequency is selected, where the trace frequency is the number of times a trace occurs in the event log, the trace frequencies of the event log are first counted and a deduplication operation is performed, the deduplication operation meaning that only one trace is kept for each group of identical traces, together with its frequency; the traces are then sorted in descending order of trace frequency and traversed sequentially starting from the most frequent trace; a trace is added to the sample log when at least one of the intersection of the trace start point and the start point set, the intersection of the trace end point and the end point set, and the intersection of the trace's directly-follows activity relation set and the log's directly-follows activity relation set is not an empty set, and these intersections are then deleted from the start point set, the end point set and the log's directly-follows activity relation set, respectively, until the log's directly-follows activity relation set, the start point set and the end point set are all empty sets. The resulting sample log L' of the example event log is thus L' = [<a, d, e>, <a, b, c, e>, <b, c, e, f>, <b, d, f>, <c, d>, <a, c, d>].
Example 3
This embodiment discloses a storage medium storing a program which, when executed by a processor, implements the business process event log sampling method described in Example 1.
The storage medium in this embodiment may be a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), a USB flash drive, a removable hard disk, or the like.
Example 4
This embodiment discloses a computing device comprising a processor and a memory for storing a program executable by the processor, wherein the processor implements the above business process event log sampling method when executing the program stored in the memory.
The computing device described in this embodiment may be a desktop computer, a notebook computer, a smart phone, a PDA handheld terminal, a tablet computer, a programmable logic controller (PLC), or another terminal device with a processor.
In summary, the invention provides a new approach to the problem that existing event log sampling methods cannot effectively process the information in large-scale event logs, or process it inefficiently, which makes the discovery of process models inefficient. With the above scheme, a sufficiently representative sample log can be obtained effectively through sampling; the invention has practical value and is worth popularizing.
The above embodiments are only preferred embodiments of the present invention and are not intended to limit its scope of protection; any variation made according to the shape and principle of the present invention shall fall within the scope of protection of the present invention.