CN110032494B - Double-granularity noise log filtering method based on incidence relation

Info

Publication number: CN110032494B
Application number: CN201910218832.3A
Authority: CN
Inventors: 孙笑笑; 侯文杰; 俞东进; 潘建梁
Original assignee: Hangzhou Dianzi University
Current assignee: Hangzhou Dianzi University
Priority date: 2019-03-21
Filing date: 2019-03-21
Publication date: 2020-05-26
Anticipated expiration: 2039-03-21
Also published as: CN110032494A

Abstract

The invention discloses a dual-granularity noise log filtering method based on an incidence relation. The method computes a mixed dependency from a local dependency and a global dependency, and can thereby perform fine-grained filtering of noise events and coarse-grained filtering of noise traces in a log at the same time. Compared with traditional log filtering methods, it offers the following benefits: 1. a dual-granularity filtering mechanism applies different filtering mechanisms to different noise scenarios, so that a good filtering effect is achieved while as much of the original log data as possible is retained; 2. using the filtered log file for process mining can greatly improve the accuracy of the discovered process model and makes the model easier to understand.

Description

Double-granularity noise log filtering method based on incidence relation

Technical Field

The invention relates to the field of process mining, in particular to a dual-granularity noise log filtering method based on an incidence relation.

Background

Process mining aims to extract useful information from the event logs recorded by process-aware information systems, to help stakeholders understand how a process is actually executed. Process discovery is an important part of process mining; its purpose is to construct a process model that can reproduce the behavior recorded in the event log. A high-precision model shows intuitively how the business process is actually executed.

In a business process management system, the activities of a business process are performed according to a well-designed process model, and their execution is recorded in a log to help stakeholders analyze and monitor the process. In practice, however, most business processes have no standardized process model, or the model has drifted far from the current business process as the process continuously evolves, so the actual execution behavior of the process has to be extracted from its logs by means of process discovery techniques. Noise in the log degrades the quality of the discovered model: when process discovery is applied to a log containing noise, the discovered model may contain invisible tasks and non-free-choice constructs, which increases the complexity of the model and reduces its understandability. Common log noise falls into the following types: missing noise events (some events in the process are not logged for some reason), redundant noise events (some events in the process are logged repeatedly), and misplaced noise events (some events are logged in the wrong position relative to the order in which they occurred in the process trace).

A noise filtering algorithm can effectively filter noise events out of the log and thereby greatly improve the accuracy of the discovered process model. Current log noise filtering algorithms can be roughly divided by filtering granularity into two types: coarse-grained filtering and fine-grained filtering. Coarse-grained filtering removes every trace containing a noise event from the original log; for small logs, removing entire traces may substantially change the structure of the mined model. Fine-grained filtering removes only the noise event and keeps the other events of the trace, but there is no guarantee that removing the event does not itself introduce new noise into the trace, and such algorithms cannot handle missing noise events.

Disclosure of Invention

In order to overcome the defects of the prior art, the invention provides a dual-granularity noise log filtering method based on an incidence relation, which can effectively solve the problems. The technical scheme adopted by the invention is as follows:

a dual-granularity noise log filtering method based on incidence relation comprises the following steps:

(1) Input an original log file and preprocess it to generate a log set consisting of a plurality of process traces σ. Each process trace is composed of a plurality of process events e_i, written σ = <e₁, …, e_n>. The set of all process events e in all process traces is denoted ε, i.e. e ∈ ε;

(2) Count, over all process traces in the log set, the frequency dependency DFD(e_i, e_j) between every two process events;

(3) Further calculate, for every two events, the local dependency Dep_local(e_i, e_j), the global dependency Dep_global(e_i, e_j), and the mixed dependency Dep_mixed(e_i, e_j);

The calculation of the local dependency Dep_local(e_i, e_j) uses two constants C₁ and C₂, the successor density D_suc(e_i), i.e. the average frequency of occurrence of all successor events of event e_i, and the predecessor density D_pre(e_j), i.e. the average frequency of occurrence of all predecessor events of event e_j. The predecessor density and successor density are calculated as follows:

D_pre(e_k) = N_pre(e_k) / |U_pre(e_k)|

D_suc(e_k) = N_suc(e_k) / |U_suc(e_k)|

where D_pre(e_k) is the predecessor density of event e_k; D_suc(e_k) is the successor density of event e_k; N_pre(e_k) is the number of directly-follows relations in which e_k is the successor event; N_suc(e_k) is the number of directly-follows relations in which e_k is the predecessor event; U_pre(e_k) is the predecessor set of event e_k, and |U_pre(e_k)| is the number of distinct event types in that set; U_suc(e_k) is the successor set of event e_k, and |U_suc(e_k)| is the number of distinct event types in that set;

The calculation of the global dependency Dep_global(e_i, e_j) uses the maximum frequency dependency over all event pairs,

θ = Max{DFD(e_x, e_y)}

together with the global noise factor ζ, which is used to partition global noise events.

The mixed dependency Dep_mixed(e_i, e_j) is calculated as follows:

Dep_mixed(e_i, e_j) = α * Dep_local(e_i, e_j) + (1 - α) * Dep_global(e_i, e_j)

where α is a trade-off factor that balances the proportions of the local and global dependencies.

(4) Construct, from the mixed dependencies calculated in the previous step, the mixed dependency matrix of all process events in the log set;

(5) Filter the log noise, specifically through the following sub-steps:

51) construct an empty log set for storing the filtered traces;

52) take a trace σ out of the log set and initialize the discard value of σ to 1;

53) get the start event e_start of σ and add e_start to an empty event sequence σ_filter;

54) take out the current event e_i according to the order of events in σ;

55) take out the next event e_{i+1} after the current event in the trace;

56) look up the mixed dependency Dep_mixed(e_i, e_{i+1}) of e_i and e_{i+1} in the mixed dependency matrix. Fine-grained filtering of events is performed first: if Dep_mixed(e_i, e_{i+1}) is not less than the mixedness threshold β, event e_{i+1} is determined to be a normal event and is added to the trace σ_filter, e_{i+1} becomes the current event (i = i + 1), and the method returns to step 55); if Dep_mixed(e_i, e_{i+1}) is less than the mixedness threshold β, event e_{i+1} is determined to be a noise event, and a penalty function is used to update the discard value of the trace σ. The penalty function depends on a penalty factor, which determines the degree of punishment. If the updated discard value is not lower than the set discard threshold, the method returns to step 55); if the updated discard value is below the discard threshold, the coarse-grained filtering operation is performed on the trace: the trace σ is determined to be a noise trace, and the method returns to step 52);

57) if event e_{i+1} is the end event e_end of the current trace σ, add the filtered trace σ_filter to the filtered log set;

58) repeat steps 52) to 57) until all traces in the original log set have been taken out;

59) output the filtered log set;

(6) Regenerate the log file from the output filtered log set.
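Sub-steps 51) to 59) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names are hypothetical, pairs absent from the matrix are treated as having dependency 0, and, because the penalty formula is not legible in this text, `default_penalty` is a stand-in that simply multiplies the discard value by the penalty factor. The mixed dependency values in `EXAMPLE_DEP` are the ones quoted in the two worked examples of the embodiment.

```python
from typing import Callable, Dict, List, Optional, Tuple

Trace = List[str]
DepMatrix = Dict[Tuple[str, str], float]

BETA = 0.5               # mixedness threshold (preferred value in the text)
DISCARD_THRESHOLD = 0.7  # discard threshold (preferred value in the text)
PENALTY_FACTOR = 0.8     # penalty factor (preferred value in the text)


def default_penalty(discard: float, dep: float) -> float:
    """ASSUMPTION: placeholder penalty, not the formula from the patent."""
    return discard * PENALTY_FACTOR


def filter_trace(
    trace: Trace,
    dep_mixed: DepMatrix,
    beta: float = BETA,
    threshold: float = DISCARD_THRESHOLD,
    penalty: Callable[[float, float], float] = default_penalty,
) -> Optional[Trace]:
    """Return the filtered trace, or None if the whole trace is discarded."""
    if not trace:
        return None
    discard = 1.0             # step 52: discard value starts at 1
    filtered = [trace[0]]     # step 53: keep the start event
    current = trace[0]        # step 54: current event
    for nxt in trace[1:]:     # step 55: next event in the trace
        dep = dep_mixed.get((current, nxt), 0.0)  # step 56: matrix lookup
        if dep >= beta:
            filtered.append(nxt)  # normal event: keep it and advance
            current = nxt
        else:
            discard = penalty(discard, dep)  # noise event: penalise the trace
            if discard < threshold:
                return None       # coarse-grained filtering: drop the trace
    return filtered           # step 57: end event reached


def filter_log(log: List[Trace], dep_mixed: DepMatrix) -> List[Trace]:
    """Steps 51, 58, 59: collect every trace that survives filtering."""
    kept = []
    for trace in log:
        result = filter_trace(trace, dep_mixed)
        if result is not None:
            kept.append(result)
    return kept


# Mixed dependencies quoted in the worked examples of the embodiment.
EXAMPLE_DEP: DepMatrix = {
    ("A", "B"): 0.80, ("B", "C"): 0.75, ("C", "D"): 0.85, ("D", "E"): 0.87,
    ("E", "F"): 0.26, ("E", "G"): 0.87, ("G", "H"): 0.85,
    ("C", "E"): 0.26, ("C", "G"): 0.01, ("C", "H"): 0.01,
}

if __name__ == "__main__":
    print(filter_log([list("ABCDEFGH"), list("ABCEGH")], EXAMPLE_DEP))
```

With the placeholder penalty, the trace ABCDEFGH is kept as ABCDEGH and the trace ABCEGH is discarded, which agrees with the outcomes of the worked examples; the intermediate discard values differ from those in the examples because the placeholder is not the source's penalty formula.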

Preferably, the log set described in step (1) includes all execution instances of the business process; that is, each process trace σ corresponds to one execution instance of the business process, the process trace σ is an ordered sequence of a plurality of process events e, and each process event e is a record of one executed activity of the business process.

Preferably, the frequency dependency DFD(e_i, e_j) described in step (2) denotes the directly-follows degree, i.e. the total number of times event e_j directly follows event e_i across all process instances.
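As a rough illustration of the frequency dependency, the directly-follows counts can be accumulated in one pass over the traces. The function name `directly_follows` and the representation of a trace as a list of event names are assumptions made for this sketch.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def directly_follows(log: Iterable[List[str]]) -> Counter:
    """Count how often event b immediately follows event a over all traces."""
    dfd: Counter = Counter()
    for trace in log:
        # zip pairs each event with its immediate successor in the same trace
        for a, b in zip(trace, trace[1:]):
            dfd[(a, b)] += 1
    return dfd


if __name__ == "__main__":
    counts = directly_follows([list("ABC"), list("ABD"), list("AC")])
    print(counts[("A", "B")])  # A is directly followed by B in two traces
```

A `Counter` returns 0 for pairs that never occur, so DFD(e_i, e_j) is defined for every pair of events without filling a full matrix.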

Preferably, the global noise factor ζ described in step (3) is 0.02.

Preferably, the mixedness threshold β in step (5) is 0.5.

Preferably, the value of the trade-off factor α in step (5) is 0.5.
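The weighted combination Dep_mixed = α * Dep_local + (1 - α) * Dep_global, and the mixed dependency matrix built from it, can be sketched as below. The local and global dependencies are taken as already-computed inputs; the function names and the choice to treat a pair missing from either input as having dependency 0 are assumptions of this sketch.

```python
from typing import Dict, Tuple

Pair = Tuple[str, str]


def mixed_dependency(dep_local: float, dep_global: float, alpha: float = 0.5) -> float:
    """Weighted combination of the local and global dependency of one pair."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * dep_local + (1.0 - alpha) * dep_global


def mixed_matrix(
    local: Dict[Pair, float], global_: Dict[Pair, float], alpha: float = 0.5
) -> Dict[Pair, float]:
    """Sparse mixed dependency matrix keyed by (e_i, e_j)."""
    pairs = set(local) | set(global_)
    # ASSUMPTION: a pair absent from one input contributes 0 for that term.
    return {
        p: mixed_dependency(local.get(p, 0.0), global_.get(p, 0.0), alpha)
        for p in pairs
    }
```

With α = 0.5 the two dependencies contribute equally; values of α closer to 1 give more weight to the local dependency.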

Preferably, the penalty factor described in step (5) is 0.8.

Preferably, the discard threshold described in step (5) is 0.7.

The filtering method provided by the invention considers the dependency between events from both the global and the local perspective, and judges whether an event is a noise event according to that dependency. Compared with traditional log filtering methods, it offers the following benefits: 1. a dual-granularity filtering mechanism applies different filtering mechanisms to different noise scenarios, so that a good filtering effect is achieved while as much of the original log data as possible is retained; 2. using the filtered log file for process mining can greatly improve the accuracy of the discovered process model and makes the model easier to understand.

Drawings

FIG. 1 is a flow chart of a dual-granularity noise log filtering method based on an incidence relation according to the present invention;

FIG. 2 is a schematic diagram of an example of noise filtering according to the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

On the contrary, the invention is intended to cover alternatives, modifications, and equivalents that may be included within the spirit and scope of the invention as defined by the appended claims. Furthermore, in the following detailed description of the present invention, certain specific details are set forth in order to provide a better understanding of the present invention. It will be apparent to one skilled in the art that the present invention may be practiced without these specific details.

As shown in fig. 1, the method for filtering a dual-granularity noise log based on an association relationship of the present invention includes the following steps:

(1) Input an original log file and preprocess it to generate a log set consisting of a plurality of process traces σ.

The log set includes all execution instances of the business process; that is, each process trace σ corresponds to one execution instance of the business process, each process trace σ is an ordered sequence of a plurality of process events e, and each process event e is a record of one executed activity of the business process.

(2) Count, over all process traces in the log set, the frequency dependency DFD(e_i, e_j) between every two process events.

The frequency dependency DFD(e_i, e_j) denotes the directly-follows degree, i.e. the total number of times event e_j directly follows event e_i across all process instances.

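(3) Further calculate, for every two events, the local dependency Dep_local(e_i, e_j), the global dependency Dep_global(e_i, e_j), and the mixed dependency Dep_mixed(e_i, e_j).

The calculation of the local dependency Dep_local(e_i, e_j) uses the predecessor density and the successor density:

D_pre(e_k) = N_pre(e_k) / |U_pre(e_k)|

D_suc(e_k) = N_suc(e_k) / |U_suc(e_k)|

The calculation of the global dependency Dep_global(e_i, e_j) uses the maximum frequency dependency over all event pairs,

θ = Max{DFD(e_x, e_y)}

together with the global noise factor ζ, which is used to partition global noise events and is taken to be 0.02.

The mixed dependency Dep_mixed(e_i, e_j) is calculated as follows:

Dep_mixed(e_i, e_j) = α * Dep_local(e_i, e_j) + (1 - α) * Dep_global(e_i, e_j)

where α is a trade-off factor used to balance the proportions of the global and local dependencies, and is taken to be 0.5.

(4) Construct, from the mixed dependencies calculated in the previous step, the mixed dependency matrix of all process events in the log set.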

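(5) Filter the log noise, specifically through the following sub-steps:

51) construct an empty log set for storing the filtered traces;

52) take a trace σ out of the log set and initialize the discard value of σ to 1;

53) get the start event e_start of σ and add e_start to an empty event sequence σ_filter;

54) take out the current event e_i according to the order of events in σ;

55) take out the next event e_{i+1} after the current event in the trace;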

56) look up the mixed dependency Dep_mixed(e_i, e_{i+1}) of e_i and e_{i+1} in the mixed dependency matrix. Fine-grained filtering of events is performed first: if Dep_mixed(e_i, e_{i+1}) is not less than the mixedness threshold β (taken to be 0.5), event e_{i+1} is determined to be a normal event and is added to the trace σ_filter, e_{i+1} becomes the current event (i = i + 1), and the method returns to step 55); if Dep_mixed(e_i, e_{i+1}) is less than the mixedness threshold β, event e_{i+1} is determined to be a noise event, and a penalty function is used to update the discard value of the trace σ. The penalty function depends on a penalty factor, which determines the degree of punishment and is taken to be 0.8. If the updated discard value is not lower than the set discard threshold, the method returns to step 55); if the updated discard value is below the discard threshold, the coarse-grained filtering operation is performed on the trace: the trace σ is determined to be a noise trace, and the method returns to step 52). The discard threshold is taken to be 0.7;

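57) if event e_{i+1} is the end event e_end of the current trace σ, add the filtered trace σ_filter to the filtered log set;

58) repeat steps 52) to 57) until all traces in the original log set have been taken out;

59) output the filtered log set;

(6) Regenerate the log file from the output filtered log set.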
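The predecessor and successor densities used in step (3), D_pre(e_k) = N_pre(e_k) / |U_pre(e_k)| and D_suc(e_k) = N_suc(e_k) / |U_suc(e_k)|, can be derived directly from the directly-follows counts. The sketch below assumes those counts are given as a dictionary keyed by (predecessor, successor); the function name `densities` is hypothetical, and events with an empty predecessor or successor set are simply left out of the corresponding result.

```python
from collections import Counter, defaultdict
from typing import Dict, Mapping, Set, Tuple


def densities(
    dfd: Mapping[Tuple[str, str], int]
) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Return (D_pre, D_suc) computed from directly-follows counts."""
    n_pre: Counter = Counter()  # relations in which the event is the successor
    n_suc: Counter = Counter()  # relations in which the event is the predecessor
    u_pre: Dict[str, Set[str]] = defaultdict(set)  # distinct predecessors
    u_suc: Dict[str, Set[str]] = defaultdict(set)  # distinct successors
    for (a, b), count in dfd.items():
        if count <= 0:
            continue  # ignore pairs that never occur
        n_suc[a] += count
        u_suc[a].add(b)
        n_pre[b] += count
        u_pre[b].add(a)
    d_pre = {e: n_pre[e] / len(u_pre[e]) for e in u_pre}
    d_suc = {e: n_suc[e] / len(u_suc[e]) for e in u_suc}
    return d_pre, d_suc


if __name__ == "__main__":
    d_pre, d_suc = densities(
        {("A", "B"): 2, ("A", "C"): 1, ("B", "C"): 1, ("B", "D"): 1}
    )
    print(d_pre, d_suc)
```

In the small usage example, A is followed three times in total by two distinct events, so its successor density is 3 / 2 = 1.5.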

Based on the above method flow, the technical effects are further shown by the embodiments.

Examples

The steps in this embodiment are the same as those in the previous embodiment, and are not described herein again. The following shows some of the implementation processes and implementation results:

Data source acquisition: in this embodiment, the original log file is read with the Java toolkit JDOM. The root node of the log file is obtained, the child node element named Process is obtained from the root node, and all child node elements named ProcessInstance are then obtained from the Process node. A ProcessInstance node contains all the information of one execution instance of the process and usually has a plurality of node elements named AuditTrailEntry. Each AuditTrailEntry node element records the details of one event occurring in the process instance and contains many event attributes, such as a timestamp attribute, an event name attribute, and a resource attribute. The event information is screened to eliminate redundant information, retaining the event name attribute of each event; the events of the same instance are sorted by their start timestamp attribute and finally stored as a process trace σ = <e₁, …, e_n>. The id attribute of the corresponding ProcessInstance node element is assigned to the trace as the trace id, and the multiset formed by all traces in the log is stored as the original log set.
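The preprocessing described above can be sketched in Python. Note that this swaps the Java JDOM toolkit named in the embodiment for Python's standard `xml.etree.ElementTree`. The element names Process, ProcessInstance and AuditTrailEntry come from the text; the child element names `WorkflowModelElement` (event name) and `Timestamp`, and the use of ISO 8601 timestamps, are assumptions in the style of MXML logs and may need adjusting for a real file.

```python
import xml.etree.ElementTree as ET
from datetime import datetime
from typing import Dict, List

# Tiny illustrative log; the second event is deliberately listed first.
SAMPLE = """
<WorkflowLog>
  <Process id="p1">
    <ProcessInstance id="case1">
      <AuditTrailEntry>
        <WorkflowModelElement>B</WorkflowModelElement>
        <Timestamp>2019-03-21T10:05:00</Timestamp>
      </AuditTrailEntry>
      <AuditTrailEntry>
        <WorkflowModelElement>A</WorkflowModelElement>
        <Timestamp>2019-03-21T10:00:00</Timestamp>
      </AuditTrailEntry>
    </ProcessInstance>
  </Process>
</WorkflowLog>
"""


def read_traces(xml_text: str) -> Dict[str, List[str]]:
    """Return {trace id: event names ordered by timestamp}."""
    root = ET.fromstring(xml_text)
    log: Dict[str, List[str]] = {}
    for process in root.findall("Process"):
        for instance in process.findall("ProcessInstance"):
            events = []
            for entry in instance.findall("AuditTrailEntry"):
                # ASSUMED child element names; only these two are retained.
                name = entry.findtext("WorkflowModelElement")
                stamp = entry.findtext("Timestamp")
                if name is None or stamp is None:
                    continue  # skip entries lacking a name or a timestamp
                events.append((datetime.fromisoformat(stamp.strip()), name.strip()))
            events.sort(key=lambda pair: pair[0])  # stable sort by timestamp
            # The ProcessInstance id becomes the trace id.
            log[instance.get("id", "")] = [name for _, name in events]
    return log


if __name__ == "__main__":
    print(read_traces(SAMPLE))
```

Keeping only the event name and the timestamp mirrors the screening step in the text; the other attributes of an entry, such as the resource, are dropped.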

Fig. 2 shows in detail how the method of the present invention performs dual-granularity noise log filtering based on an incidence relation on two traces (Example 1 and Example 2):

Example 1: trace σ₁ = <ABCDEFGH>

1) obtain the start event A of σ₁ and add it to the empty trace sequence σ_f;

2) take out the next event B after event A and calculate the mixed dependency of the pair AB: Dep_mixed(A, B) = 0.80, which is greater than the mixedness threshold 0.5, so event B is a normal (non-noise) event and is added to the sequence σ_f;

3) take out the next event C after event B and calculate the mixed dependency of the pair BC: Dep_mixed(B, C) = 0.75, which is greater than the mixedness threshold 0.5, so event C is a normal (non-noise) event and is added to the sequence σ_f;

4) take out the next event D after event C and calculate the mixed dependency of the pair CD: Dep_mixed(C, D) = 0.85, which is greater than the mixedness threshold 0.5, so event D is a normal (non-noise) event and is added to the sequence σ_f;

5) take out the next event E after event D and calculate the mixed dependency of the pair DE: Dep_mixed(D, E) = 0.87, which is greater than the mixedness threshold 0.5, so event E is a normal (non-noise) event and is added to the sequence σ_f;

6) take out the next event F after event E and calculate the mixed dependency of the pair EF: Dep_mixed(E, F) = 0.26, which is less than the mixedness threshold 0.5, so event F is a noise event and is not added to the sequence σ_f; the penalty function updates the discard value of trace σ₁ to 0.9, which is greater than the discard threshold 0.7, so σ₁ is still a normal (non-noise) trace;

7) take out the next event G after event F and calculate the mixed dependency of the pair EG (E remains the current event because F was filtered out): Dep_mixed(E, G) = 0.87, which is greater than the mixedness threshold 0.5, so event G is a normal (non-noise) event and is added to the sequence σ_f;

8) take out the next event H after event G and calculate the mixed dependency of the pair GH: Dep_mixed(G, H) = 0.85, which is greater than the mixedness threshold 0.5, so event H is a normal (non-noise) event and is added to the sequence σ_f;

9) event H is the end event of the current trace σ₁; the trace filtered by the method is σ_f = <ABCDEGH>, which is added to the filtered log set.

Example 2: trace σ₂ = <ABCEGH>

1) obtain the start event A of σ₂ and add it to the empty trace sequence σ_f;

2) take out the next event B after event A and calculate the mixed dependency of the pair AB: Dep_mixed(A, B) = 0.80, which is greater than the mixedness threshold 0.5, so event B is a normal (non-noise) event and is added to the sequence σ_f;

3) take out the next event C after event B and calculate the mixed dependency of the pair BC: Dep_mixed(B, C) = 0.75, which is greater than the mixedness threshold 0.5, so event C is a normal (non-noise) event and is added to the sequence σ_f;

4) take out the next event E after event C and calculate the mixed dependency of the pair CE: Dep_mixed(C, E) = 0.26, which is less than the mixedness threshold 0.5, so event E is a noise event and is not added to the sequence σ_f; the penalty function updates the discard value of trace σ₂ to 0.9, which is greater than the discard threshold 0.7, so σ₂ is still a normal (non-noise) trace;

5) take out the next event G after event E and calculate the mixed dependency of the pair CG (C remains the current event because E was filtered out): Dep_mixed(C, G) = 0.01, which is less than the mixedness threshold 0.5, so event G is a noise event and is not added to the sequence σ_f; the penalty function updates the discard value of trace σ₂ to 0.72, which is greater than the discard threshold 0.7, so σ₂ is still a normal (non-noise) trace;

6) take out the next event H after event G and calculate the mixed dependency of the pair CH: Dep_mixed(C, H) = 0.01, which is less than the mixedness threshold 0.5, so event H is a noise event and is not added to the sequence σ_f; the penalty function updates the discard value of trace σ₂ to 0.58, which is less than the discard threshold 0.7, so σ₂ is a noise trace and is not added to the filtered log set.

Claims

1. A dual-granularity noise log filtering method based on incidence relation is characterized by comprising the following steps:

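(1) inputting an original log file and preprocessing it to generate a log set consisting of a plurality of process traces σ, each process trace being composed of a plurality of process events e_i, written σ = <e₁, …, e_n>, the set of all process events e in all process traces being denoted ε, i.e. e ∈ ε;

(2) counting, over all process traces in the log set, the frequency dependency DFD(e_i, e_j) between every two process events;

(3) further calculating, for every two events, the local dependency Dep_local(e_i, e_j), the global dependency Dep_global(e_i, e_j), and the mixed dependency Dep_mixed(e_i, e_j);

the calculation of the local dependency Dep_local(e_i, e_j) uses the predecessor density and the successor density:

D_pre(e_k) = N_pre(e_k) / |U_pre(e_k)|

D_suc(e_k) = N_suc(e_k) / |U_suc(e_k)|

the calculation of the global dependency Dep_global(e_i, e_j) uses the maximum frequency dependency over all event pairs,

θ = Max{DFD(e_x, e_y)}

together with the global noise factor ζ, which is used to partition global noise events;

the mixed dependency Dep_mixed(e_i, e_j) is calculated as follows:

Dep_mixed(e_i, e_j) = α * Dep_local(e_i, e_j) + (1 - α) * Dep_global(e_i, e_j)

where α is a trade-off factor used for balancing the proportions of the global dependency and the local dependency;

(4) constructing, from the mixed dependencies calculated in the previous step, the mixed dependency matrix of all process events in the log set;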

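(5) filtering the log noise, specifically through the following sub-steps:

51) constructing an empty log set for storing the filtered traces;

52) taking a trace σ out of the log set and initializing the discard value of σ to 1;

53) getting the start event e_start of σ and adding e_start to an empty event sequence σ_filter;

54) taking out the current event e_i according to the order of events in σ;

55) taking out the next event e_{i+1} after the current event in the trace;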

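56) looking up the mixed dependency Dep_mixed(e_i, e_{i+1}) of e_i and e_{i+1} in the mixed dependency matrix, and first performing fine-grained filtering of events: if Dep_mixed(e_i, e_{i+1}) is not less than the mixedness threshold β, event e_{i+1} is determined to be a normal event and is added to the trace σ_filter, e_{i+1} becomes the current event (i = i + 1), and the method returns to step 55); if Dep_mixed(e_i, e_{i+1}) is less than the mixedness threshold β, event e_{i+1} is determined to be a noise event, and a penalty function is used to update the discard value of the trace σ, the penalty function depending on a penalty factor that determines the degree of punishment; if the updated discard value is not lower than the set discard threshold, the method returns to step 55); if the updated discard value is below the discard threshold, the coarse-grained filtering operation is performed on the trace, the trace σ is determined to be a noise trace, and the method returns to step 52);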

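57) if event e_{i+1} is the end event e_end of the current trace σ, adding the filtered trace σ_filter to the filtered log set;

58) repeating steps 52) to 57) until all traces in the original log set have been taken out;

59) outputting the filtered log set;

(6) regenerating the log file from the output filtered log set.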

2. The correlation-based dual-granularity noise log filtering method according to claim 1, wherein the log set in the step (1) includes all execution instances of the business process; that is, each process trace σ corresponds to one execution instance of the business process, the process trace σ is an ordered sequence of a plurality of process events e, and each process event e is a record of one executed activity of the business process.

3. The correlation-based dual-granularity noise log filtering method according to claim 1, wherein the frequency dependency DFD(e_i, e_j) in the step (2) denotes the directly-follows degree, i.e. the total number of times event e_j directly follows event e_i across all process instances.

4. The correlation-based dual-granularity noise log filtering method according to claim 1, wherein the global noise factor ζ in step (3) is 0.02.

5. The correlation-based dual-granularity noise log filtering method as claimed in claim 1, wherein the mixedness threshold β in step (5) is 0.5.

6. The correlation-based dual-granularity noise log filtering method as claimed in claim 1, wherein the weighting factor α in the step (5) is 0.5.

7. The correlation-based dual-granularity noise log filtering method according to claim 1, wherein the penalty factor in the step (5) is 0.8.

8. The correlation-based dual-granularity noise log filtering method according to claim 1, wherein the discard threshold in the step (5) is 0.7.