CN115688853A - Process mining method and system - Google Patents

Info

Publication number
CN115688853A
Authority
CN
China
Prior art keywords
fitness
cause
individuals
function
optimal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211333392.4A
Other languages
Chinese (zh)
Inventor
任俊达
周春雷
刘识
皮志贤
赵添翼
吕宏伟
陈振宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Big Data Center Of State Grid Corp Of China
Original Assignee
Big Data Center Of State Grid Corp Of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Big Data Center Of State Grid Corp Of China filed Critical Big Data Center Of State Grid Corp Of China
Priority to CN202211333392.4A priority Critical patent/CN115688853A/en
Publication of CN115688853A publication Critical patent/CN115688853A/en
Pending legal-status Critical Current

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A process mining method and system. The method comprises: acquiring event log information for the required process model; applying a heuristic algorithm that computes causal dependency measures over the event log information to obtain a cause and effect matrix for each candidate process model; optimizing all the cause and effect matrices with a genetic algorithm to obtain an optimal cause and effect matrix; and converting the optimal cause and effect matrix into a process model, which is taken as the optimal process model. Because the cause and effect matrices are obtained with a heuristic algorithm and the optimal matrix is determined with a genetic algorithm, the method shortens search time and strengthens local search. It can also handle special structures in a process, such as invisible tasks and non-free-choice constructs, which improves the process mining result.

Description

Process mining method and system
Technical Field
The invention relates to the technical field of process mining, in particular to a process mining method and system.
Background
At present, the most mature process mining algorithm is the α-algorithm. The algorithm and its extensions are simple and easy to understand, and for certain common types of workflow they can discover a workflow model that is completely correct or behaviorally equivalent. However, the α-algorithm mainly targets structured workflow nets and cannot handle constructs that are common in business process models, such as loops, duplicate tasks, invisible tasks, and synchronizing merges.
Meanwhile, a growing number of process modeling technologies based on business systems have appeared, including traditional workflow modeling, flexible workflow modeling based on multi-agent systems, and dynamic composition modeling based on Web services. These technologies have problems in practice. First, they all require workflow design: a process designer must build an accurate model that describes the business process in detail, which demands a high level of business knowledge, deep workflow expertise, and the participation of domain experts. Second, manually designed process models are often influenced by subjective factors, which greatly affects the accuracy and quality of process mining.
Disclosure of Invention
In order to overcome the defect that the traditional process modeling technology influences the accuracy of process mining, the invention provides a process mining method, which comprises the following steps:
acquiring event log information for the required process model;
applying a heuristic algorithm that computes causal dependency measures over the event log information to obtain a cause and effect matrix for each candidate process model;
optimizing all the cause and effect matrices with a genetic algorithm to obtain an optimal cause and effect matrix;
and converting the optimal cause and effect matrix into a process model, which is taken as the optimal process model.
Preferably, optimizing all the cause and effect matrices with a genetic algorithm to obtain an optimal cause and effect matrix includes:
constructing a population from all the cause and effect matrices, and taking as the fitness function a function closely related to the number of activity sequences that can be correctly parsed from the event log information;
selecting a fitness function based on whether the process model reflected by the cause and effect matrix contains activity sequences outside the given event log information;
calculating the fitness of each individual with the selected fitness function, and taking the individual with the maximum fitness as the optimal individual;
and taking the cause and effect matrix corresponding to the optimal individual as the optimal cause and effect matrix.
Preferably, calculating the fitness of each individual with the selected fitness function and taking the individual with the maximum fitness as the optimal individual includes:
step 1, initializing the evolution iteration count and setting the maximum number of evolution iterations;
step 2, calculating the fitness of each individual in the population with the selected fitness function to obtain its fitness value;
step 3, eliminating individuals with low fitness values to obtain the selected individuals;
step 4, applying crossover and then mutation to the selected individuals to obtain new individuals, which together form the next-generation population; then checking whether the maximum number of evolution iterations has been reached or an individual with a fitness of 1 exists; if either condition holds, proceeding to step 5, and otherwise returning to step 2;
step 5, taking as the optimal individual either the individual with a fitness of 1 or, when the maximum number of evolution iterations has been reached, the individual with the maximum fitness in the population.
Preferably, selecting a fitness function based on whether the process model reflected by the cause and effect matrix contains activity sequences outside the given log includes:
if the process model reflected by the cause and effect matrix contains no activity sequence outside the given log, selecting the first fitness function; otherwise, selecting the second fitness function.
Preferably, the first fitness function is given by the following formula:
Figure BDA0003913917770000021
In the above formula, $f_1$ is the first fitness function; $N_1$ is the number of all correctly parsed activities; $N_2$ is the total number of activities in the log; $N_3$ is the number of all correctly parsed activity sequences; and $N_4$ is the total number of activity sequences in the log.
The second fitness function is given by the following formula:
Figure BDA0003913917770000022
In the above formula, $f_2$ is the second fitness function; $N_1$ is the number of all correctly parsed activities; $N_3$ is the number of all correctly parsed activity sequences; $N_5$ is the total number of activities in the corresponding process model; and $N_6$ is the total number of activity sequences in the corresponding process model.
In another aspect, the present invention further provides a process mining system, including:
an acquisition module, configured to acquire event log information for the required process model;
a first obtaining module, configured to apply a heuristic algorithm that computes causal dependency measures over the event log information to obtain a cause and effect matrix for each candidate process model;
a second obtaining module, configured to optimize all the cause and effect matrices with a genetic algorithm to obtain an optimal cause and effect matrix;
and a third obtaining module, configured to convert the optimal cause and effect matrix into a process model, which is taken as the optimal process model.
Preferably, the second obtaining module includes:
a first preparation submodule, configured to construct a population from all the cause and effect matrices, and to take as the fitness function a function closely related to the number of activity sequences that can be correctly parsed from the event log information;
a function selection module, configured to select a fitness function based on whether the process model reflected by the cause and effect matrix contains activity sequences outside the given event log information;
an optimal individual generation submodule, configured to calculate the fitness of each individual with the selected fitness function and to take the individual with the maximum fitness as the optimal individual;
and a cause and effect matrix submodule, configured to take the cause and effect matrix corresponding to the optimal individual as the optimal cause and effect matrix.
Preferably, the optimal individual generation submodule is specifically configured to perform:
step 1, initializing the evolution iteration count and setting the maximum number of evolution iterations;
step 2, calculating the fitness of each individual in the population with the selected fitness function to obtain its fitness value;
step 3, eliminating individuals with low fitness values to obtain the selected individuals;
step 4, applying crossover and then mutation to the selected individuals to obtain new individuals, which together form the next-generation population; then checking whether the maximum number of evolution iterations has been reached or an individual with a fitness of 1 exists; if either condition holds, proceeding to step 5, and otherwise returning to step 2;
step 5, taking as the optimal individual either the individual with a fitness of 1 or, when the maximum number of evolution iterations has been reached, the individual with the maximum fitness in the population.
Preferably, the function selection module includes:
a function determination submodule, configured to select the first fitness function if the process model reflected by the cause and effect matrix contains no activity sequence outside the given log, and otherwise to select the second fitness function.
Preferably, the first fitness function is given by the following formula:
Figure BDA0003913917770000041
In the above formula, $f_1$ is the first fitness function; $N_1$ is the number of all correctly parsed activities; $N_2$ is the total number of activities in the log; $N_3$ is the number of all correctly parsed activity sequences; and $N_4$ is the total number of activity sequences in the log.
Preferably, the second fitness function is given by the following formula:
Figure BDA0003913917770000042
In the above formula, $f_2$ is the second fitness function; $N_1$ is the number of all correctly parsed activities; $N_3$ is the number of all correctly parsed activity sequences; $N_5$ is the total number of activities in the corresponding process model; and $N_6$ is the total number of activity sequences in the corresponding process model.
Compared with the prior art, the invention has the following beneficial effects:
The invention provides a process mining method and system. The method comprises: acquiring event log information for the required process model; applying a heuristic algorithm that computes causal dependency measures over the event log information to obtain a cause and effect matrix for each candidate process model; optimizing all the cause and effect matrices with a genetic algorithm to obtain an optimal cause and effect matrix; and converting the optimal cause and effect matrix into a process model, which is taken as the optimal process model. Because the cause and effect matrices are obtained with a heuristic algorithm and the optimal matrix is determined with a genetic algorithm, the method shortens search time and strengthens local search. It can also handle special structures in a process, such as invisible tasks and non-free-choice constructs, which improves the process mining result.
Drawings
FIG. 1 is a schematic flow chart of the main steps of a process mining method according to an embodiment of the present invention;
FIG. 2 is a main structural diagram of a process mining system according to an embodiment of the present invention.
Detailed Description
The following provides a more detailed description of embodiments of the present invention, with reference to the accompanying drawings.
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1
The invention provides a process mining method which, as shown in FIG. 1, comprises the following steps:
step S101: acquiring event log information for the required process model;
step S102: applying a heuristic algorithm that computes causal dependency measures over the event log information to obtain a cause and effect matrix for each candidate process model;
step S103: optimizing all the cause and effect matrices with a genetic algorithm to obtain an optimal cause and effect matrix;
step S104: converting the optimal cause and effect matrix into a process model, which is taken as the optimal process model.
The specific process of step S102 is:
An initial population is established from the data read from the process event log by computing a heuristic rule for causal dependencies. All activities occurring in an event log constitute a set of process activities A, on the basis of which an initial population of cause and effect matrices can be randomly created. The causal relation C and the condition functions I and O are created randomly, where C can be initialized with a heuristic that computes causal dependencies: if the pattern a1a2 appears frequently in the event log while the pattern a2a1 appears only exceptionally, then (a1, a2) ∈ C can be defined. The more activities the event log contains, the larger the search space of the genetic algorithm.
The dependency $D(t_1, t_2)$ between any two tasks $t_1$ and $t_2$ is computed as follows.
Definition (dependency): let $t_1$ and $t_2$ be two events of the log. The dependency between $t_1$ and $t_2$ is defined as:
Figure BDA0003913917770000051
In the above formula, $D(t_1, t_2)$ is the dependency between tasks $t_1$ and $t_2$; $t_1$ and $t_2$ are log events; $T(t_1, t_2)$ is the number of times the event sequence $t_1 t_2$ appears in the log; $T(t_2, t_1)$ is the number of times the event sequence $t_2 t_1$ appears in the log; $L_1L(t_1, t_2)$ is the number of times the event sequence $t_1 t_1$ appears in the log; and $L_2L(t_1, t_2)$ is the number of times the event sequence $t_1 t_2 t_1$ appears in the log.
If $T(t_1, t_2)$ is far greater than $T(t_2, t_1)$, a causal relationship can be considered to exist between tasks $t_1$ and $t_2$.
For short loops, two special symbols are introduced: $L_1L(t_1, t_2)$ is the number of times task $t_1$ appears in the log in the event sequence $t_1 t_1$, and $L_2L(t_1, t_2)$ is the number of times tasks $t_1$ and $t_2$ appear in the log in the event sequence $t_1 t_2 t_1$.
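The three counts that feed the dependency measure can be gathered in one pass over the log. The following is a minimal sketch, assuming the log is a list of traces and each trace is a list of activity names; the function name `count_patterns` and the sample log are illustrative and not part of the original text. It only collects the counts and does not reproduce the dependency formula itself, which appears in the original as an image.

```python
from collections import Counter

def count_patterns(log):
    """Count the patterns described above.

    follows[(a, b)] : times "a b" occurs   -> T(a, b)
    loop1[a]        : times "a a" occurs   -> length-one loop count
    loop2[(a, b)]   : times "a b a" occurs -> length-two loop count
    """
    follows, loop1, loop2 = Counter(), Counter(), Counter()
    for trace in log:
        for i in range(len(trace) - 1):
            a, b = trace[i], trace[i + 1]
            follows[(a, b)] += 1
            if a == b:
                loop1[a] += 1
            elif i + 2 < len(trace) and trace[i + 2] == a:
                loop2[(a, b)] += 1
    return follows, loop1, loop2

# Illustrative log with one length-one loop (B B) and one length-two loop (A B A).
log = [["A", "B", "C"], ["A", "B", "B", "C"], ["A", "B", "A", "B", "C"]]
follows, loop1, loop2 = count_patterns(log)
print(follows[("A", "B")], follows[("B", "A")], loop1["B"], loop2[("A", "B")])
```

In this sample, "A B" occurs far more often than "B A", which is the situation in which the text treats A and B as causally related.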
In this embodiment, the cause and effect matrices are determined with a heuristic algorithm, which shortens search time, improves local search, and speeds up process mining.
Step S103 specifically includes:
step S103a, constructing a population from all the cause and effect matrices, and taking as the fitness function a function closely related to the number of activity sequences that can be correctly parsed from the event log information;
step S103b, selecting a fitness function based on whether the process model reflected by the cause and effect matrix contains activity sequences outside the given event log information;
step S103c, calculating the fitness of each individual with the selected fitness function, and taking the individual with the maximum fitness as the optimal individual;
step S103d, taking the cause and effect matrix corresponding to the optimal individual as the optimal cause and effect matrix.
Step S103c specifically includes:
step 1, initializing the evolution iteration count and setting the maximum number of evolution iterations;
step 2, calculating the fitness of each individual in the population with the selected fitness function to obtain its fitness value;
step 3, eliminating individuals with low fitness values to obtain the selected individuals;
step 4, applying crossover and then mutation to the selected individuals to obtain new individuals, which together form the next-generation population; then checking whether the maximum number of evolution iterations has been reached or an individual with a fitness of 1 exists; if either condition holds, proceeding to step 5, and otherwise returning to step 2;
step 5, taking as the optimal individual either the individual with a fitness of 1 or, when the maximum number of evolution iterations has been reached, the individual with the maximum fitness in the population.
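Steps 1 to 5 can be sketched as a generic loop. This is a minimal illustration under stated assumptions, not the patent's implementation: the names `evolve`, `keep_ratio`, `ones_ratio`, `one_point` and `flip` are hypothetical, the population is assumed to hold at least two individuals, and the check for a fitness of 1 is made right after evaluation at the start of each generation. The toy individuals are bit tuples rather than cause and effect matrices.

```python
import random

def evolve(population, fitness, crossover, mutate, max_generations,
           keep_ratio=0.5, rng=None):
    rng = rng or random.Random(0)
    for _ in range(max_generations):                    # step 1: bounded iterations
        scored = sorted(population, key=fitness, reverse=True)   # step 2
        if fitness(scored[0]) >= 1.0:                   # perfect individual found
            return scored[0]
        survivors = scored[:max(2, int(len(scored) * keep_ratio))]  # step 3
        children = []
        while len(children) < len(population):          # step 4
            p1, p2 = rng.sample(survivors, 2)
            c1, c2 = crossover(p1, p2, rng)
            children.extend([mutate(c1, rng), mutate(c2, rng)])
        population = children[:len(population)]
    return max(population, key=fitness)                 # step 5

# Toy problem: maximize the share of ones in a bit tuple.
def ones_ratio(bits):
    return sum(bits) / len(bits)

def one_point(p1, p2, rng):
    cut = rng.randrange(1, len(p1))
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]

def flip(bits, rng, rate=0.01):
    return tuple(b ^ 1 if rng.random() < rate else b for b in bits)

start = [tuple(random.Random(i).choices([0, 1], k=8)) for i in range(6)]
best = evolve(start, ones_ratio, one_point, flip, max_generations=20)
print(best, ones_ratio(best))
```

Passing the fitness, crossover, and mutation operators as arguments keeps the loop independent of how individuals are encoded.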
The specific process of step S103 is:
The input/output condition functions are created randomly, with the number of distinct tasks in the log used as a threshold to limit the number of subsets in each input/output function. Once the individual encoding, the fitness function, the genetic operators, and the recommended initial population size have been chosen, the number of iterations of the heuristic genetic algorithm can be specified.
The fitness of each individual in the initial population is then calculated. If an individual in the population correctly describes the behavior recorded in the event log, its fitness is high.
The present invention defines fitness as closely related to the number of activity sequences that can be correctly parsed from the event log. "Correctly parsed" means that, after parsing, the end place of the process model is the only marked place. In the ideal case without noisy data, the fitness of the model can be 1 (100%), i.e. all activity sequences can be parsed. In practice, the fitness value lies between 0 and 1. Thus, when the process model reflected by the cause and effect matrix contains no activity sequence outside the given log, the fitness of an individual can be calculated with the first fitness function:
Figure BDA0003913917770000071
In the above formula, $f_1$ is the first fitness function; $N_1$ is the number of all correctly parsed activities; $N_2$ is the total number of activities in the log; $N_3$ is the number of all correctly parsed activity sequences; and $N_4$ is the total number of activity sequences in the log.
In this embodiment, the total number of activities in the log is defined as the sum of the occurrences of all process activities in the event log, with repeated events counted; the total number of activity sequences in the log is defined as the total number of activity sequences in the event log; the number of all correctly parsed activities is defined as the total number of activities parsed across all activity sequences of the event log; and the number of all correctly parsed activity sequences is defined as the number of event log activity sequences that are correctly parsed.
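The four counts defined above can be gathered as follows. This is a rough sketch: `fitness_counts` is a hypothetical helper, and `replay` stands in for whatever routine parses a trace against the model and reports how many activities were parsed and whether the trace completed correctly. The toy replay below accepts only the chain A, B, C and is purely illustrative. The sketch stops at the counts, because the formula that combines them appears in the original only as an image.

```python
def fitness_counts(log, replay):
    """Return (N1, N2, N3, N4) as defined in the text.

    replay(trace) -> (parsed_activities, completed) is an assumed callback.
    """
    n2 = sum(len(trace) for trace in log)   # activities in the log, repeats counted
    n4 = len(log)                           # activity sequences in the log
    n1 = n3 = 0
    for trace in log:
        parsed, completed = replay(trace)
        n1 += parsed                        # correctly parsed activities
        n3 += int(completed)                # correctly parsed sequences
    return n1, n2, n3, n4

# Toy model: A must come first, then only A->B and B->C are allowed, ending in C.
ALLOWED = {("A", "B"), ("B", "C")}

def toy_replay(trace):
    parsed, prev = 0, None
    for act in trace:
        ok = (act == "A") if prev is None else ((prev, act) in ALLOWED)
        parsed += int(ok)
        prev = act
    completed = bool(trace) and parsed == len(trace) and trace[-1] == "C"
    return parsed, completed

sample_log = [["A", "B", "C"], ["A", "C"], ["A", "B", "C"]]
counts = fitness_counts(sample_log, toy_replay)
print(counts)
```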
When the process model derived from the cause and effect matrix has a sequence of activities outside the given log, the fitness of the individual may be calculated using a second fitness function:
Figure BDA0003913917770000072
In the above formula, $f_2$ is the second fitness function; $N_1$ is the number of all correctly parsed activities; $N_3$ is the number of all correctly parsed activity sequences; $N_5$ is the total number of activities in the corresponding process model; and $N_6$ is the total number of activity sequences in the corresponding process model.
In this embodiment, the total number of activities in the corresponding process model is defined as the sum of all activities in the process model obtained after mining, including repeatedly occurring activities, and the total number of activity sequences in the corresponding process model is defined as the total number of activity sequences that can be derived from the mined process model.
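The choice between the two fitness functions reduces to a set comparison when the behavior of the model can be enumerated. The sketch below assumes the model's activity sequences are available as a finite collection, which is a simplification (a model with loops allows unboundedly many sequences). The name `choose_fitness` and the string labels standing in for the two functions are illustrative.

```python
def choose_fitness(model_sequences, log, f1, f2):
    """Return f1 if the model allows nothing beyond the log, otherwise f2."""
    extra = {tuple(s) for s in model_sequences} - {tuple(t) for t in log}
    return f2 if extra else f1

given_log = [["A", "B", "C"], ["A", "C"]]
same_behavior = [["A", "B", "C"], ["A", "C"]]
extra_behavior = [["A", "B", "C"], ["A", "C"], ["A", "B", "B", "C"]]

print(choose_fitness(same_behavior, given_log, "first", "second"))
print(choose_fitness(extra_behavior, given_log, "first", "second"))
```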
Based on the fitness values obtained, it is determined whether the genetic algorithm can terminate. The genetic algorithm terminates under either of two conditions: the maximum number of generations has been computed, or an optimal cause and effect matrix has been generated. If the algorithm stops, the current population is returned. If neither stopping condition is met, a certain proportion of the individuals with high fitness is copied to the next generation, i.e. natural selection is performed, and the remaining individuals of the next generation are produced by applying crossover and mutation to the current population.
In this embodiment, the crossover probability is generally chosen between 0.6 and 1.0, and the mutation probability is generally chosen between 0.005 and 0.01.
The genetic algorithm performs the optimization operation on the individuals of the initial population according to the following steps to find the best individual:
Input: event log L, dependency D
Output: causal relation C
1. T ← the set of activities in L
2. For each
Figure BDA0003913917770000081
perform the following operations:
(a) Randomly select a real number r ∈ [0, 1)
(b) If r < D(a, b), then C ← C ∪ {(a, b)}
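The listing above can be sketched in a few lines. The loop range in step 2 is given only as an image in the original, so iterating over every ordered pair (a, b) of activities is an assumption here, chosen because step (b) uses D(a, b). The name `init_causal_relation` and the sample dependency function are illustrative.

```python
import random

def init_causal_relation(activities, dependency, rng=None):
    """Add (a, b) to the causal relation with probability given by dependency(a, b)."""
    rng = rng or random.Random()
    causal = set()
    for a in activities:            # assumed range: all ordered pairs of activities
        for b in activities:
            r = rng.random()        # real number in [0, 1)
            if r < dependency(a, b):
                causal.add((a, b))
    return causal

# Illustrative dependency: certain for two pairs, impossible for all others.
strong = {("A", "B"), ("B", "C")}
relation = init_causal_relation(["A", "B", "C"],
                                lambda a, b: 1.0 if (a, b) in strong else 0.0,
                                random.Random(42))
print(sorted(relation))
```

Because membership is decided by a random draw against D(a, b), repeated calls yield different relations, which is how a varied initial population can be produced from the same log.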
The crossover operation mainly comprises: allowing an individual to delete an activity from a subset of its input/output function and to add an activity to a subset of its input/output function; exchanging causal relations between different individuals; adding causal relations that exist in the population but not in the individual; removing causal relations; and increasing or decreasing the number of subsets of the input/output functions.
The crossover operation starts from two parents. The crossover operator randomly selects one activity a of the two parents as the crossover activity and, at the same time, randomly selects one crossover point in each of the input(a) and output(a) sets of the two parents. Then, according to the crossover probability, the subsets from the crossover point to the end of the set are swapped, so that the input/output sets of the crossover activity are recombined in the two parents. Finally, the recombined input/output sets are checked to ensure that they form a correct partition and that the cause and effect matrix remains consistent, which yields two new offspring.
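The tail swap at the heart of this crossover can be sketched for one activity's input set. The sketch assumes input(a) is stored as a list of subsets and takes the two crossover points as explicit arguments so the result is reproducible. The `repair` step, which drops activities already present in an earlier subset, is only one possible way to restore a partition and is an assumption, since the text does not spell out the check. The names `swap_tails` and `repair` are illustrative.

```python
def repair(subsets):
    """Drop duplicated activities so the subsets are pairwise disjoint (assumed rule)."""
    seen, fixed = set(), []
    for subset in subsets:
        remaining = frozenset(subset) - seen
        if remaining:
            fixed.append(remaining)
            seen |= remaining
    return fixed

def swap_tails(sets1, sets2, cut1, cut2):
    """Exchange the subsets from each crossover point to the end of the list."""
    child1 = repair(sets1[:cut1] + sets2[cut2:])
    child2 = repair(sets2[:cut2] + sets1[cut1:])
    return child1, child2

# input(D) in two parents, with a crossover point of 1 in each.
parent1 = [frozenset({"A"}), frozenset({"B", "C"})]
parent2 = [frozenset({"A", "B"}), frozenset({"C"})]
child1, child2 = swap_tails(parent1, parent2, 1, 1)
print(child1, child2)
```

A full operator would pick the activity and the cut points at random, apply the swap with the crossover probability, and treat output(a) in the same way.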
The mutation operation changes existing causal relations in the population. According to the mutation probability, the following operations can be performed on an individual: randomly selecting a subset from the input/output function of an activity and adding an activity from the activity set to that subset; randomly selecting a subset and removing an activity from it; and randomly redistributing the elements of the input/output subsets into new subsets.
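Two of the mutation operations, adding an activity to a subset and removing one from it, can be sketched as follows. The function name `mutate_subsets`, the equal split between adding and removing, and the rule that a single-element subset is never emptied are assumptions made for illustration. The redistribution of elements into new subsets is left out of this sketch.

```python
import random

def mutate_subsets(subsets, activities, rate, rng):
    """With probability `rate`, add or remove one activity in one random subset."""
    work = [set(s) for s in subsets]
    if work and rng.random() < rate:
        target = rng.choice(work)
        if rng.random() < 0.5:
            # Add an activity drawn from the full activity set.
            target.add(rng.choice(sorted(activities)))
        elif len(target) > 1:
            # Remove one activity, keeping the subset non-empty.
            target.remove(rng.choice(sorted(target)))
    return [frozenset(s) for s in work]

original = [frozenset({"A"}), frozenset({"B", "C"})]
mutated = mutate_subsets(original, {"A", "B", "C", "D"}, rate=1.0,
                         rng=random.Random(7))
print(mutated)
```

With the low mutation probabilities mentioned in the text (0.005 to 0.01), most calls return the input unchanged.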
In this embodiment, the search space of the genetic algorithm is defined, so that optimization over models with various structures is supported and the process mining result is improved. Individuals are evaluated with the selected fitness function, so fitness can be assessed both for individuals whose models contain no activity sequences outside the log information and for individuals whose models do, making the method applicable to multiple scenarios. The crossover and mutation operations ensure that all individuals in the population are covered, which improves the accuracy of the result.
The specific process of step S104 is:
Using the mapping from cause and effect matrices to Petri nets, the cause and effect matrix can be converted directly into a process model described by a Petri net.
In this embodiment, the process model is determined with an internal representation, so process models with multiple structures can be supported.
Example 2
Based on the same inventive concept, the present invention further provides a process mining system which, as shown in FIG. 2, includes:
an acquisition module, configured to acquire event log information for the required process model;
a first obtaining module, configured to apply a heuristic algorithm that computes causal dependency measures over the event log information to obtain a cause and effect matrix for each candidate process model;
a second obtaining module, configured to optimize all the cause and effect matrices with a genetic algorithm to obtain an optimal cause and effect matrix;
and a third obtaining module, configured to convert the optimal cause and effect matrix into a process model, which is taken as the optimal process model.
The specific process of the first obtaining module is:
An initial population is established from the data read from the process event log by computing a heuristic rule for causal dependencies. All activities occurring in an event log constitute a set of process activities A, on the basis of which an initial population of cause and effect matrices can be randomly created. The causal relation C and the condition functions I and O are created randomly, where C can be initialized with a heuristic that computes causal dependencies: if the pattern a1a2 appears frequently in the event log while the pattern a2a1 appears only exceptionally, then (a1, a2) ∈ C can be defined. The more activities the event log contains, the larger the search space of the genetic algorithm.
The dependency $D(t_1, t_2)$ between any two tasks $t_1$ and $t_2$ is computed as follows.
Definition (dependency): let $t_1$ and $t_2$ be two events of the log. The dependency between $t_1$ and $t_2$ is defined as:
Figure BDA0003913917770000101
In the above formula, $D(t_1, t_2)$ is the dependency between tasks $t_1$ and $t_2$; $t_1$ and $t_2$ are log events; $T(t_1, t_2)$ is the number of times the event sequence $t_1 t_2$ appears in the log; $T(t_2, t_1)$ is the number of times the event sequence $t_2 t_1$ appears in the log; $L_1L(t_1, t_2)$ is the number of times the event sequence $t_1 t_1$ appears in the log; and $L_2L(t_1, t_2)$ is the number of times the event sequence $t_1 t_2 t_1$ appears in the log.
If $T(t_1, t_2)$ is far greater than $T(t_2, t_1)$, a causal relationship can be considered to exist between tasks $t_1$ and $t_2$.
For short loops, two special symbols are introduced: $L_1L(t_1, t_2)$ is the number of times task $t_1$ appears in the log in the event sequence $t_1 t_1$, and $L_2L(t_1, t_2)$ is the number of times tasks $t_1$ and $t_2$ appear in the log in the event sequence $t_1 t_2 t_1$.
In this embodiment, the population of cause and effect matrices is determined with a heuristic algorithm, which shortens search time, improves local search, and speeds up process mining.
The second obtaining module specifically includes:
a first preparation submodule, configured to construct a population from all the cause and effect matrices, and to take as the fitness function a function closely related to the number of activity sequences that can be correctly parsed from the event log information;
a function selection module, configured to select a fitness function based on whether the process model reflected by the cause and effect matrix contains activity sequences outside the given event log information;
an optimal individual generation submodule, configured to calculate the fitness of each individual with the selected fitness function and to take the individual with the maximum fitness as the optimal individual;
and a cause and effect matrix submodule, configured to take the cause and effect matrix corresponding to the optimal individual as the optimal cause and effect matrix.
The optimal individual generation submodule is specifically configured to:
step 1, initializing evolution iteration times and setting maximum evolution iteration times;
step 2, calculating the fitness of the individuals in the population by using the selected fitness function to obtain the fitness value of the individuals;
step 3, eliminating individuals with low fitness values based on the fitness values of the individuals to obtain selected individuals;
step 4, performing crossover and mutation operations on the selected individuals in sequence to obtain new individuals, forming the next-generation population from all the new individuals, and judging whether the maximum number of evolution iterations has been reached or whether an individual with a fitness of 1 exists; if either condition holds, proceeding to step 5; otherwise, returning to step 2;
step 5, taking as the optimal individual the individual with a fitness of 1, or, when the maximum number of evolution iterations has been reached, the individual with the maximum fitness in the population.
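Steps 1 to 5 can be sketched as a generic loop. This is a minimal illustration, not the patented implementation: the function name `evolve`, the `survival_ratio` parameter, and the use of truncation selection for step 3 are assumptions, and the fitness, crossover, and mutation operators are passed in as callbacks.

```python
import random
from typing import Callable, List, Optional, TypeVar

Individual = TypeVar("Individual")


def evolve(
    population: List[Individual],
    fitness: Callable[[Individual], float],
    crossover: Callable[[Individual, Individual], Individual],
    mutate: Callable[[Individual], Individual],
    max_generations: int = 100,
    survival_ratio: float = 0.5,  # hypothetical: share of individuals kept in step 3
    rng: Optional[random.Random] = None,
) -> Individual:
    """Run steps 1-5; assumes the population holds at least two individuals."""
    rng = rng or random.Random()
    generation = 0  # step 1: initialise the iteration counter
    while True:
        # step 2: evaluate fitness and rank the population
        ranked = sorted(population, key=fitness, reverse=True)
        # step 3: eliminate the low-fitness tail
        keep = max(2, int(len(ranked) * survival_ratio))
        selected = ranked[:keep]
        # step 4: crossover, then mutation, to build a full next generation
        population = [
            mutate(crossover(*rng.sample(selected, 2)))
            for _ in range(len(ranked))
        ]
        generation += 1
        best = max(population, key=fitness)
        if generation >= max_generations or fitness(best) >= 1.0:
            return best  # step 5: best individual at termination
```

In this sketch the whole population is replaced in every generation, so the best individual of one generation is not guaranteed to survive into the next.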
The function selection module specifically comprises:
a function determination submodule, configured to select the first fitness function if the process model reflected by the cause and effect matrix contains no activity sequence outside the given log; otherwise, the second fitness function is selected.
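The selection rule can be illustrated with a short sketch. It assumes that the activity sequences the model can produce are available as a finite collection (for example, enumerated up to a bounded length); the names `has_extra_behavior` and `select_fitness` are hypothetical.

```python
from typing import Callable, Iterable, Sequence


def has_extra_behavior(
    model_traces: Iterable[Sequence[str]],
    log_traces: Iterable[Sequence[str]],
) -> bool:
    """Return True if the model allows an activity sequence absent from the log."""
    log_set = {tuple(trace) for trace in log_traces}
    return any(tuple(trace) not in log_set for trace in model_traces)


def select_fitness(
    model_traces: Iterable[Sequence[str]],
    log_traces: Iterable[Sequence[str]],
    first: Callable,
    second: Callable,
) -> Callable:
    """Pick the first fitness function unless the model shows extra behavior."""
    if has_extra_behavior(model_traces, log_traces):
        return second
    return first
```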
The specific process of the second obtaining module is as follows:
the input/output condition functions are created randomly, with the number of distinct tasks in the log used as a threshold to limit the number of subsets in each input/output function. Once the individual encoding, the fitness function, and the genetic operators have been defined and an initial population size has been chosen, the number of iterations of the heuristic genetic algorithm can be specified.
The fitness of each individual in the initial population is then calculated. If an individual correctly describes the behavior recorded in the event log, its fitness is high.
The present invention defines fitness as closely related to the number of activity sequences in the event log that can be correctly parsed. "Correctly parsed" means that, after parsing, the end place of the process model is the only marked place. In the ideal case without noisy data, the fitness of the model can be 1 (100%), i.e. all activity sequences can be parsed. In practice, the fitness value lies between 0 and 1. Thus, when the process model reflected by the cause and effect matrix contains no activity sequence outside the given log, the fitness of an individual can be calculated with the first fitness function:
Figure BDA0003913917770000111
in the above formula, f_1 is the first fitness function; N_1 is the number of all correctly parsed activities; N_2 is the total number of activities in the log; N_3 is the number of all correctly parsed activity sequences; N_4 is the total number of activity sequences in the log.
In this embodiment, the total number of activities in the log is defined as the sum of the occurrences of all process activities in the event log, with repeated events counted; the total number of activity sequences in the log is defined as the total number of activity sequences in the event log; the number of all correctly parsed activities is defined as the total number of activities parsed across all activity sequences of the event log; the number of all correctly parsed activity sequences is defined as the number of activity sequences of the event log that are correctly parsed.
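The four quantities defined above can be gathered in one pass over the log. The sketch below assumes the log is a list of activity sequences and that a hypothetical `parse` callback replays one sequence on the candidate model, returning how many activities it parsed and whether the sequence was parsed correctly. The sketch stops at the counts, because the formula that combines them is given only as a figure.

```python
from typing import Callable, List, Sequence, Tuple


def count_parsing_statistics(
    log: List[Sequence[str]],
    parse: Callable[[Sequence[str]], Tuple[int, bool]],
) -> Tuple[int, int, int, int]:
    """Return (N_1, N_2, N_3, N_4) for one candidate model."""
    n1 = n2 = n3 = 0
    for trace in log:
        parsed_activities, parsed_correctly = parse(trace)
        n1 += parsed_activities  # N_1: correctly parsed activities
        n2 += len(trace)         # N_2: activities in the log, repeats counted
        if parsed_correctly:
            n3 += 1              # N_3: correctly parsed activity sequences
    n4 = len(log)                # N_4: activity sequences in the log
    return n1, n2, n3, n4
```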
When the process model derived from the cause and effect matrix has a sequence of activities outside the given log, the fitness of the individual may be calculated using a second fitness function:
Figure BDA0003913917770000121
in the above formula, f_2 is the second fitness function; N_1 is the number of all correctly parsed activities; N_3 is the number of all correctly parsed activity sequences; N_5 is the total number of activities in the corresponding process model; N_6 is the total number of activity sequences in the corresponding process model.
In this embodiment, the total number of activities in the corresponding process model is defined as the sum of all activities in the mined process model, including repeated activities, and the total number of activity sequences in the corresponding process model is defined as the sum of all activity sequences that can be derived from the mined process model.
Whether the genetic algorithm can end is judged from the obtained fitness data. The genetic algorithm ends under either of two conditions: the maximum number of generations has been reached, or an optimal cause and effect matrix has been generated. If the algorithm stops, the current population is returned. If neither stopping condition is met, the individuals with high fitness are copied to the next generation in a certain proportion, i.e. natural selection is performed, and the remaining individuals of the next generation are generated by applying crossover and mutation to the current population.
In this embodiment, the crossover probability is generally chosen in the range 0.6 to 1.0, and the mutation probability in the range 0.005 to 0.01.
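The construction of the next generation might look like the following sketch. The `elite_ratio` value is hypothetical, since the text only says "a certain proportion". The defaults for `p_crossover` and `p_mutation` are taken from the ranges stated above, and applying the mutation probability once per individual is an assumption.

```python
import random
from typing import Callable, List, Optional, TypeVar

Individual = TypeVar("Individual")


def next_generation(
    population: List[Individual],
    fitness: Callable[[Individual], float],
    crossover: Callable[[Individual, Individual], Individual],
    mutate: Callable[[Individual], Individual],
    elite_ratio: float = 0.1,   # hypothetical proportion of copied individuals
    p_crossover: float = 0.8,   # within the stated 0.6-1.0 range
    p_mutation: float = 0.01,   # within the stated 0.005-0.01 range
    rng: Optional[random.Random] = None,
) -> List[Individual]:
    rng = rng or random.Random()
    ranked = sorted(population, key=fitness, reverse=True)
    n_elite = max(1, int(len(ranked) * elite_ratio))
    offspring = list(ranked[:n_elite])  # high-fitness individuals are copied over
    while len(offspring) < len(population):
        parent_a = rng.choice(population)
        parent_b = rng.choice(population)
        # Crossover and mutation are each applied with their own probability.
        child = crossover(parent_a, parent_b) if rng.random() < p_crossover else parent_a
        if rng.random() < p_mutation:
            child = mutate(child)
        offspring.append(child)
    return offspring
```

Copying the elite first means the best fitness in the population never decreases from one generation to the next.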
The genetic algorithm performs optimization operation on the genetic individuals of the initial population according to the following steps to find the best individual:
Input: event log L, dependency measure D
Output: causal relation C
1. T ← the set of activities in L
2. For
Figure BDA0003913917770000122
the following operations are performed:
(a) Randomly select a real number r ∈ [0, 1)
(b) If r < D(a, b), then C ← C ∪ {(a, b)}
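The listing above can be sketched in Python as follows. The loop range is given only as a figure, so iterating over every ordered pair (a, b) of activities in T is an assumption here, as are the function name and the representation of D as a callable returning a value in [0, 1].

```python
import random
from itertools import product
from typing import Callable, List, Optional, Sequence, Set, Tuple


def sample_causal_relation(
    log: List[Sequence[str]],
    dependency: Callable[[str, str], float],
    rng: Optional[random.Random] = None,
) -> Set[Tuple[str, str]]:
    """Randomly build a causal relation C, biased by the dependency measure D."""
    rng = rng or random.Random()
    # Step 1: T is the set of activities occurring in the log.
    activities = sorted({activity for trace in log for activity in trace})
    causal: Set[Tuple[str, str]] = set()
    # Step 2 (assumed range): visit every ordered pair of activities.
    for a, b in product(activities, repeat=2):
        r = rng.random()  # (a) real number in [0, 1)
        if r < dependency(a, b):
            causal.add((a, b))  # (b) keep the pair with probability D(a, b)
    return causal
```

Because each pair is kept with probability D(a, b), repeated calls yield different individuals that all favor strongly dependent pairs, which is one way to seed a diverse initial population.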
The crossover operation mainly comprises: allowing an individual to delete an activity from a subset of an input/output function or to add an activity to such a subset; exchanging causal relations between different individuals; adding causal relations that exist in the population but not in the individual; removing causal relations; and increasing or decreasing the number of subsets of the input/output functions.
The crossover operation starts from two parents. The crossover operator randomly selects one activity a of the two parents as the crossover point and, at the same time, randomly selects a swap point in the Input(a) and Output(a) sets of each parent. Then, according to the crossover probability, the subsets from the swap point to the end of each set are exchanged, which recombines the input/output sets of the crossover point in the two parents. The recombined input/output sets are then checked to ensure that the exchanged partitions are valid and that the cause and effect matrix remains consistent, yielding two new offspring.
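A minimal sketch of the tail swap follows. It assumes a hypothetical encoding in which an individual is a dictionary with "input" and "output" keys, each mapping an activity to a list of subsets, and that both parents share the same activities. The consistency check and repair described in the text are omitted.

```python
import copy
import random
from typing import Dict, FrozenSet, List, Optional, Tuple

Subsets = List[FrozenSet[str]]
Individual = Dict[str, Dict[str, Subsets]]


def swap_tails(xs: Subsets, ys: Subsets, rng: random.Random) -> Tuple[Subsets, Subsets]:
    """Pick one swap point per parent and exchange everything after it."""
    i = rng.randint(0, len(xs))
    j = rng.randint(0, len(ys))
    return xs[:i] + ys[j:], ys[:j] + xs[i:]


def crossover(
    parent1: Individual,
    parent2: Individual,
    p_crossover: float = 0.8,
    rng: Optional[random.Random] = None,
) -> Tuple[Individual, Individual]:
    rng = rng or random.Random()
    child1, child2 = copy.deepcopy(parent1), copy.deepcopy(parent2)
    if rng.random() >= p_crossover:
        return child1, child2  # no crossover: offspring are copies of the parents
    a = rng.choice(sorted(child1["input"]))  # crossover point: one activity
    for side in ("input", "output"):
        child1[side][a], child2[side][a] = swap_tails(
            child1[side][a], child2[side][a], rng
        )
    # Note: repairing overlapping subsets and re-synchronising the other
    # activities' input/output sets is left out of this sketch.
    return child1, child2
```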
The mutation operation changes the existing cause and effect relationships in the population. According to the mutation probability, the following operations can be performed on an individual: randomly selecting a subset from the input/output function of an activity in the individual and adding an activity from the activity set to that subset; randomly selecting a subset and removing an activity from it; and randomly redistributing the elements of the input/output subsets into new subsets.
In this embodiment, the search space of the genetic algorithm is defined so that models with various structures can be optimized, which improves the process mining result. Individual fitness is evaluated with the selected fitness function, which covers both individuals whose process models contain no activity sequences outside the log and individuals whose process models do, so the method suits multiple scenarios. Applying crossover and mutation operations ensures that all individuals in the population are covered, which improves the accuracy of the result.
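The three mutation operations can be sketched for the subsets of a single input or output function. The function name, the uniform choice among the three operations, and the dropping of empty subsets are assumptions made for illustration.

```python
import random
from typing import FrozenSet, Iterable, List, Optional, Set


def mutate_subsets(
    subsets: Iterable[FrozenSet[str]],
    activities: Set[str],
    p_mutation: float = 0.01,
    rng: Optional[random.Random] = None,
) -> List[FrozenSet[str]]:
    """Mutate the subsets of one input/output function with probability p_mutation."""
    rng = rng or random.Random()
    result = [set(s) for s in subsets if s]
    if rng.random() >= p_mutation:
        return [frozenset(s) for s in result]  # unchanged
    operation = rng.choice(("add", "remove", "redistribute"))
    if operation == "add" and result:
        # Add an activity from the activity set to a randomly chosen subset.
        rng.choice(result).add(rng.choice(sorted(activities)))
    elif operation == "remove" and result:
        # Remove one activity from a randomly chosen subset.
        target = rng.choice(result)
        target.discard(rng.choice(sorted(target)))
    elif operation == "redistribute" and result:
        # Randomly re-allocate all elements to a new set of subsets.
        elements = sorted(set().union(*result))
        groups = [set() for _ in range(rng.randint(1, len(elements)))]
        for element in elements:
            rng.choice(groups).add(element)
        result = groups
    return [frozenset(s) for s in result if s]  # drop subsets left empty
```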
The third obtaining module comprises the following specific processes:
using the method for mapping a cause and effect matrix to a Petri net, the cause and effect matrix can be converted directly into a process model described as a Petri net.
In this embodiment, the process model is determined from its internal representation, so process models with multiple structures can be supported.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

Claims (12)

1. A process mining method, comprising:
acquiring event log information of a required process model;
calculating a cause-and-effect dependency relationship heuristic rule of the event log information by adopting a heuristic algorithm to obtain a cause-and-effect matrix of each process model;
optimizing all cause and effect matrixes by adopting a genetic algorithm to obtain an optimal cause and effect matrix;
and converting the optimal cause and effect matrix into a process model as an optimal process model.
2. The method of claim 1, wherein the performing an optimization operation on all cause and effect matrices using a genetic algorithm to obtain an optimal cause and effect matrix comprises:
constructing a population from all the cause and effect matrices, and taking as the fitness function a function closely related to the number of activity sequences that can be correctly parsed from the event log information;
selecting a fitness function based on whether the process model reflected by the cause and effect matrix contains an activity sequence outside the given event log information;
calculating the fitness of each individual by adopting the selected fitness function, and taking the individual with the maximum fitness as an optimal individual;
and taking the cause and effect matrix corresponding to the optimal individual as an optimal cause and effect matrix.
3. The method of claim 2, wherein calculating the fitness of each individual using the selected fitness function, and wherein selecting the individual with the highest fitness as the optimal individual comprises:
step 1, initializing evolution iteration times and setting maximum evolution iteration times;
step 2, calculating the fitness of the individuals in the population by using the selected fitness function to obtain the fitness value of the individuals;
step 3, based on the fitness value of the individuals, eliminating the individuals with low fitness value to obtain selected individuals;
step 4, performing crossover and mutation operations on the selected individuals in sequence to obtain new individuals, forming the next-generation population from all the new individuals, and judging whether the maximum number of evolution iterations has been reached or whether an individual with a fitness of 1 exists; if either condition holds, proceeding to step 5; otherwise, returning to step 2;
step 5, taking as the optimal individual the individual with a fitness of 1, or, when the maximum number of evolution iterations has been reached, the individual with the maximum fitness in the population.
4. The method of claim 2, wherein selecting an adaptive function based on whether there is a sequence of activities in the flow model that is reflected by the cause and effect matrix that is outside of the given event log information comprises:
if the process model reflected by the cause and effect matrix contains no activity sequence outside the given log, selecting the first fitness function; otherwise, selecting the second fitness function.
5. The method of claim 4, wherein the first fitness function is expressed as:
Figure FDA0003913917760000021
in the above formula, f_1 is the first fitness function; N_1 is the number of all correctly parsed activities; N_2 is the total number of activities in the log; N_3 is the number of all correctly parsed activity sequences; N_4 is the total number of activity sequences in the log.
6. The method of claim 4, wherein the second fitness function is represented by the following equation:
Figure FDA0003913917760000022
in the above formula, f_2 is the second fitness function; N_1 is the number of all correctly parsed activities; N_3 is the number of all correctly parsed activity sequences; N_5 is the total number of activities in the corresponding process model; N_6 is the total number of activity sequences in the corresponding process model.
7. A process mining system, comprising:
the acquisition module is used for acquiring event log information of a required process model;
the first obtaining module is used for calculating a cause-and-effect dependency relationship heuristic rule of the event log information by adopting a heuristic algorithm and obtaining a cause-and-effect matrix of each process model;
the second obtaining module is used for carrying out optimization operation on all cause and effect matrixes by adopting a genetic algorithm to obtain an optimal cause and effect matrix;
and the third obtaining module is used for converting the optimal cause and effect matrix into a process model as an optimal process model.
8. The system of claim 7, wherein the second obtaining module comprises:
a first preparation submodule, configured to construct a population from all the cause and effect matrices, and to take as the fitness function a function closely related to the number of activity sequences that can be correctly parsed from the event log information;
a function selection module, configured to select a fitness function based on whether the process model reflected by the cause and effect matrix contains an activity sequence outside the given event log information;
the optimal individual generation submodule is used for calculating the fitness of each individual by adopting the selected fitness function and taking the individual with the maximum fitness as the optimal individual;
and the cause and effect matrix submodule is used for taking the cause and effect matrix corresponding to the optimal individual as an optimal cause and effect matrix.
9. The system of claim 8, wherein the optimal individual generation submodule is specifically configured to:
step 1, initializing evolution iteration times and setting maximum evolution iteration times;
step 2, calculating the fitness of the individuals in the population by using the selected fitness function to obtain the fitness value of the individuals;
step 3, eliminating individuals with low fitness values based on the fitness values of the individuals to obtain selected individuals;
step 4, performing crossover and mutation operations on the selected individuals in sequence to obtain new individuals, forming the next-generation population from all the new individuals, and judging whether the maximum number of evolution iterations has been reached or whether an individual with a fitness of 1 exists; if either condition holds, proceeding to step 5; otherwise, returning to step 2;
step 5, taking as the optimal individual the individual with a fitness of 1, or, when the maximum number of evolution iterations has been reached, the individual with the maximum fitness in the population.
10. The system of claim 8, wherein the function selection module comprises:
a function determination submodule, configured to select the first fitness function if the process model reflected by the cause and effect matrix contains no activity sequence outside the given log; otherwise, the second fitness function is selected.
11. The system of claim 10, wherein the first fitness function is expressed as:
Figure FDA0003913917760000031
in the above formula, f_1 is the first fitness function; N_1 is the number of all correctly parsed activities; N_2 is the total number of activities in the log; N_3 is the number of all correctly parsed activity sequences; N_4 is the total number of activity sequences in the log.
12. The system of claim 10, wherein the second fitness function is expressed as:
Figure FDA0003913917760000032
in the above formula, f_2 is the second fitness function; N_1 is the number of all correctly parsed activities; N_3 is the number of all correctly parsed activity sequences; N_5 is the total number of activities in the corresponding process model; N_6 is the total number of activity sequences in the corresponding process model.
CN202211333392.4A 2022-10-28 2022-10-28 Process mining method and system Pending CN115688853A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211333392.4A CN115688853A (en) 2022-10-28 2022-10-28 Process mining method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211333392.4A CN115688853A (en) 2022-10-28 2022-10-28 Process mining method and system

Publications (1)

Publication Number Publication Date
CN115688853A true CN115688853A (en) 2023-02-03

Family

ID=85046608

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211333392.4A Pending CN115688853A (en) 2022-10-28 2022-10-28 Process mining method and system

Country Status (1)

Country Link
CN (1) CN115688853A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116777191A (en) * 2023-08-18 2023-09-19 安徽思高智能科技有限公司 Flow decision-dependent construction method based on causal inference and storage medium
CN116777191B (en) * 2023-08-18 2023-11-03 安徽思高智能科技有限公司 Flow decision-dependent construction method based on causal inference and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination