CN113139712B

CN113139712B - Machine learning-based extraction method for incomplete rules of activity attributes of process logs

Info

Publication number: CN113139712B
Application number: CN202110257681.XA
Authority: CN
Inventors: 聂富强; 叶旺; 孙曜
Original assignee: Hangzhou Dianzi University
Current assignee: Hangzhou Dianzi University
Priority date: 2021-03-09
Filing date: 2021-03-09
Publication date: 2024-02-09
Anticipated expiration: 2041-03-09
Also published as: CN113139712A

Abstract

The invention discloses a machine learning-based method for extracting incomplete rules of activity attributes from a flow log. The method comprises the following steps: step 1, preprocessing the log data, namely extracting the flow log data recorded in a business flow information management system, converting the log data from the XES format into a CSV format suitable for a machine learning algorithm, and dividing the flow log data into flow activity paths with the flow instance as the unit; step 2, after the flow log data is preprocessed, encoding each flow activity path and converting it into a flow feature vector; and step 3, classifying the flow feature vectors with a classification and regression decision tree (CART) from machine learning to construct a classification decision tree. The invention can greatly improve data analysis efficiency and provides a reference for analyzing the causes of missing log data. The method has good universality, high accuracy, and is easy to understand.

Description

Machine learning-based extraction method for incomplete rules of activity attributes of process logs

Technical Field

The invention relates to the field of business process management, in particular to a process log activity attribute incomplete rule extraction method based on machine learning.

Background

Process mining is a technology in the field of business process management that is mainly used to optimize existing enterprise resources by analyzing the process logs recorded in a business process information management system. Research on business process mining is mainly divided into three levels: process model discovery, conformance checking, and model enhancement. Process model discovery mines a process model from historical process logs; conformance checking mainly measures the degree of agreement between the mined model and the original model; model enhancement concerns how a mined model is used to improve and optimize a known model, organizational structure, and so on. At present, process model discovery is the most studied of the three, and it covers four mining dimensions: control flow, organization, case, and time. The mining dimensions available to process model discovery depend primarily on the data dimensions in the flow log.

The flow log is the execution history of flow instances recorded by the business flow information management system. Fig. 1 is a flow log fragment in which one flow instance (case) often contains multiple events (also called activities or tasks). An event contains a number of attributes, such as the ID of the flow instance, the ID of the event, the execution timestamp of the event, and the execution resources of the activity (the activity executor, the executing role, the devices required for execution, and so on). Most existing process mining methods assume complete flow log data. However, for technical reasons (such as system faults and resource limitations) or human reasons (such as manual recording errors), the log information system usually records a certain amount of data noise, such as missing data, inaccurately recorded data, and irrelevant data. For example, the event timestamp in fig. 1 should be accurate to the minute, but for some reason it is either not recorded or not recorded with sufficient precision. In data analysis this phenomenon is known as "garbage in, garbage out": business analysis based on poor-quality data produces only meaningless results, so improving data quality is of great significance for business process mining. There are two main ways to improve data quality. The first is to improve the capture method when the data is generated, so that log data is recorded as accurately as possible. The second is to repair the data after the log data has been acquired. Repairing log data mainly means filling in missing values or replacing inaccurate values with predicted values, but the accuracy of such predictions often falls short of the desired effect.

Therefore, the invention uses a machine learning method to extract the rules governing missing or inaccurately recorded log attribute values, and thereby supports the first way by helping to analyze why noise arises in the log data.

Event attributes in the flow log play a key role in process mining. For example, the case ID is typically used to group flow activities by case; the execution timestamp of an activity is usually used to find the flow execution path and to mine the control flow structures of the flow (such as choice, parallel, loop, and repetition structures in the flow model); the activity executor and the executing role are commonly used for organizational mining, in which the mined model is used for business analysis, optimization of the organizational structure, and so on. When event attribute values are missing or inaccurately recorded, it is difficult to mine an accurate flow model. The completeness of the log event attributes therefore determines the accuracy of process mining.

Disclosure of Invention

The invention provides a machine learning-based extraction method of activity attribute incomplete rules of a flow log, which aims to find a trend of activity attribute value deficiency or inaccurate record in the flow log and provides support for analyzing reasons of the deficiency or inaccurate record of the activity attribute value of the flow log.

A method for extracting incomplete rules of activity attributes of a flow log based on machine learning comprises the following steps:

step 1, preprocessing the log data: first extracting the flow log data recorded in a business flow information management system, converting the log data from the XES format into a CSV format suitable for a machine learning algorithm, and dividing the flow log data into a set of flow activity paths with the flow instance as the unit;

step 2, after the flow log data is preprocessed, encoding each flow activity path and converting the flow activity paths into flow feature vectors;

and step 3, classifying the flow feature vectors with a classification and regression decision tree (CART) from machine learning to construct a classification decision tree.

Further, the step 1 is specifically implemented as follows:

Taking Case ID 364868 as an example, the flow instance with Case ID 364868 is converted into a flow activity path recorded as trace = <A, B, C, D, E>, where A, B, C, D, E are unique identifiers of activity types. If all activity attributes in a flow activity path are complete, the flow activity path is complete. If an activity attribute in the flow activity path contains a missing or inaccurate value, the activity containing that value is denoted by "-" in the path; for example, if the time attribute value of activity B in the flow activity path trace is missing, the path is recorded as trace = <A, -, C, D, E>.
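The grouping described in step 1 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the row keys `case_id`, `activity`, and `timestamp`, the helper name `build_traces`, and the sample timestamps are all hypothetical, and rows are assumed to arrive already in execution order within each case.

```python
from collections import defaultdict

MISSING = "-"  # placeholder the text uses for an activity with an incomplete attribute


def build_traces(rows):
    """Group event rows by case ID into flow activity paths.

    Each row is a dict with the hypothetical keys 'case_id', 'activity'
    and 'timestamp'. An empty timestamp marks the activity as incomplete.
    """
    traces = defaultdict(list)
    for row in rows:
        complete = bool(row.get("timestamp"))
        traces[row["case_id"]].append(row["activity"] if complete else MISSING)
    return dict(traces)


# Invented sample rows: activity B has lost its timestamp.
rows = [
    {"case_id": "364868", "activity": "A", "timestamp": "2012-04-03 09:15"},
    {"case_id": "364868", "activity": "B", "timestamp": ""},
    {"case_id": "364868", "activity": "C", "timestamp": "2012-04-03 10:02"},
    {"case_id": "364868", "activity": "D", "timestamp": "2012-04-03 11:40"},
    {"case_id": "364868", "activity": "E", "timestamp": "2012-04-04 08:30"},
]
traces = build_traces(rows)
print(traces["364868"])
```

Only the timestamp is checked here; a fuller version would apply the same test to whichever attribute is being analyzed.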

Further, the step 2 is specifically implemented as follows:

The flow activity path is encoded using one-hot encoding: each activity of each path in the preprocessed flow log data is first traversed; if an activity attribute value is found to be missing or inaccurately recorded, a prefix variable Vpre is added to the ID of the activity immediately preceding that activity, and a suffix variable Vsuf is added to the ID of the activity immediately following it, so that the encoded activities are distinguished from the original activities in the flow feature vector.

Further, the prefix variable and the suffix variable are calculated as follows:

$$V_{suf} = \sum Type_{activity} + 1$$

$$V_{pre} = V_{suf} \times 2$$

where Type_{activity} denotes an activity type in the overall flow log, so the sum is the total number N of activity types in the flow log data. N is taken as the base variable: the suffix variable is the base variable plus 1, and the prefix variable is 2 times the suffix variable. Both variables are positive integers, and adding them to an activity ID distinguishes the encoded activity from the original activity.
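A short sketch of the two formulas, to show why the offsets keep original and encoded IDs apart. The function name `offset_variables` and the flag `ids_start_at_zero` are illustrative only; the zero-based shortcut follows the remark in the detailed description that the base variable can be used directly as the suffix variable when activity type codes start from 0.

```python
def offset_variables(num_activity_types, ids_start_at_zero=False):
    """Return (v_suf, v_pre) for a log with `num_activity_types` activity types."""
    # Base variable N plus 1, or N itself when IDs already run from 0 to N-1.
    v_suf = num_activity_types if ids_start_at_zero else num_activity_types + 1
    v_pre = 2 * v_suf
    return v_suf, v_pre


n = 13
v_suf, v_pre = offset_variables(n, ids_start_at_zero=True)

original_ids = set(range(n))                   # IDs 0..12
suffix_ids = {i + v_suf for i in original_ids}  # IDs of "following" activities
prefix_ids = {i + v_pre for i in original_ids}  # IDs of "preceding" activities

# The three ID ranges do not overlap, so an encoded ID is never confused
# with an original one or with the other kind of encoded ID.
print(v_suf, v_pre)
print(original_ids.isdisjoint(suffix_ids), suffix_ids.isdisjoint(prefix_ids))
```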

Further, if one flow active path contains incomplete value activity, the feature vector label value corresponding to the flow active path is set to 1, otherwise, 0 is set.

Further, the step 3 is specifically implemented as follows:

In the constructed classification decision tree, the leaf nodes represent the number of flow activity paths that contain attribute incomplete values, and the non-leaf nodes represent information about the activities before and after the activity containing the incomplete attribute value. The condition X <= Q in a non-leaf node is the decision condition of the path: when the feature X is less than or equal to Q, the decision tree branches to the left; when the feature X is greater than Q, the decision tree branches to the right, where Q is a set threshold. The samples value in a non-leaf node is the number of flow activity paths S = S1 + S2, of which S1 flow activity paths contain no incomplete values and S2 flow activity paths contain incomplete values. The S2 flow activity paths containing incomplete values are then analyzed and judged to reach a conclusion.

Further, the value of the threshold Q is 0.5, chosen in accordance with the convention of the scikit-learn machine learning library.

Further, the analysis rule for the flow activity path including the incomplete value is as follows:

rule 1: whether the flow path containing the incomplete value of the activity attribute has the activity execution of the same ID before the activity containing the incomplete value;

rule 2: whether the flow path containing the incomplete value of the activity attribute has the activity execution of the same ID after the activity containing the incomplete value;
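One way to check the two rules over a set of paths is to count which activity IDs appear next to an incomplete activity. The sketch below is an assumption-laden illustration: it reads "before" and "after" as immediately before and immediately after (matching the encoding step), and the helper name `neighbour_counts` and the sample paths are invented.

```python
from collections import Counter


def neighbour_counts(traces, missing="-"):
    """Count activity IDs seen immediately before and after a missing activity."""
    before, after = Counter(), Counter()
    for trace in traces:
        for i, activity in enumerate(trace):
            if activity != missing:
                continue
            if i > 0 and trace[i - 1] != missing:
                before[trace[i - 1]] += 1   # evidence for rule 1
            if i + 1 < len(trace) and trace[i + 1] != missing:
                after[trace[i + 1]] += 1    # evidence for rule 2
    return before, after


# Invented sample paths using integer activity IDs.
sample = [[0, "-", 2], [0, "-", 3], [1, "-", 2], [0, 1, 2]]
before, after = neighbour_counts(sample)
print(before.most_common(1), after.most_common(1))
```

An ID that dominates `before` supports rule 1 for that ID, and one that dominates `after` supports rule 2.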

The invention has the following beneficial effects:

The invention finds the trend of missing or inaccurately recorded activity attribute values in the flow log and supports the analysis of why activity attribute values in the flow log are missing or inaccurately recorded.

In the experiment, the classification accuracy is high: with 6042 flow paths in the full data set, of which 3369 contain missing values, only 28 paths are not classified correctly, and the classification accuracy reaches 99.2%. Without this method, a data analyst would have to search the massive historical log data for the places where missing values occur without any assistance, which costs considerable time and effort. The method can greatly improve data analysis efficiency and provides a reference for analyzing the causes of missing log data. The method provided by the invention has good universality, high accuracy, and is easy to understand.
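As a quick check of the reported figure, assuming (the text does not say so explicitly) that accuracy is measured over the 3369 paths containing missing values:

$$\frac{3369 - 28}{3369} = \frac{3341}{3369} \approx 0.9917 \approx 99.2\%$$

Measured over all 6042 paths instead, the same 28 errors would give $6014 / 6042 \approx 99.5\%$, so the reported 99.2% is consistent with the first reading.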

Drawings

FIG. 1 is a flow log fragment.

Fig. 2 is a flowchart of an encoding algorithm.

Fig. 3 is a path coding diagram.

Fig. 4 is an example of a proposed method decision tree.

Fig. 5 is a flow chart of the present invention.

Fig. 6 is a decision tree trained based on real data.

Fig. 7 is a log activity type table.

Detailed Description

The invention is further described below with reference to the drawings and examples.

The flow log is typically recorded in the log information system in the XES (eXtensible Event Stream) data format. XES is an XML-based event log standard with few format restrictions and high extensibility. However, it is difficult to train a decision tree directly on XES data, so the log data is preprocessed first: the flow log data recorded in the business flow information management system is extracted, the log data is converted from the XES format into a CSV format suitable for a machine learning algorithm, and the flow log data is divided into flow activity paths (also called flow activity sequences or flow activity traces) with the flow instance as the unit. For example, in fig. 1, the flow instance with Case ID 364868 is converted into a flow activity path recorded as trace = <A, B, C, D, E>, where A, B, C, D, E are unique identifiers of activity types. If all activity attributes in a flow activity path are complete, the path is complete. If an activity attribute contains a missing or inaccurate value, the activity containing that value is denoted by "-" in the path; for example, when the time attribute value of activity B in this case is missing, the flow activity path is recorded as trace = <A, -, C, D, E>.
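The XES-to-CSV conversion can be sketched with the Python standard library. The patent itself uses a process mining tool for this step, so the code below is only an illustration: the function names, the CSV column names, and the sample log are hypothetical, while `concept:name` and `time:timestamp` are the standard XES attribute keys.

```python
import csv
import io
import xml.etree.ElementTree as ET


def local(tag):
    """Strip an XML namespace prefix such as '{uri}trace' down to 'trace'."""
    return tag.rsplit("}", 1)[-1]


def xes_to_rows(xes_text):
    """Flatten an XES log into one dict per event."""
    rows = []
    for trace in ET.fromstring(xes_text):
        if local(trace.tag) != "trace":
            continue
        case_id = ""
        for child in trace:
            if local(child.tag) == "string" and child.get("key") == "concept:name":
                case_id = child.get("value", "")
        for event in trace:
            if local(event.tag) != "event":
                continue
            attrs = {a.get("key"): a.get("value", "") for a in event}
            rows.append({
                "case_id": case_id,
                "activity": attrs.get("concept:name", ""),
                # A missing timestamp becomes an empty CSV cell.
                "timestamp": attrs.get("time:timestamp", ""),
            })
    return rows


def rows_to_csv(rows):
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=["case_id", "activity", "timestamp"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Invented two-event log: event B has no timestamp attribute.
SAMPLE = """<log>
  <trace>
    <string key="concept:name" value="364868"/>
    <event>
      <string key="concept:name" value="A"/>
      <date key="time:timestamp" value="2012-04-03T09:15:00"/>
    </event>
    <event>
      <string key="concept:name" value="B"/>
    </event>
  </trace>
</log>"""

rows = xes_to_rows(SAMPLE)
csv_text = rows_to_csv(rows)
print(csv_text)
```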

After the flow log data is preprocessed, each flow activity path needs to be encoded and converted into a flow feature vector. The invention encodes the flow activity path using one-hot encoding: each activity of each path in the preprocessed flow log data is first traversed; if an activity attribute value is found to be missing or inaccurately recorded, a prefix variable Vpre is added to the activity ID (the unique identifier of the activity) of the immediately preceding activity, and a suffix variable Vsuf is added to the activity ID of the immediately following activity. Fig. 2 shows the flow of the encoding algorithm.

The algorithm first traverses each activity in each flow activity path and judges whether its attributes are incomplete. If so, the prefix variable and the suffix variable are added to the IDs of the preceding and following activities respectively, and the algorithm then checks whether each resulting feature value already exists in the feature set: if it does, it is not added again; if not, it is stored in the set. If the activity has no incomplete attribute value, the algorithm skips directly to the next activity, until the traversal of the flow activity path is finished.

The prefix variable and the suffix variable are added to the ID of the activity immediately before and after the activity containing the attribute incomplete value respectively, mainly for distinguishing the coded activity from the original activity in the flow characteristic vector, and the prefix variable and the suffix variable have the following calculation formulas:

$$V_{suf} = \sum Type_{activity} + 1$$

$$V_{pre} = V_{suf} \times 2$$

where Type_{activity} denotes an activity type in the overall flow log, so the sum is the total number N of activity types in the flow log data. N is taken as the base variable: the suffix variable is the base variable plus 1 (if the activity type codes start from 0, the base variable can be used directly as the suffix variable), and the prefix variable is 2 times the suffix variable. Both variables are positive integers, and adding them to an activity ID distinguishes the encoded activity from the original activity. If a flow activity path contains an activity with an incomplete value, the label of the corresponding feature vector is set to 1; otherwise it is set to 0. For example, as shown in fig. 3, the activity type IDs are encoded starting from 0 and there are 13 activity types in total, so the suffix variable is 13 and the prefix variable is 26. For a flow activity path that contains an activity with an incomplete attribute value, the corresponding feature values in the flow feature vector are 1 and the label value is 1; for a path that contains no such activity, the feature values are 0 and the label value is 0.
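The encoding procedure described above can be sketched as follows. This is one reading of the text rather than the patent's code: the function name `encode_traces` is invented, activity IDs are assumed to be integers starting from 0, and a neighbour that is itself marked "-" is simply skipped, which is an assumption the source does not address.

```python
def encode_traces(traces, v_suf, v_pre, missing="-"):
    """Turn flow activity paths into one-hot feature vectors and labels.

    Returns (feature_names, vectors, labels). A feature is an offset
    activity ID: preceding ID + v_pre or following ID + v_suf.
    """
    per_trace = []
    feature_set = set()  # each feature value is stored only once
    for trace in traces:
        present = set()
        for i, activity in enumerate(trace):
            if activity != missing:
                continue  # complete activity: move on to the next one
            if i > 0 and trace[i - 1] != missing:
                present.add(trace[i - 1] + v_pre)
            if i + 1 < len(trace) and trace[i + 1] != missing:
                present.add(trace[i + 1] + v_suf)
        feature_set |= present
        per_trace.append(present)

    names = sorted(feature_set)
    vectors = [[1 if f in present else 0 for f in names] for present in per_trace]
    labels = [1 if missing in trace else 0 for trace in traces]
    return names, vectors, labels


# 13 activity types coded from 0, so v_suf = 13 and v_pre = 26 as in the text.
paths = [[0, "-", 2, 3], [0, 1, 2, 3], [4, "-", 0]]
names, vectors, labels = encode_traces(paths, v_suf=13, v_pre=26)
print(names)
print(vectors)
print(labels)
```

Feature 26 here means "activity 0 immediately precedes an incomplete activity", and feature 15 means "activity 2 immediately follows one".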

After the flow activity paths are converted into feature vectors by the encoding algorithm, the flow feature vectors are classified with a classification and regression decision tree (CART) from machine learning. A CART is a binary tree that repeatedly divides the data into two parts according to the features. The invention trains a CART classification tree in which the leaf nodes represent the number of flow activity paths containing attribute incomplete values, and the non-leaf nodes represent information about the activities before and after the activity containing the incomplete attribute value. Fig. 4 shows an example of a constructed decision tree: "26.0 <= 0.5" in the root node is the decision condition of the path; when the feature 26.0 is 0, the decision tree branches to the left, and when the feature is 1, it branches to the right. The value 0.5 is chosen because it follows the convention of the scikit-learn machine learning library; other values may be chosen as long as they distinguish 0 from 1. The samples value in the root node is the number of flow activity paths (6042), of which 2350 contain no incomplete values and 3692 contain attribute incomplete values.

The left branch of the root node represents the flow activity paths without incomplete values, and the right branch represents the flow activity paths with missing values. In this example, the incomplete rule found for the log attribute values is that in most (3581) of the flow paths containing incomplete activity attribute values, an activity with ID 0 is executed before the activity containing the incomplete value. The classification of the decision tree further shows that in the other flow paths containing incomplete values, an activity with ID 0 is always executed after the activity containing the incomplete value. From these two rules generated by the decision tree, a common rule for the incomplete attribute values of log activities can be extracted from the sample log data: an activity with ID 0 is always executed before or after the activity containing the incomplete attribute value.
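To make the root-node condition concrete, the sketch below picks a single split of the form "feature <= 0.5" over 0/1 features. It is a stand-in for one step of CART, not the patent's training code: it uses Gini impurity (the default criterion of scikit-learn's DecisionTreeClassifier; the source does not say which criterion was used), and the function names and the tiny data set are invented.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - p1 * p1 - (1.0 - p1) ** 2


def best_binary_split(vectors, labels, names, threshold=0.5):
    """Return (score, feature_name, left_labels, right_labels) for the best split."""
    best = None
    n = len(labels)
    for j, name in enumerate(names):
        left = [y for x, y in zip(vectors, labels) if x[j] <= threshold]
        right = [y for x, y in zip(vectors, labels) if x[j] > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if best is None or score < best[0]:
            best = (score, name, left, right)
    return best


# Invented data: feature 26 (activity 0 before a missing value) separates
# the classes perfectly, feature 15 does not.
names = [15, 26]
vectors = [[0, 1], [0, 1], [1, 1], [0, 0], [0, 0]]
labels = [1, 1, 1, 0, 0]

score, name, left, right = best_binary_split(vectors, labels, names)
print(f"{name} <= 0.5  samples={len(labels)}  "
      f"value=[{labels.count(0)}, {labels.count(1)}]")
print("left:", left, "right:", right)
```

Because every feature is 0 or 1, any threshold strictly between the two values gives the same partition, which is why 0.5 carries no special meaning beyond convention.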

The above is the research idea of the invention; the specific research flow is shown in fig. 5. The validity of the method is verified below on a real log data set. The experimental data set is the log of an information management system released by Volvo IT Belgium, which was the subject of the 2013 Business Process Intelligence Challenge; the recorded workflow covers the process from the occurrence of a system fault to the restoration of normal operation. The log data contains 6042 flow instances and 13 different types of activities in total, each with a unique activity ID and activity name, as shown in fig. 7. The activity ID is later used mainly as an element of the activity sequence. All experiments of the invention were run on a machine with the Windows 10 Professional operating system, an Intel i7-7700 3.60 GHz CPU, and 16.0 GB of memory.

Example 1:

First, the data format is converted with a process mining tool, each flow instance is converted into an activity sequence, and the converted activity sequences are one-hot encoded into feature vectors. Because the data in the log information system is complete and contains no incomplete values, the experiment randomly deletes specified activity attribute values to simulate the imperfect log records caused by system faults or human error. The method is effective for both missing and imprecise log attribute values; missing attribute values are taken as the example below. When the activity ID in a flow instance is 5, 6, 7, 9, or 10, a random number is generated, and if the random number is greater than 45, the timestamp value in the activity attributes is deleted. After the experimental data is processed, the feature vectors are input to the machine learning algorithm, and the trained decision tree is shown in fig. 6. Analysis of the experimental result shows that, at the root node, 3369 flow paths in the log data contain attribute missing values and 2673 flow paths contain no missing values. One common missing rule is that activities with ID 0 always occur before activities containing missing values, and some activities (48) containing attribute missing values are executed after activities with ID 0; further, an activity with activity ID 2 is always executed after an activity containing a missing value.
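The random deletion used to simulate incomplete logs can be sketched as below. The source gives the target activity IDs and the threshold 45 but not the range of the random number, so the 0 to 100 range is an assumption, as are the function name `delete_timestamps`, the event dictionary keys, and the placeholder timestamp.

```python
import random

TARGET_IDS = {5, 6, 7, 9, 10}  # activity IDs named in the experiment


def delete_timestamps(events, rng, threshold=45, upper=100):
    """Return a copy of `events` with some target timestamps blanked.

    For each event whose activity ID is in TARGET_IDS, draw an integer in
    [0, upper] (assumed range) and blank the timestamp if it exceeds
    `threshold`. The input list is left unchanged.
    """
    result = []
    for event in events:
        event = dict(event)
        if event["activity_id"] in TARGET_IDS and rng.randint(0, upper) > threshold:
            event["timestamp"] = ""
        result.append(event)
    return result


rng = random.Random(0)  # fixed seed so the simulation is repeatable
events = [{"activity_id": i % 13, "timestamp": "t"} for i in range(130)]
noisy = delete_timestamps(events, rng)
deleted = sum(1 for e in noisy if e["timestamp"] == "")
print(f"{deleted} of {len(noisy)} timestamps deleted")
```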

Claims

1. A method for extracting incomplete rules of activity attributes of a process log based on machine learning is characterized by comprising the following steps:

step 1, preprocessing the log data: first extracting the flow log data recorded in a business flow information management system, converting the log data from the XES format into a CSV format suitable for a machine learning algorithm, and dividing the flow log data into flow activity paths with the flow instance as the unit;

step 2, after the flow log data is preprocessed, encoding each flow activity path and converting the flow activity paths into flow feature vectors;

step 3, classifying the flow feature vectors with a classification and regression decision tree from machine learning to construct a classification decision tree;

the step 1 is specifically realized as follows:

taking the Case ID 364868 as an example, converting the flow instance with the Case ID 364868 into a flow activity path and recording the flow activity path as trace = <A, B, C, D, E>, wherein A, B, C, D, E are all unique identifiers of activity types; if the activity attributes in the flow activity path are complete, the flow activity path is also complete; if an activity attribute in the flow activity path contains a missing or inaccurate value, the activity containing the missing or inaccurate attribute value is denoted by "-" in the flow activity path; and if the time attribute value of activity B in the flow activity path trace is missing, the flow activity path is recorded as trace = <A, -, C, D, E>;

the step 2 is specifically realized as follows:

encoding the flow activity path using one-hot encoding: first traversing each activity of each path in the preprocessed flow log data; if an activity attribute value is found to be missing or inaccurately recorded, adding a prefix variable Vpre to the ID of the activity immediately preceding that activity and adding a suffix variable Vsuf to the ID of the activity immediately following it, so that the encoded activities are distinguished from the original activities in the flow feature vector;

the step 3 is specifically realized as follows:

2. The method for extracting incomplete rules of activity attributes of a process log based on machine learning according to claim 1, wherein a prefix variable and a suffix variable are calculated according to the following formula:

$$V_{suf} = \sum Type_{activity} + 1$$

$$V_{pre} = V_{suf} \times 2$$

wherein Type_{activity} denotes an activity type in the whole flow; the total number N of activity types in the flow log data is taken as a base variable, the suffix variable is the base variable plus 1, and the prefix variable is 2 times the suffix variable.

3. The method for extracting incomplete rules of activity attributes of a process log based on machine learning according to claim 1 or 2, wherein if one process activity path contains incomplete value activity, the feature vector label value corresponding to the process activity path is set to 1, otherwise, set to 0.

4. The method for extracting incomplete rules of activity attributes of a process log based on machine learning according to claim 3, wherein the value of the threshold Q is 0.5, chosen in accordance with the convention of the scikit-learn machine learning library.

5. The method for extracting incomplete rules of activity attributes of a process log based on machine learning according to claim 4, wherein the analysis rules for the process activity path containing incomplete values are as follows:

rule 2: the flow path containing the incomplete value of the activity attribute has the activity execution of the same ID after the activity containing the incomplete value.