CN115374595A - Automatic software process modeling method and system based on process mining - Google Patents


Info

Publication number
CN115374595A
CN115374595A (application CN202210745436.8A)
Authority
CN
China
Prior art keywords: model, event, data, log, mining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210745436.8A
Other languages
Chinese (zh)
Inventor
张贺
刘博涵
王佳辉
宋凯文
荣国平
周鑫
邵栋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Nanjing University
Priority to CN202210745436.8A
Publication of CN115374595A
Legal status: Pending

Classifications

    • G06F 30/20: Computer-aided design [CAD]; design optimisation, verification or simulation
    • G06F 16/215: Improving data quality; data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G06F 16/2462: Approximate or statistical queries
    • G06F 16/2465: Query processing support for facilitating data mining operations in structured databases
    • G06F 16/254: Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
    • G06F 16/285: Clustering or classification


Abstract

The invention provides an automated software process modeling method and system based on process mining, comprising the following steps: acquiring log resources from software repositories, public data sets, and web crawlers; preprocessing the logs gathered from these different channels, including data extraction, cleaning, integration, conversion, and the separation of events of different granularities; applying different process discovery algorithms to the log to construct a process model; replaying the log on the model to perform a consistency check and compare the differences between the model and the log; evaluating the model's performance in terms of the degree of fit, accuracy, generalization, and simplicity between model and log; and analyzing process parameters of the event log, presenting detailed process information through frequency statistics and process clustering. By modeling the software process through process mining, the method and system help process managers understand the software process comprehensively and uncover problems within it, enabling further improvement and greater efficiency of the software process.

Description

Automatic software process modeling method and system based on process mining
Technical Field
The invention belongs to the technical field of process mining and software processes, and particularly relates to an automatic software process modeling method and system based on process mining.
Background
The software process model is a static model describing the relationships among the personnel, activities, resources, and products involved in software development. It can qualitatively reconstruct the real process of a target, serves as the construction basis of the dynamic simulation model, and provides a reference frame and constraints for building the final simulation model. A deep understanding of what each stage of the software process means not only helps researchers and process managers understand the process as a whole, but also facilitates comparing and improving upon the problems found in the process, realizing efficiency gains and optimization of the software process. With the popularity of DevOps, the software lifecycle has been further extended and the connections between development activities have become tighter; in addition, the wide adoption of automated tools supporting development provides the basic conditions for software process modeling at finer granularity and with higher fidelity.
While the popularity of DevOps and the widespread use of automated tools provide rich research cases and data support for software process modeling, they also pose new challenges for modeling technology, mainly: (1) data records are inconsistent with modeling targets, because the data recorded in software repositories mainly capture attribute information of products rather than behavioral information of processes, so process-describing data cannot be obtained directly; (2) data quality is poor and validity insufficient, meaning the records contain many inconsistent, missing, and unintelligible information items; and (3) software process model construction is inefficient and low in fidelity. Building existing process models depends largely on the modeler's experience and understanding of the business process, but such a cognition-driven modeling process is inefficient and highly subjective, and it is difficult to restore the original appearance of the process objectively and concretely.
Disclosure of Invention
To solve these problems and challenges in software process modeling, the invention provides an automated software process modeling method and system based on process mining, which help modelers perform data processing and process modeling. The invention extracts, cleans, integrates, and converts data of different sources, contents, and formats to generate a process event log that meets the requirements, solving the problems that data records are inconsistent with modeling targets, data quality is low, and data content and format struggle to satisfy process modeling requirements. It restores and evaluates an objective, concrete process model by integrating multiple process discovery and model evaluation algorithms, addressing low model construction efficiency, poor fidelity, and strong subjectivity. In addition, the invention analyzes process parameters of the event log, such as parameter statistics and clustering, so that modelers can understand the specific process of the target object more intuitively.
In order to achieve the purpose, the technical scheme of the invention is as follows: the method for modeling the automatic software process based on process mining is provided, and the implementation scheme comprises the following steps:
S1: data acquisition, namely acquiring log data and statistical data sets of various contents and formats from software repositories, public data sets, and other process data sources related to the software process;
S2: data preprocessing, namely performing data extraction, cleaning, integration, and conversion on the different log and statistical data and separating events of different granularities to obtain a standard event log;
S3: process discovery, namely applying a process discovery algorithm to the event log to construct a process model;
S4: consistency checking, namely replaying the event log on the model and comparing the differences between the model and the log through a consistency checking algorithm;
S5: model evaluation, namely evaluating the performance of the model from different dimensions so as to select the most appropriate process model;
S6: process parameter analysis, namely performing statistical mining on the event log and displaying more detailed process information through process clustering, frequency statistics, and time statistics.
Preferably, step S2 includes: unifying the log formats of the different data sources into CSV; using the object identifier as a clue to locate corresponding event information in the logs of different data sources and merging the event information obtained; filling in timestamps missing from steps in the event log; cleaning abnormal noise data in the event log, deleting a case if its log records are inconsistent but retaining an event if only the event itself is problematic; and separating the event log by granularity into versions of different granularities.
In particular, process mining starts from a standard event log. Each event in the log represents an independent activity or step in the process and is associated with a specific case, which refers to one complete process in the record; the events within a case are ordered in time, so an event can be regarded as one step of the case it belongs to. Besides the event itself, the event log may store other relevant information, such as the resource (person or equipment) that performed the event, the timestamp of its occurrence, and other data elements recorded with it. Most process mining techniques prescribe the storage format of standard event logs explicitly; logs are typically stored in the XES or CSV format.
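The case/event structure described above can be sketched as a minimal in-memory event log. This is an illustrative data model, not the patent's implementation; the class and field names are assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Event:
    activity: str        # the step performed, e.g. "commit" or "script"
    timestamp: datetime  # when the event occurred
    resource: str = ""   # optional: person or tool that performed it

@dataclass
class Case:
    case_id: str                              # one complete process instance, e.g. one CI build
    events: list = field(default_factory=list)

    def trace(self):
        """Return the case's activity sequence ordered by time."""
        return [e.activity for e in sorted(self.events, key=lambda e: e.timestamp)]

# events may arrive out of order; the trace is recovered from timestamps
build = Case("build-1")
build.events += [
    Event("push", datetime(2019, 12, 1, 10, 5)),
    Event("commit", datetime(2019, 12, 1, 10, 0)),
    Event("script", datetime(2019, 12, 1, 10, 10)),
]
print(build.trace())  # prints ['commit', 'push', 'script']
```

Each `Case` corresponds to one row group of the CSV/XES log; serializing `Case` objects to those formats is a separate step.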
Preferably, the S3 step includes: building a process model by taking a standard event log as an input of a process discovery algorithm; the process model is presented in conjunction with different model languages.
Preferably, step S4 replays the log on the model and finds the differences between the event log and the process model through a consistency check. These differences serve three main purposes: first, the modeler can use them to repair the model, making the process model more comprehensive; second, the modeler can use them to evaluate process discovery algorithms, since the smaller the difference between model and log, the higher the algorithm's accuracy; third, the consistency check is a bridge between the event log and the process model, and through it the business process can be understood further, which facilitates improving the software process.
Preferably, the step S5 evaluates the model using the degree of fit, accuracy, generalization degree, and simplicity. The model evaluation method refers to a series of methods for comparing the behavior in the log with the behavior in the model in order to check whether the event log and the mined process model match. By evaluating the model, the modeler can verify whether the model accurately reflects the real data, thereby further verifying the effectiveness of the process discovery algorithm.
Preferably, step S6 includes: counting the occurrence frequency and duration of events in the process, so that the modeler steps outside the single-process view and understands the business from the event view; and performing clustering analysis to cluster the different process modes within the same business process, so that the modeler can grasp all possible process modes more quickly.
Specifically, the invention can find all process modes in a project and count the occurrence frequency of each. Process modes with higher occurrence frequency can be taken to represent the conventional business process. Process modes with low occurrence frequency have several possible causes, such as abnormal processes or logging errors; presenting these low-frequency modes to software process stakeholders helps them locate anomalies in the business process and improve it.
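Counting process modes amounts to counting trace variants, i.e. distinct activity sequences. A minimal sketch of this idea (function name and sample traces are illustrative):

```python
from collections import Counter

def variant_frequencies(traces):
    """Count how often each process mode (trace variant) occurs.

    traces: iterable of activity-name lists, one per case.
    Returns a Counter keyed by the variant as a tuple.
    """
    return Counter(tuple(t) for t in traces)

traces = [
    ["commit", "push", "install", "script", "passed"],
    ["commit", "push", "install", "script", "passed"],
    ["commit", "skipped"],
]
freq = variant_frequencies(traces)
# the most frequent variant represents the conventional flow;
# rare variants are candidates for anomaly inspection
common = freq.most_common(1)[0]
```

Sorting `freq` ascending surfaces the low-frequency modes that the text suggests presenting to process stakeholders.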
The invention also provides an automatic software process modeling system based on process mining, which comprises a data preprocessing module, a process mining module and a process parameter analysis module;
the data preprocessing module extracts, cleans, integrates and converts data from different data sources and in different contents and formats to generate a process event log meeting requirements;
the process mining module integrates a process discovery algorithm and a model evaluation algorithm, restores an objective specific process model and evaluates the objective specific process model by using a model evaluation method;
the process parameter analysis module provides a series of process parameter analysis functions such as parameter statistics and clustering, and brings convenience to modelers to know the specific process of the target object more intuitively.
Preferably, the data preprocessing module adopts an automatic mode to preprocess the software process data, and the module mainly comprises:
the data preprocessing function: performing secondary processing on all logs, including definition and extraction of effective data, cleaning of noise data and normalization and integration of data;
data file format conversion function: the event log is converted into a unified CSV or XES format to meet the requirements of the algorithm input.
Preferably, the process mining module automatically mines the software process by using a series of technologies of process mining, and the module mainly comprises:
the process discovery function: based on the event log, restoring a real software process through a process discovery algorithm, and finally outputting a discovery result in a mode of a model graph;
the consistency checking function: checking for discrepancies between the event log and the process model by replaying the log on the model;
a model evaluation function: and evaluating the reduced process model, and evaluating the performance of the model from four dimensions of fitness, accuracy, generalization capability and simplicity.
Preferably, the process parameter analysis module mainly includes process pattern clustering and event frequency statistical functions:
the process pattern clustering function: clustering different process modes in the software process, and counting the occurrence frequency of the different process modes;
event frequency statistics function: all events in the software process are described statistically from the angle of occurrence frequency and duration, and the statistical result is visualized.
The invention has the beneficial effects that:
the invention designs an automatic software process modeling method and system based on process mining, which are used for helping modelers to perform data processing and process modeling. Aiming at the problems of inconsistent data record and modeling target and insufficient data quality and effectiveness in software process modeling, the invention can realize the extraction, cleaning, integration and conversion of data with different data sources and different contents and formats and then generate a uniform process event log. Aiming at the problems of low software process model construction efficiency and poor fidelity, the invention can restore the objective specific process model by using various process discovery algorithms based on the event log and evaluate the process model by using various model evaluation methods. In addition, the invention also carries out a series of process parameter analysis such as parameter statistics, clustering and the like on the event log, thereby facilitating a modeler to know the specific process of the target object more intuitively.
Drawings
FIG. 1 is a modeling method and its relationships in an embodiment of the invention;
FIG. 2 is a modeling framework flow in an embodiment of the invention;
FIG. 3 is a CI construction lifecycle of an embodiment of the present invention;
FIG. 4 is a commit log fragment of an item in an embodiment of the invention;
FIG. 5 is a segment of a build record for a project in an embodiment of the present invention;
FIG. 6 is the content of a Job file record in the embodiment of the present invention;
FIG. 7 is a CI process event log example in an embodiment of the invention;
FIG. 8 is a process model represented by BPMN for the "android" project in the embodiment of the present invention;
FIG. 9 is a process model represented by BPMN for the "sonarqube" project in an embodiment of the present invention;
FIG. 10 is the result of the consistency check of the "android" project using token replay in the embodiment of the present invention;
FIG. 11 is the result of the consistency check of the "graylog2-server" project using alignments in the embodiment of the present invention;
FIG. 12 is the statistics of event frequency for the "metasploit-framework" project in the embodiment of the present invention;
FIG. 13 is the statistics of event duration for the "metasploit-framework" project in the embodiment of the present invention;
FIG. 14 is a schematic structural diagram of an automated software process modeling system based on process mining according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention.
Example 1:
FIG. 1 shows the modules and their relationships involved in the automated software process modeling method based on process mining, and FIG. 2 shows the modeling framework flow of the method according to this embodiment. In this embodiment, a CI process is selected as the object of software process modeling and analysis; the CI process refers to the entire flow from the moment a developer submits a change until the build process ends, and FIG. 2 depicts all the stages involved in a complete CI process.
In this embodiment, Travis CI, an open-source distributed CI tool, is selected as one of the main sources of study objects and data acquisition. Travis CI provides a customizable build lifecycle, and Table 1 lists all the actions available for customization and their meanings. In Travis CI, each development language has a set of default build configurations, usually organized into two major phases, "install" and "script" (execute script), plus an optional "deployment" phase. The "install" phase mainly performs two actions, cloning the remote repository and performing installation; the "script" phase mainly performs two actions, building the code and running tests; the "deployment" phase mainly packages the software and deploys it on a specified persistent deployment platform.
Table 1 custom configuration build commands provided by Travis CI
[Table 1 was rendered as an image in the original document.]
Step 1: log data is obtained from a variety of data sources.
Selecting the target projects of this embodiment: TravisTorrent is a public database of project build information; all projects in it use GitHub as the version control tool and are continuously integrated through Travis CI. The data set includes CI attribute data (e.g., build number and build result), analytical data about builds (e.g., tests run, tests failed), and commit data from GitHub (e.g., the time difference between a code push and the code build). This embodiment screens five projects from the data set, each built more than 4,000 times (as of December 2019). The names of these five projects and their repository links are shown in Table 2.
Table 2 example selected items and item repository links
Project name                     Repository link
owncloud/android                 https://github.com/owncloud/android
rubinius/rubinius                https://github.com/rubinius/rubinius
rapid7/metasploit-framework      https://github.com/rapid7/metasploit-framework
Graylog2/graylog2-server         https://github.com/Graylog2/graylog2-server
SonarSource/sonarqube            https://github.com/SonarSource/sonarqube
Commit logs of the projects: the commit information of a project is primarily recorded in the Git version control system the project uses. The commit log includes: commit_id (commit number), author (code author), committer (submitter), date (commit time), and message (commit message). The log range extends through December 2019. FIG. 3 shows the processed commit log of one of the acquired projects;
construction dataset of project: in this embodiment, all the construction data is firstly exported in terms of items, then aggregation is performed according to the "tr _ built _ id" field, information belonging to the same construction is aggregated together, and finally, a construction process log of an item is integrated and exported as a CSV file. The process information of each build is recorded in units of builds in the file. FIG. 4 is a construction record information corresponding to an item;
each constructed Job log: according to the project construction data set, each construction corresponds to a 'tr _ Job _ id' field, and all Job files, namely 'Job _ id.txt', are crawled under an API given by Travis CI through the Job number. FIG. 5 is a build Job Log in a project.
Step 2: and based on the acquired original data, performing information extraction, cleaning, integration and conversion on the original data to generate a standard event log.
Log data obtained from the data sources still cannot meet the data requirements of process mining: essentially all data sources suffer from scattered effective information, inconsistent data formats, and incomplete event data. In this embodiment, the effective information required in the event log is first determined; the information is then extracted and the extracted data merged; after a series of cleaning operations, the standard event log is finally output:
(1) Using the object identifier as a clue, locate the corresponding event information in the logs of the different data sources and merge the event information obtained;
This embodiment treats each build as one case and each action performed within the build lifecycle as one event. The required data comprise four elements: the build ID, all events under a specific build, the timestamp of each event, and the build result;
The corresponding build data are found by the commit ID. The action type that triggered the build is judged from the "gh_is_pr" field of the record: if the value is "1", the build was triggered by a pull request and the "gh_pr_created_at" field is taken as the event timestamp; if the value is "0", the build was triggered by a push action and the timestamp is taken from the "gh_pushed_at" field;
The Job ID corresponding to the build is extracted from the build record file and used to locate the Job in the project's Job log set. The build process data are then obtained by parsing the Job file. After the build process finishes, an event marking the end of the build must be appended; there are four end events, "passed", "failed", "errored", and "canceled", and the timestamp of the end event equals the timestamp of the last action in the build process.
(2) Completing the missing timestamps in the step;
since the data size with such problems is small, the present embodiment adopts a manual padding mode for processing. For example, if the event "install" lacks timestamp information, the previous step of the event (e.g., the end time of "push") can be used as the start and end time of the step to identify that the step "install" occurs after the step "push".
(3) Processing the noise data;
there is some data in the log that is abnormal at some time, such as the time of step "push" is before the time of step "commit". The data volume of such abnormal data is small, but since the cause of the abnormality is not known, no deletion operation is taken, and the part of the data is temporarily reserved for model calculation.
(4) And separating the event logs according to different granularities, and separating the event logs into versions with different granularities.
Events extracted from the Job log have inconsistent granularity; for example, an "install" step may be divided in the Job log into several finer-grained stages such as "install.1" and "install.2". In this embodiment, the n fine-grained steps "install.1", "install.2", ..., "install.n" are merged into a single step "install", with the start time of the first sub-step and the end time of the last sub-step taken as the merged step's start and end times.
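The merging rule for numbered sub-steps can be sketched as follows; the `.N` suffix convention is taken from the description, and the tuple representation is an assumption:

```python
import re

def merge_substeps(events):
    """Merge fine-grained steps like install.1, install.2, ... into one
    'install' step spanning from the first sub-step's start to the last
    sub-step's end.

    events: list of (activity, start, end) tuples in chronological order.
    """
    merged = []
    for activity, start, end in events:
        base = re.sub(r"\.\d+$", "", activity)   # "install.3" -> "install"
        if merged and merged[-1][0] == base and base != activity:
            # consecutive sub-step of the same base activity: extend the span
            prev = merged[-1]
            merged[-1] = (base, prev[1], end)
        else:
            merged.append((base, start, end))
    return merged
```

For example, `[("push", 0, 1), ("install.1", 1, 2), ("install.2", 2, 3), ("script", 3, 4)]` merges to `[("push", 0, 1), ("install", 1, 3), ("script", 3, 4)]`.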
Data preprocessing yields a standard event log for the CI process. The log set comprises five CSV files, one per project. In each log, one build is a case and the steps of the build process are events. The event log includes five fields: Build ID (Job ID), Activity, Start Timestamp, Complete Timestamp, and Build Status. FIG. 6 shows a CI event log segment for one of the projects.
And step 3: and based on the event log, using different algorithms to restore the process model, and completing the process discovery of the continuous integrated process.
The alpha algorithm, the Heuristic algorithm, and the Directly-Follows graph algorithm are applied to the standard event logs of the five projects of this embodiment to obtain process models.
The principle of the alpha algorithm is to scan for specific patterns in the event log: for example, if activity a is followed by activity b but b is never followed by a, a causal relationship between a and b can be assumed. To reflect this dependency, the corresponding Petri net model contains a place connecting activity a and activity b. All possible relationships between activities a and b can then be expressed with the following notation:
a > b (directly follows): there exists a trace σ = ⟨t_1, t_2, t_3, …, t_n⟩ in the log and an index i ∈ {1, …, n−1} such that t_i = a and t_{i+1} = b.
a → b (causality): holds if and only if a > b and not b > a.
a # b (choice): holds if and only if neither a > b nor b > a.
a || b (parallel): holds if and only if both a > b and b > a.
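These ordering relations can be computed directly from the traces. A minimal sketch of the relation-extraction step of the alpha algorithm (not the full net construction):

```python
from itertools import product

def alpha_relations(traces):
    """Compute the alpha-algorithm ordering relations from a set of traces.

    Returns (follows, causal, choice, parallel) as sets of activity pairs:
    follows is the directly-follows relation >, causal is ->, choice is #,
    parallel is ||.
    """
    activities = {a for t in traces for a in t}
    follows = set()
    for trace in traces:
        for i in range(len(trace) - 1):
            follows.add((trace[i], trace[i + 1]))

    causal, choice, parallel = set(), set(), set()
    for a, b in product(activities, repeat=2):
        ab, ba = (a, b) in follows, (b, a) in follows
        if ab and not ba:
            causal.add((a, b))      # a -> b
        elif not ab and not ba:
            choice.add((a, b))      # a # b
        elif ab and ba:
            parallel.add((a, b))    # a || b
    return follows, causal, choice, parallel
```

For the log {⟨a, b, c⟩, ⟨a, c, b⟩}, this yields a → b and a → c, with b || c, matching the definitions above.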
Heuristic algorithm: also called heuristic mining, the algorithm first constructs a dependency graph based on the frequency of activities and of the directly-follows relationships between activities, adding a dependency to the graph when it exceeds a set threshold. The dependency graph shows the backbone of the process model, which is then used to discover more detailed node behavior, namely splits and joins. On the one hand, if an activity has multiple input edges, the heuristic mining algorithm analyzes the event log to infer the type of the join: "AND-join", "XOR-join", or "OR-join"; in the AND and OR cases the model captures concurrent behavior. On the other hand, if an activity has multiple output edges, the split behavior is learned by the model in the same way;
Directly-Follows graph algorithm: in a Directly-Follows graph, events or activities are represented by nodes, and if at least one path exists between two nodes, it is represented by a directed edge. Unlike other algorithms, metrics can easily be attached to this directed edge, such as frequency (the total number of executions of the path the edge represents) and performance (aggregates of parameters such as the average time consumed).
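A frequency-annotated Directly-Follows graph reduces to counting adjacent activity pairs. A minimal sketch (sample traces are illustrative):

```python
from collections import Counter

def directly_follows_graph(traces):
    """Build a Directly-Follows Graph from traces.

    Nodes are activities; each directed edge (a, b) is annotated with its
    frequency, the total number of executions of the path it represents.
    """
    edges = Counter()
    for trace in traces:
        for a, b in zip(trace, trace[1:]):
            edges[(a, b)] += 1
    return edges

traces = [
    ["commit", "push", "script", "passed"],
    ["commit", "push", "script", "passed"],
    ["commit", "skipped"],
]
dfg = directly_follows_graph(traces)
```

The edge weights correspond to the numbers drawn on the model's arrows; rendering (e.g. arrow thickness proportional to frequency) is a separate visualization step.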
FIG. 7 shows the model obtained by applying the Heuristic algorithm to process discovery for the "android" project; the model is output in the BPMN language. In the figure, two ellipses represent the beginning and end of the process, and the rectangular boxes represent events in the process. The one-way arrows between elements represent the order in which events occur. Each rectangular box contains the event's name and frequency; for example, the "commit" event occurs 3704 times within the log's time range. The number on a one-way arrow indicates the frequency of that transition, and the thickness of the arrow reflects it: in general, the thicker the arrow, the more frequently the transition occurs. For example, the transition from the commit event to the skipped event occurs 765 times within the log. Looking deeper into the model, several distinct process modes can be found, such as a frequently occurring flow: "begin - commit - push - git.checkout - git.submodule - android.install - script - passed - end". Its meaning: the process starts with a code change being committed (commit), which is then pushed (push) to the remote repository; the CI server senses this and performs a series of repository clone operations, where "git.checkout" and "git.submodule" represent the pre-clone check and the cloning of the project's submodules, respectively; the required dependencies are then downloaded on the server (android.install).
The above describes one of the processes observed from the model. Besides this fairly conventional process, some repetitive and cyclic operations can also be found. For example, the event "commit" repeats 302 times, from which it can be inferred that not every commit triggers a build; there may be cases where multiple commits trigger a build together, which is reflected in the event log as the event "commit" repeating several times before the process proceeds to the next step. It can also be seen that there are many loops between the events "commit" and "skipped", which may be because multiple commits in a row are determined to be skipped directly.
FIG. 8 shows the model calculated by applying the Directly-Follows graph algorithm to the "sonarqube" project. This process includes more build result types, such as "failed" and "errored". It can be seen that essentially all "errored" results occur after the "git.checkout" step, from which it can be inferred that an error occurred while the CI server was cloning the repository. Starting from such abnormal phenomena, software process personnel can dig deeper to find the underlying cause of the problem.
Step 4: perform consistency checking between the log and the model, based on the event log and the recovered process model.
The present embodiment performs consistency checking on the five projects; the algorithms used are token replay and alignments.
Token replay: the method starts from the initial marking and matches the trace of each case in the log against the Petri net, so as to find out which transitions are executed in the target process and where tokens remain or are missing. During the calculation, four variables are maintained for each case: the produced tokens (p), the remaining tokens (r), the missing tokens (m), and the consumed tokens (c). Based on these four variables, plus a Petri net (n) and a corresponding event trace (t), the token replay method calculates fitness according to the following formula:
fitness(n, t) = ½ · (1 − m/c) + ½ · (1 − r/p)
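Assuming the standard token-based replay fitness formula (half the score penalizes missing tokens relative to consumed ones, the other half penalizes remaining tokens relative to produced ones), the computation can be sketched as follows; the counter values in the example are invented for illustration:

```python
def token_replay_fitness(p, c, m, r):
    """Token-based replay fitness from produced (p), consumed (c),
    missing (m) and remaining (r) token counts of one replayed case."""
    return 0.5 * (1 - m / c) + 0.5 * (1 - r / p)

# Perfectly replayable case: no missing and no remaining tokens.
print(token_replay_fitness(p=10, c=10, m=0, r=0))  # 1.0
# A deviating case: 2 missing and 1 remaining token.
print(token_replay_fitness(p=10, c=10, m=2, r=1))  # 0.85
```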
Alignments: for each case, the alignment method outputs a list of pairs, each consisting of two elements: one is an event in the log or the symbol ">>", which indicates that the event is not present in the log; the other is a transition in the model or the symbol ">>", which indicates that the model did not capture the event. Each pair falls into one of three categories:
Synchronous move: this type indicates that the event in the case corresponds to a transition in the model. In this situation, the case evolves in the same way as the model.
Log move: in this type, the second value of the pair is ">>", meaning that a step present in the log cannot be captured and reproduced by the process model. This represents a mismatch, i.e. a deviation, between the log and the model.
Model move: in this type, the first value of the pair is ">>", which indicates that a step embodied in the model is not present in the log. Model moves are further divided into two kinds. One is a model move on a hidden transition; although it is not a synchronous move, it is still considered fitting. The other is a model move on a non-hidden transition, in which case the move is considered non-fitting, and there is a deviation between the log and the model.
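The three move categories above can be sketched as a small classifier over alignment pairs (a toy encoding for illustration; the hidden-transition name "tau_1" and the example alignment are hypothetical):

```python
SKIP = ">>"  # the alignment skip symbol described above

def classify_moves(alignment, hidden=frozenset()):
    """Classify each (log, model) pair of an alignment.

    `hidden` names the model's hidden (tau) transitions; model moves on
    them still count as fitting, as described above.
    """
    kinds = []
    for log_part, model_part in alignment:
        if log_part != SKIP and model_part != SKIP:
            kinds.append("sync")          # event matches a transition
        elif model_part == SKIP:
            kinds.append("log-only")      # deviation: event not in model
        elif model_part in hidden:
            kinds.append("model-hidden")  # fitting despite no log event
        else:
            kinds.append("model-only")    # deviation: unmatched transition
    return kinds

alignment = [("commit", "commit"), ("push", ">>"),
             (">>", "tau_1"), (">>", "script")]
print(classify_moves(alignment, hidden={"tau_1"}))
# ['sync', 'log-only', 'model-hidden', 'model-only']
```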
Fig. 9 lists part of the results calculated by applying the token replay method to the "android" project, where each piece of data is the check result for one case. The meanings of the fields in the figure are explained in Table 3.
TABLE 3 Field names and meanings of the token replay consistency check results
[Table 3 is rendered as an image in the original publication; its contents are not reproduced in the text layer.]
The 34 cases shown in the figure largely fail to follow the trace of the process model exactly, and their fitness with respect to the process model is mostly concentrated around the value 0.58. By observation, during token replay the cases with low fitness are mostly due to confusion over when the "push" step occurs in the process. In the process model the "push" step usually occurs after the "commit" step, but in the problematic cases the "push" step often occurs at the end or the beginning of the case, which is why these cases fit the process model poorly.
FIG. 10 shows the result of performing consistency checking with the alignment algorithm: by passing in the project's process model and event log and comparing them, the differences between the two can be obtained. The process model for this case was calculated using the Directly-Follows graph algorithm. The field names and corresponding meanings in the figure are explained in Table 4.
TABLE 4 Field names and meanings of the alignment consistency check results
[Table 4 is rendered as an image in the original publication; its contents are not reproduced in the text layer.]
It can be seen that most cases do not align well with the process model; the fitness of the traces of the 38 cases is mostly concentrated between 0.1 and 0.2, which indicates that the Directly-Follows graph algorithm cannot represent the process of the "graylog2-server" project well.
Step 5: evaluate the model based on the event log and the recovered process model.
In the present embodiment, the process models generated using the α algorithm and the Heuristic algorithm are evaluated along four dimensions: fitness, precision, generalization, and simplicity.
Fitness: the primary purpose of this metric is to calculate how faithfully the process model reflects the event log. The two replay methods (token replay and alignments) lead to two different ways of computing fitness: token replay fitness and alignment fitness. The return value of token replay fitness reflects the degree of agreement between the model and the log, expressed as a percentage. Alignment fitness is calculated slightly differently: its return value is the average of the fitness values over all traces;
Precision: this value is also measured in two main ways: the ETConformance method and the Align-ETConformance method. ETConformance is built on top of the token replay consistency check, while Align-ETConformance evaluates the process model based on the idea of the alignment consistency check. When the different prefixes of the log are replayed on the model, each time a given state is reached, the transitions enabled in the model are compared with the set of activities that actually follow that prefix in the log. The greater the difference between the two sets, the lower the precision of the model; the more similar they are, the higher the precision;
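The escaping-edges idea behind ETConformance can be illustrated with a simplified sketch in which the model is abstracted to a map of allowed successor activities (an assumption made for the example; a real implementation replays log prefixes on the Petri net rather than on such a map):

```python
def escaping_edges_precision(traces, enabled_after):
    """Toy ETConformance-style precision.

    At each replayed position, compare the activities the model enables
    (`enabled_after[a]` = set of activities allowed directly after `a`)
    with those actually observed next in the log; activities enabled but
    never used ("escaping edges") lower precision.
    """
    observed = {}
    for trace in traces:
        for a, b in zip(trace, trace[1:]):
            observed.setdefault(a, set()).add(b)
    total_enabled = total_escaping = 0
    for a, followers in observed.items():
        enabled = enabled_after.get(a, set())
        total_enabled += len(enabled)
        total_escaping += len(enabled - followers)
    return 1 - total_escaping / total_enabled if total_enabled else 1.0

model = {"commit": {"push", "skipped"},
         "push": {"script"},
         "script": {"passed", "failed"}}
log = [["commit", "push", "script", "passed"]]
print(round(escaping_edges_precision(log, model), 2))  # 0.6
```

Here the model enables "skipped" and "failed" although the log never uses them, so two of the five enabled edges are escaping and precision is 0.6.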
Generalization: this metric considers whether the elements of the model are visited sufficiently often during replay. A process model may fit its corresponding event log perfectly and have high precision, yet still not generalize well. The calculation formula used by the invention is as follows:
generalization = 1 − avg_t,  avg_t = (1/|T|) · Σ_{t∈T} 1/√freq(t)
where avg_t is the average, over all transitions t, of the values 1/√freq(t), and freq(t) refers to the number of times transition t occurs during replay;
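A minimal sketch of this generalization measure, assuming the form 1 − average of 1/√freq(t) over the transitions (the frequency values below are illustrative):

```python
from math import sqrt

def generalization(transition_freqs):
    """Generalization per the formula above: rarely visited transitions
    contribute large 1/sqrt(freq) terms, lowering the score."""
    terms = [1 / sqrt(f) for f in transition_freqs]
    return 1 - sum(terms) / len(terms)

# Every transition replayed 100 times -> well generalized.
print(round(generalization([100, 100, 100, 100]), 2))  # 0.9
# Every transition replayed once -> no evidence of generality.
print(round(generalization([1, 1, 1, 1]), 2))          # 0.0
```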
Simplicity: the invention measures model simplicity by the inverse arc degree. The degree refers to the sum of the numbers of output edges and input edges of a transition in the Petri net. The method first considers the average degree of the transitions in the Petri net; if every transition has at least one input edge and one output edge, its degree is at least 2. Given a number k ∈ (0, +∞), simplicity is measured by the following formula:
simplicity = 1 / (1 + max(avg_degree − k, 0))
wherein avg _ degree refers to the average degree of a transition in the Petri net.
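A sketch of this inverse-arc-degree simplicity, assuming the form 1/(1 + max(avg_degree − k, 0)) with the reference degree k = 2 (one input arc plus one output arc per transition; the degree values below are illustrative):

```python
def simplicity(degrees, k=2):
    """Inverse-arc-degree simplicity over the transitions' degrees.

    A model whose transitions have on average no more than k incident
    arcs scores 1; denser wiring lowers the score.
    """
    avg_degree = sum(degrees) / len(degrees)
    return 1 / (1 + max(avg_degree - k, 0))

print(simplicity([2, 2, 2, 2]))  # 1.0  (minimal wiring)
print(simplicity([4, 4, 4, 4]))  # ~0.33 (denser model)
```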
The main calculation results are shown in table 5.
TABLE 5 Evaluation of the process models of the five projects in the embodiment
[Table 5 is rendered as an image in the original publication; its contents are not reproduced in the text layer.]
As can be seen from the table, for these five projects the overall effect of the Heuristic algorithm is better than that of the α algorithm, whether in terms of fitness, precision, generalization, or simplicity. This is presumably related to the fact that the CI process data contains a portion of noise data, and the Heuristic algorithm handles noise better than the α algorithm. Looking at the data in detail, the more complex the process, the lower the fitness, precision, and simplicity of the model (as with the projects "android" and "metasploit-framework"); conversely, the simpler the process, the better the algorithms fit the data (as with "sonarqube"). By comprehensively weighing the model evaluation results, the process model generated by the Heuristic algorithm better matches the objective reality represented by the event log and outperforms the α algorithm.
Step 6: based on the event log, perform process pattern clustering and event statistical analysis.
Process pattern clustering groups all process patterns appearing in the event log of a project and counts the number and proportion of each pattern. The clustering results for the "android" project are shown in Table 6.
TABLE 6 Partial excerpt of the process pattern clustering results for the "android" project
[Table 6 is rendered as an image in the original publication; its contents are not reproduced in the text layer.]
The top eight process patterns are listed in the table in descending order of frequency; together these eight patterns account for over 86% of all patterns. As the table shows, the pattern with the largest share is the traditional CI lifecycle pattern, and processes following this pattern account for 53.11% of all cases. Such a process starts from a commit and ends when the CI finally succeeds (passed), passing through code repository checkout, repository cloning, dependency installation, execution of the build script, and reporting of the passing result; it is also the CI configuration most commonly used by the project's developers. The sixth process pattern is the one followed by cases whose builds fail: the failure occurs in the code build step, the overall build result is "failed", and finally the "fix" (defect repair) step marks the end of the CI run. This pattern accounts for 2.49% of all cases, which also indicates that for the "android" project the build pass rate is much higher than the failure rate. Most of the remaining dozens of patterns are related to developers modifying build scripts during the execution of build sessions.
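The pattern counting described above can be sketched by grouping cases with identical activity sequences and reporting each variant's count and share (a minimal sketch; the example traces are invented and do not reproduce Table 6):

```python
from collections import Counter

def pattern_stats(traces):
    """Group cases by their exact activity sequence (process pattern)
    and report (pattern, count, share), most frequent first."""
    variants = Counter(tuple(t) for t in traces)
    total = sum(variants.values())
    return [(v, n, n / total) for v, n in variants.most_common()]

traces = [
    ["commit", "push", "script", "passed"],
    ["commit", "push", "script", "passed"],
    ["commit", "push", "script", "failed", "fix"],
    ["commit", "skipped"],
]
for variant, count, share in pattern_stats(traces):
    print(" -> ".join(variant), count, f"{share:.0%}")
```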
Event statistical analysis statistically analyses the frequency and duration of events from the perspective of individual events, thereby helping the people involved to understand more comprehensively which events occur frequently and how much time each event consumes, and to understand the whole process at a finer granularity. FIGS. 11 and 12 show the results of the event statistical analysis for the "metasploit-framework" project.
As can be seen from fig. 11, in the "metasploit-framework" project the five most frequent events are: "script" (executing the build script), "commit" (committing), "cache" (caching), "before_install" (preparing dependency installation), and "rvm" (Ruby version manager command). It can be seen that the build script is the core step of the CI process; if the whole process is to pass a build, executing the build script is indispensable. In addition, the "push" step occurs noticeably less often than "commit", indicating that a significant portion of commits are skipped after submission, which is consistent with the raw data.
FIG. 12 shows the average duration of each event in the "metasploit-framework" project, in seconds. As can be seen in the figure, the "script" event takes the longest time and accounts for most of the whole CI process, since this step includes not only compiling the code but also testing it. In addition, the "install" step also takes a relatively long time, because installing dependencies involves download operations. The remaining events are either near-instantaneous actions such as "commit", "skipped", and "push", or result-marking actions such as "passed" and "failed", whose duration is 0.
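Frequency and mean-duration statistics of this kind can be sketched as follows (a minimal illustration; the flat list of (activity, duration) pairs is an assumed layout, not the patent's CSV schema, and the numbers are invented):

```python
from collections import defaultdict

def event_stats(events):
    """Per-event (frequency, mean duration in seconds) from a flat
    list of (activity, duration) pairs."""
    durations = defaultdict(list)
    for activity, seconds in events:
        durations[activity].append(seconds)
    return {a: (len(d), sum(d) / len(d)) for a, d in durations.items()}

events = [("script", 300.0), ("script", 500.0),
          ("commit", 0.0), ("install", 120.0)]
stats = event_stats(events)
print(stats["script"])  # (2, 400.0)
```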
The results of the embodiment show that the automated software process modeling method based on process mining can effectively help software process stakeholders establish a realistic CI process model. The check results show that the process models discovered by different algorithms present different mining results; for this embodiment, the process model recovered using the Heuristics algorithm is superior to the other models on evaluation criteria such as fitness and precision. Furthermore, by comparing the mined process model with the CI lifecycle model, it can be found that the process model mined from the event logs covers part of the phases of the CI build lifecycle, namely committing code changes, cloning the code repository, installing dependencies, building the source code, running tests, and reporting results to the development team. Deployment is an optional configuration item in Travis CI that is not included in the default configuration file, and developers rarely configure the deployment function when customizing the CI process. From a case perspective, not every case contains all the phases of the CI lifecycle, for three main reasons: first, developers modify the CI configuration file ".travis.yml" themselves, changing the command details of a step; second, there are a large number of interrupted builds, so a case cannot cover all lifecycle phases; and third, logging errors, where manual operations or other causes during log recording leave the events out of order or the steps incomplete.
Embodiment 2:
fig. 13 is a schematic structural diagram of an automated software process modeling system based on process mining according to embodiment 2 of the present invention, including: the system comprises a data preprocessing module, a process mining module and a process parameter analysis module.
The data preprocessing module extracts, cleans, integrates and converts data of different contents and formats from different data sources to generate a process event log meeting the requirements, and comprises the following units:
a data acquisition unit: stores the commit log of a given project into a database table; obtains the Job log of each build of the project from the CI tool, downloads the logs to the server, writes the storage path into the database, and finally returns the corresponding Job log file paths; and acquires the project build information file, downloading the project's build information from the remote server, saving the file, and returning the corresponding storage path;
a data preprocessing unit: extracts valid information from the different log files and merges it into process records; converts the acquired TXT-format logs into CSV-format files and returns the storage paths of the files; and merges the valid information in all the process files generated in the previous stage to finally produce a standard event log containing cases, events, timestamps and other related information, stored in CSV format.
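The TXT-to-CSV conversion step can be sketched as follows (a minimal illustration: the "case,timestamp,event" line layout is a hypothetical stand-in, since real Job logs need source-specific extraction):

```python
import csv
import io

def txt_to_event_csv(txt_lines, out_file):
    """Parse simple 'case, timestamp, event' text lines and write a
    standard CSV event log with a header row."""
    writer = csv.writer(out_file)
    writer.writerow(["case_id", "timestamp", "event"])
    for line in txt_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines (noise)
        case_id, timestamp, event = [f.strip() for f in line.split(",", 2)]
        writer.writerow([case_id, timestamp, event])

buf = io.StringIO()
txt_to_event_csv(["build-1, 2022-06-28T10:00:00, commit",
                  "build-1, 2022-06-28T10:01:00, push"], buf)
print(buf.getvalue().splitlines()[0])  # case_id,timestamp,event
```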
The process mining module integrates process discovery and model evaluation algorithms, recovers the objective, concrete process model and evaluates it using model evaluation methods, and comprises the following units:
a process discovery unit: provides services for invoking the α algorithm, the Heuristic algorithm and the Directly-Follows graph algorithm to compute process models; it locates the storage position of the corresponding event log in the database and, based on the obtained event log, computes and returns the final process model using the process discovery algorithm selected by the modeler;
a model evaluation unit: provides a model consistency check service, comparing the process model with the event log through the token replay and alignment algorithms and computing and returning the differences between the two; it also evaluates the process model: the process model and the corresponding event log are passed in, and the model is evaluated comprehensively along the four dimensions of fitness, precision, generalization and simplicity.
The process parameter analysis module provides a series of process parameter analysis functions, such as parameter statistics and clustering, making it easier for the modeler to understand the specific process of the target object intuitively, and comprises the following unit:
a parameter analysis unit: provides analysis functions for project process parameters. It first extracts the event logs from the database, performs analyses such as clustering and frequency statistics on the events based on the data provided by the event logs, and finally returns a parameter analysis report. For the two parts of algorithm computation and result visualization, the statistical methods and libraries mainly used are as follows:
the DBSCAN algorithm is used as the basic clustering algorithm; it has the advantages of being simple, not requiring the number of clusters to be specified in advance, and being insensitive to noise. In this embodiment, the system mainly uses this clustering algorithm to partition and count the different patterns in the process. In terms of implementation, the core idea of the algorithm is to start from a core point and keep expanding to the density-reachable range, finally obtaining a maximal region containing core points and border points, in which any two points are density-connected.
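The core-point expansion described above can be sketched with a minimal DBSCAN implementation (illustrative only; the sample points are invented, and a production system would use an indexed implementation such as scikit-learn's `DBSCAN`):

```python
def dbscan(points, eps, min_pts, dist=lambda a, b: abs(a - b)):
    """Minimal DBSCAN: grow clusters from core points (those with at
    least min_pts neighbours within eps); points reachable from no core
    point stay labelled -1 (noise)."""
    labels = [None] * len(points)  # None = unvisited
    neighbours = lambda i: [j for j in range(len(points))
                            if dist(points[i], points[j]) <= eps]
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1             # noise (may be claimed later)
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] in (None, -1):          # density-reachable point
                was_noise = labels[j] == -1
                labels[j] = cluster
                if not was_noise:
                    more = neighbours(j)
                    if len(more) >= min_pts:     # j is itself a core point
                        queue.extend(m for m in more
                                     if labels[m] is None or labels[m] == -1)
    return labels

print(dbscan([1.0, 1.1, 1.2, 5.0, 5.1, 9.0], eps=0.3, min_pts=2))
# [0, 0, 0, 1, 1, -1]: two dense groups plus one noise point
```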
Seaborn is a library that further wraps Matplotlib, making plotting more convenient. In this embodiment, the system calls Python's Seaborn data visualization library to produce efficient and attractive data visualizations.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions without departing from the scope of the invention. Still other equivalent embodiments may be included without departing from the inventive concept, the scope of which is determined by the claims appended hereto.

Claims (10)

1. An automated software process modeling method based on process mining is characterized by comprising the following steps:
S1: data acquisition: acquiring log data and statistical data sets of various contents and formats from software repositories, public data sets and other process data sources related to the software process;
S2: data preprocessing: performing data extraction, cleaning, integration and conversion on the different log data and statistical data, and separating events of different granularities to obtain a standard event log;
S3: process discovery: applying a process discovery algorithm to the event log to construct a process model;
S4: consistency checking: replaying the event log on the model, and comparing the differences between the model and the log through a consistency check algorithm;
S5: model evaluation: evaluating the performance of the model along different dimensions so as to select the most appropriate process model;
S6: process parameter analysis: performing statistical mining on the event log, and displaying more detailed process information through process clustering, frequency statistics and time statistics.
2. The process mining based automated software process modeling method of claim 1, wherein the S2 comprises the steps of:
s201: unifying the log formats of different data sources into a CSV format;
s202: searching corresponding event information in logs of different data sources by using the object identification as a clue, and merging the obtained event information;
s203: completing the timestamp of the step missing in the event log;
S204: cleaning abnormal noise data in the event log: if a case's log records are inconsistent, the case is deleted; if only an individual event is problematic, the event is retained;
s205: and separating the event logs according to different granularities, and separating the event logs into versions with different granularities.
3. The process mining based automated software process modeling method of claim 1, wherein the event log of the S2 standard is defined as:
each Event in the event log represents an independent activity or step in the process. An event is associated with a specific Case, which refers to one complete process instance in the record; the different events within a case are ordered in time, and an event can be regarded as one step of the case it belongs to. In addition to recording the event itself, the event log may also store other relevant information, such as the resource (person or equipment) that performed the event, the timestamp of the event's occurrence, and other data elements recorded together with the event. Most process mining techniques have explicit requirements on the storage format of standard event logs, which are typically stored in the XES and CSV formats.
4. The process mining based automated software process modeling method of claim 1, wherein the S3 comprises the steps of:
s301: constructing a process model by taking a standard event log as an input of a process discovery algorithm;
s302: the process model is presented in conjunction with different model languages.
5. The method of claim 1, wherein the evaluation metrics in S5 include Replay Fitness (Replay fit), precision (Precision), generalization (Generalization), and Simplicity (Simplicity).
6. The process mining based automated software process modeling method of claim 1, wherein the S6 comprises:
counting the frequency and duration of events in the process;
and clustering analysis, namely clustering different process modes in the same business process.
7. An automated software process modeling system based on process mining, characterized by comprising a data preprocessing module, a process mining module and a process parameter analysis module, wherein:
the data preprocessing module extracts, cleans, integrates and converts data from different data sources and in different contents and formats to generate a process event log meeting requirements;
the process mining module integrates a process discovery algorithm and a model evaluation algorithm, restores an objective specific process model and evaluates the objective specific process model by using a model evaluation method;
the process parameter analysis module provides a series of process parameter analysis functions, such as parameter statistics and clustering, making it easier for the modeler to understand the specific process of the target object intuitively.
8. The automated software process modeling system based on process mining of claim 7, wherein the data preprocessing module preprocesses software process data in an automated manner, the module consisting essentially of:
the data preprocessing function: performing secondary processing on all logs, including definition and extraction of effective data, cleaning of noise data and normalization and integration of data;
data file format conversion function: the event log is converted into a unified CSV or XES format to meet the requirements of the algorithm input.
9. The automated software process modeling system based on process mining of claim 7, wherein said process mining module automatically mines software processes using a series of techniques of process mining, the modules consisting essentially of:
the process discovery function: based on the event log, restoring a real software process through a process discovery algorithm, and finally outputting a discovery result in a mode of a model diagram;
the consistency checking function: verifying a difference between the event log and the process model by replaying the log on the model;
a model evaluation function: and evaluating the restored process model, and evaluating the performance of the model from four dimensions of fitness, accuracy, generalization capability and simplicity.
10. The process mining-based automated software process modeling system of claim 7, wherein the process parameter analysis module, essentially comprises:
the process pattern clustering function: clustering different process modes in the software process, and counting the occurrence frequency of the different process modes;
event frequency statistics function: all events in the software process are described in terms of occurrence frequency and duration, and the statistical result is visualized.
CN202210745436.8A 2022-06-28 2022-06-28 Automatic software process modeling method and system based on process mining Pending CN115374595A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210745436.8A CN115374595A (en) 2022-06-28 2022-06-28 Automatic software process modeling method and system based on process mining

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210745436.8A CN115374595A (en) 2022-06-28 2022-06-28 Automatic software process modeling method and system based on process mining

Publications (1)

Publication Number Publication Date
CN115374595A true CN115374595A (en) 2022-11-22

Family

ID=84062231

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210745436.8A Pending CN115374595A (en) 2022-06-28 2022-06-28 Automatic software process modeling method and system based on process mining

Country Status (1)

Country Link
CN (1) CN115374595A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116225513A (en) * 2023-05-09 2023-06-06 安徽思高智能科技有限公司 RPA dynamic flow discovery method and system based on concept drift
CN116225513B (en) * 2023-05-09 2023-07-04 安徽思高智能科技有限公司 RPA dynamic flow discovery method and system based on concept drift
CN116258350A (en) * 2023-05-15 2023-06-13 烟台岸基网络科技有限公司 Sea container transportation monitoring method
CN116258350B (en) * 2023-05-15 2023-08-11 烟台岸基网络科技有限公司 Sea container transportation monitoring method

Similar Documents

Publication Publication Date Title
US11023358B2 (en) Review process for evaluating changes to target code for a software-based product
Van der Aalst et al. Discovering workflow performance models from timed logs
Rozinat et al. Conformance testing: Measuring the fit and appropriateness of event logs and process models
Fu et al. Execution anomaly detection in distributed systems through unstructured log analysis
CN110928772A (en) Test method and device
Illes-Seifert et al. Exploring the relationship of a file’s history and its fault-proneness: An empirical method and its application to open source programs
CN115374595A (en) Automatic software process modeling method and system based on process mining
JP5207007B2 (en) Model verification system, model verification method and recording medium
US20110307865A1 (en) User interface inventory
Jaafar et al. An exploratory study of macro co-changes
Söylemez et al. Challenges of software process and product quality improvement: catalyzing defect root-cause investigation by process enactment data analysis
CN111913824B (en) Method for determining data link fault cause and related equipment
CN110941554A (en) Method and device for reproducing fault
Chen et al. Trace-based intelligent fault diagnosis for microservices with deep learning
Sahlabadi et al. Process Mining Discovery Techniques for Software Architecture Lightweight Evaluation Framework.
US20200327125A1 (en) Systems and methods for hierarchical process mining
Tarvo Mining software history to improve software maintenance quality: A case study
CN116756021A (en) Fault positioning method and device based on event analysis, electronic equipment and medium
CN115757175A (en) Transaction log file processing method and device
CN115599346A (en) Full life cycle digital development method for application software
Faragó Connection between version control operations and quality change of the source code
Ramler et al. Noise in bug report data and the impact on defect prediction results
US8631391B2 (en) Method and a system for process discovery
Eramo et al. Model-driven design-runtime interaction in safety critical system development: an experience report
Helal et al. Correlating unlabeled events at runtime

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination