CN109902954B

CN109902954B - Flexible job shop dynamic scheduling method based on industrial big data

Info

Publication number: CN109902954B
Application number: CN201910144370.5A
Authority: CN
Inventors: 汤洪涛; 费永辉; 闫伟杰; 陈程; 梁佳炯; 程晓雅; 王丹南; 李晋青
Original assignee: Zhejiang University of Technology ZJUT
Current assignee: Zhejiang University of Technology ZJUT
Priority date: 2019-02-27
Filing date: 2019-02-27
Publication date: 2020-11-13
Anticipated expiration: 2039-02-27
Also published as: CN109902954A

Abstract

A flexible job shop dynamic scheduling method based on industrial big data comprises the following steps. Step one: using the data acquisition tools Sqoop and Flume to acquire scheduling data from a database or a file system and storing the scheduling data in the HDFS file system. Step two: dividing the scheduling data by scheduling scheme through the data warehouse tool Hive. Step three: converting the scheduling data into training examples using the Spark computing framework, and storing the training examples in HBase, organized by scheduling scheme. Step four: screening on performance indexes to obtain the scheduling data set generated during the execution of well-performing scheduling schemes. Step five: clustering the scheduling-related historical data based on disturbance attributes. Step six: mining random forest scheduling rules using an improved random forest algorithm. Step seven: guiding the dynamic scheduling of the flexible job shop with the mined scheduling rules. The method has high practical operability and high computational efficiency, and can respond quickly to shop disturbances in real time.

Description

Flexible job shop dynamic scheduling method based on industrial big data

Technical Field

The invention relates to a flexible job shop dynamic scheduling method based on industrial big data.

Background

Scheduling plays an important role in a manufacturing system, and scheduling quality affects the competitiveness of the manufacturing enterprise. A scientific and reasonable scheduling scheme for the shop floor improves production efficiency, reduces process cost, shortens the product production cycle, and helps ensure on-time delivery and product quality. A flexible job shop has flexible process routes and can respond rapidly to market demand, so it is well suited to multi-variety, small-batch production and has become a widely used production mode. Flexible job shop dynamic scheduling extends static scheduling by taking into account the disturbances of the actual production environment; it is therefore closer to actual production and of greater research significance.

As product requirements shift toward individualization, manufacturing processes become more diversified and actual scheduling problems become more complex. Solving the shop scheduling problems of manufacturing enterprises therefore places higher demands on practical operability, computational efficiency, and real-time responsiveness to shop disturbances. A priority scheduling rule is a simple heuristic rule with high computational efficiency and strong practical operability; it can be used for real-time scheduling and is suitable for complex, dynamic scheduling environments. However, the performance of a priority scheduling rule is affected by changes in the actual environment, and no single scheduling rule performs well in all disturbance environments. To meet the requirements of actual job shop scheduling, one feasible idea is to mine scheduling knowledge about scheduling rules from scheduling-related historical data to guide actual shop scheduling activities. Research on solving scheduling problems through data mining falls mainly into two categories: methods that combine existing priority scheduling rules, and methods that mine new scheduling rules from scheduling-related historical data.

Regarding the combination of existing priority scheduling rules, WANG Shuang-Xi et al. (A hybrid scheduling model using a decision tree and a neural network for selecting scheduling rules of a semiconductor final decision factor, 2005) proposed a method that combines a decision tree and a neural network to mine a priority scheduling rule selection mechanism from scheduling-related historical data; the selection mechanism yields the most suitable priority scheduling rule for the current environment. SHIUE Y.R. et al. (Data-based scheduling rule selection mechanism for a dispatch control system using a Support Vector Machine (SVM)) proposed a method for mining a priority scheduling rule selection mechanism from scheduling-related historical data and made real-time scheduling decisions based on it. Mouelhi et al. (Training a neural network to select scheduling rules in real time, 2009) proposed a scheduling rule selection method based on a neural network, which learns a real-time scheduling rule selection method from scheduling-related historical data generated by simulation.

Existing priority scheduling rules make scheduling decisions using only a small amount of information, which may lead to unsatisfactory scheduling results, so another idea is to extract new scheduling rules from the scheduling-related historical data. LI X et al. (partitioning scheduling rules using marking, 2005) proposed a method that uses a decision tree to obtain a brand-new scheduling rule from scheduling-related historical data, and showed experimentally that the extracted scheduling rule fits the original scheduling scheme well. SIGURDUR OLAFSSON et al. (Learning effective new single machine scheduling from scheduling data, 2010) proposed a two-stage scheduling knowledge learning method. Wangchangong et al. (Research on mining methods of job shop scheduling rules, 2015) proposed a scheduling rule mining method that combines a branch-and-bound algorithm based on Petri net modeling with a decision tree algorithm; the extracted scheduling rules can be used to guide the scheduling of static job shops.

In summary, current methods for mining scheduling rules from scheduling-related historical data mainly target static shop scheduling problems and are rarely applied to the dynamic scheduling of flexible job shops. In addition, the scheduling-related historical data used by these methods tends to be theoretical data. With the widespread use of intelligent sensing equipment on the shop floor, however, shops are becoming intelligent, and shop scheduling-related historical data now has the characteristics of industrial big data, such as large scale, low value density, continuous sampling, and high dimensionality.

Disclosure of Invention

In order to solve the problems that existing flexible job shop dynamic scheduling methods have low practical operability, low computational efficiency, and insufficient real-time responsiveness to shop disturbances, the invention provides a flexible job shop dynamic scheduling method that has high practical operability and computational efficiency and can respond to shop disturbances in real time.

The technical scheme adopted by the invention for solving the technical problems is as follows:

A flexible job shop dynamic scheduling method based on industrial big data comprises the following steps:

Step one, data acquisition: scheduling-related historical data is collected from existing information systems using data collection tools in the Hadoop ecosystem and is stored in the HDFS file system.

Step two, data integration: using SQL statements through the data warehouse tool Hive, the scheduling data set D_h in the HDFS file system is divided by scheduling scheme; that is, the scheduling-related historical data generated during the execution of one scheduling scheme is grouped together.

Step three, data conversion: the integrated data is converted into training examples using Spark, so that a data mining algorithm can mine scheduling rules from it.

Step four, data screening: the historical scheduling schemes are evaluated on three indexes, namely maximum completion time, total delay time, and total machine load, and screened to obtain the scheduling-related historical data set generated during the execution of well-performing scheduling schemes. The method specifically comprises the following steps:

Step 4.1: on the maximum completion time index, the maximum completion time of the scheduling scheme generated using only the SPT rule under the same conditions is used as the screening standard.

Step 4.2: on the total delay time index, the total delay time obtained by completing the scheduling task with the EDD rule combined with the SPT rule under the same conditions is used as the screening standard.

Step 4.3: on the total machine load index, the total machine load obtained by completing the scheduling task with the LMWT rule combined with the SPT rule under the same conditions is used as the screening standard. The scheduling data sets generated during the execution of scheduling schemes that meet all three indexes simultaneously are used as the input of the scheduling rule mining algorithm.

Step five, clustering based on disturbance attributes: the screened scheduling-related historical data is clustered with the DBSCAN clustering method, taking a scheduling scheme as the unit (i.e., the data generated by one scheduling scheme is treated as one object) and clustering on the disturbance attributes. The method specifically comprises the following steps:

Step 5.1: the disturbance data recorded when the schemes were executed is standardized. If the data of a certain disturbance attribute across the schemes are X1, X2, X3, ..., Xn, they are transformed according to formula (1).

$$Y_i = \frac{X_i - \bar{X}}{S}, \quad i = 1, 2, \ldots, n \tag{1}$$

In formula (1), $\bar{X}$ is the mean of the attribute; S is its standard deviation; Y1, Y2, Y3, ..., Yn are the standardized data.

Step 5.2: the parameters of the DBSCAN algorithm are determined: the neighborhood radius Eps, and MinPts, the minimum number of objects that the neighborhood of a core object must contain.

Step 5.3: a core object p is found at random, and a new cluster is created with p as its core object. Objects that are directly density-reachable from p are repeatedly found and added to the cluster.

Step 5.4: step 5.3 is repeated until no new points can be added to any cluster, at which point the process ends.

Step six, mining random forest scheduling rules: using an improved random forest algorithm, random forest scheduling rule 1, which solves the problem of selecting a machine for a workpiece, and random forest scheduling rule 2, which solves the problem of selecting a workpiece for an idle machine, are mined from each cluster obtained by the clustering. The method specifically comprises the following steps:

Step 6.1: for each cluster, training examples are drawn from the cluster with replacement to form k new training example sets for constructing decision trees.

Step 6.2: m feature attributes are selected at random, the optimal split is computed, and k decision trees are trained, one per training example set.

Step 6.3: the classification performance of the decision trees is tested using the training examples in the cluster that were not selected.

Step 6.4: it is determined whether similar decision trees exist; if so, the one that performed better in the test is retained. The retained decision trees form the random forest.

Step 6.5: finally, the weights w and h of each decision tree are calculated according to a Bayesian voting mechanism to obtain random forest scheduling rule 1 and random forest scheduling rule 2.

Step seven, use of the scheduling rules: the mined random forest scheduling rules guide the dynamic scheduling of the flexible job shop. The method specifically comprises the following steps:

Step 7.1: depending on whether the problem to be solved is selecting a machine for a workpiece or selecting a workpiece for an idle machine, random forest scheduling rule 1 or random forest scheduling rule 2 is found for the cluster to which the current disturbance environment of the flexible job shop belongs.

Step 7.2: using the selected random forest scheduling rule, the most suitable machine or workpiece is selected from the candidate machine set M or the candidate workpiece set J through pairwise comparison.

The technical conception of the invention is as follows: shop scheduling-related historical data has the characteristics of industrial big data, such as large scale, low value density, continuous sampling, and high dimensionality, so the preprocessing of the scheduling-related historical data is carried out with big data technology. FIG. 2 shows the data preprocessing model incorporating big data technology. The flexible job shop dynamic scheduling problem consists of solving, in a disturbed environment, the problem of selecting a machine for a workpiece and the problem of selecting a workpiece for an idle machine, so the collected data set D_h comprises three parts: d1 is the disturbance-related system information at the time the scheduling scheme was formulated; d2 is the state information of each machine in the set of machines currently able to process an operation, recorded when a processing machine is selected for that operation of a workpiece; d3 is the state information of each workpiece in the current waiting queue, recorded when an idle machine needs to select a workpiece from the queue for processing. The data in the scheduling data set D_h is disordered and cannot be used directly for the subsequent data screening, clustering, and scheduling rule mining, so the data in D_h needs to be organized through data integration and conversion. The scheduling data set D_h hides a large amount of useful information reflecting the characteristics of the actual scheduling environment and scheduling knowledge, but it is also accompanied by many useless or erroneous rules or patterns. Therefore, the multi-index data screening mechanism of FIG. 3 is adopted to evaluate the historical scheduling schemes on three aspects, namely maximum completion time, total delay time, and total machine load, and to retain the data generated during the execution of the historical scheduling schemes that meet all three indexes.

The random forest algorithm is used as the scheduling rule mining algorithm, and the resulting scheduling rule is the random forest it constructs, which is essentially a set of trained C4.5 decision trees. The scheduling performance of the rule depends on the classification performance of the decision trees, and its computational efficiency and complexity depend on the number of branches of the decision trees. Clustering with DBSCAN partitions the well-performing scheduling data D_b in a reasonable way, separating the data generated by scheduling decisions made under different disturbance environments, so that scheduling rules specific to each disturbance environment are obtained from each partition. This improves the classification performance of the decision trees in the resulting random forest scheduling rules and reduces their number of branches, giving scheduling rules with lower complexity, higher computational efficiency, and better scheduling performance.

A scheduling rule f is learned from the scheduling-related historical data by the random forest algorithm. Since f is only an estimate of the true scheduling rule y, there is a certain error between f and y. The error comprises three parts: the noise σ², the variance of f, and the bias of f. The noise σ² is unavoidable, but the error of the algorithm can be reduced by reducing the variance or the bias, thereby improving the performance of the random forest algorithm. The variance can in turn be reduced by reducing the correlation ρ between the decision trees; therefore, if the similarity between two decision trees is too high, only the one with the better test performance is retained, which reduces ρ. A traditional random forest algorithm uses majority voting, in which every decision tree in the forest carries the same weight regardless of its classification performance. Under such a mechanism, decision trees with poor classification performance influence the final result as much as decision trees with good classification performance. The present method therefore adopts a Bayesian voting mechanism, which assigns each decision tree a weight based on its classification performance in the test and then votes according to these weights.
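As an illustration only, the error decomposition and the effect of the correlation ρ described above can be written in their standard textbook form. The symbols k (number of trees), T_i (the i-th tree), and τ² (the variance of a single tree) are notation assumed here and do not come from the source; the second identity also assumes identically distributed trees with a common pairwise correlation ρ.

```latex
\mathbb{E}\big[(y - f)^2\big] = \sigma^2 + \mathrm{Bias}(f)^2 + \mathrm{Var}(f),
\qquad
\mathrm{Var}\Big(\frac{1}{k}\sum_{i=1}^{k} T_i\Big) = \rho\,\tau^2 + \frac{1-\rho}{k}\,\tau^2
```

Under these assumptions the first term of the variance does not shrink as k grows, which is why lowering ρ by discarding near-duplicate trees can lower the variance of the forest.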

The invention has the following beneficial effects: taking as its main framework the mining of scheduling rules from scheduling-related historical data with the characteristics of industrial big data in order to guide scheduling, the invention establishes a data preprocessing model combined with big data technology, which improves the speed and accuracy of data preprocessing; establishes a clustering mechanism based on disturbance attributes, which reduces the complexity of the scheduling rules and improves their computational efficiency and scheduling performance; and establishes a scheduling rule mining model based on an improved random forest algorithm, which improves the generalization ability and scheduling performance of the scheduling rules.

Drawings

FIG. 1 shows the overall scheduling rule mining architecture of the present invention.

FIG. 2 shows the scheduling data preprocessing model of the present invention, which incorporates big data technology.

FIG. 3 shows the multi-index data screening mechanism of the present invention.

FIG. 4 is a flow chart of scheduling rule mining with the improved random forest algorithm of the present invention.

FIG. 5 shows a scheduling scheme obtained with the flexible job shop dynamic scheduling method based on industrial big data of the present invention.

Detailed Description

Referring to FIG. 1 to FIG. 5, a flexible job shop dynamic scheduling method based on industrial big data, whose overall framework is shown in FIG. 1, is divided into three parts: the first part is a scheduling data preprocessing model combined with big data technology (see FIG. 2), which is divided into data acquisition, data integration, data conversion, and data screening; the second part is a clustering strategy based on disturbance attributes; and the third part is a scheduling rule mining model based on the improved random forest algorithm. The general technical steps are as follows:

Step one, data acquisition: scheduling-related historical data is collected from existing information systems such as MES, ERP, and SCADA using the data collection tools Sqoop and Flume in the Hadoop ecosystem, and is stored in the HDFS file system. The collected data comprises three parts, D_h = {d1, d2, d3}: d1 is the disturbance-related system information at the time the scheduling scheme was formulated; d2 is the state information of each machine in the set of machines currently able to process an operation, recorded when a processing machine is selected for that operation of a workpiece; d3 is the state information of each workpiece in the current waiting queue, recorded when an idle machine needs to select a workpiece from the queue for processing.
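The three-part structure of D_h, and the later division of the data by scheduling scheme, can be pictured with a small plain-Python sketch. This is only an illustration: the method itself uses Hive SQL on HDFS, and the record layout (`scheme_id`, `part`, `values`) and the field names inside `values` are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class SchemeData:
    """All records produced while one scheduling scheme was executed."""
    d1: list = field(default_factory=list)  # system disturbance information
    d2: list = field(default_factory=list)  # machine states (machine selection)
    d3: list = field(default_factory=list)  # workpiece states (workpiece selection)


def group_by_scheme(records):
    """Group raw records by scheduling scheme.

    Each record is a dict with the hypothetical keys 'scheme_id',
    'part' ('d1', 'd2' or 'd3') and 'values'.
    """
    schemes = defaultdict(SchemeData)
    for rec in records:
        getattr(schemes[rec["scheme_id"]], rec["part"]).append(rec["values"])
    return dict(schemes)


records = [
    {"scheme_id": "S1", "part": "d1", "values": {"machine_failures": 1}},
    {"scheme_id": "S1", "part": "d2", "values": {"machine": "m1", "queue_len": 2}},
    {"scheme_id": "S2", "part": "d3", "values": {"job": "j4", "due": 14.0}},
]
grouped = group_by_scheme(records)
```

Keeping one object per scheduling scheme matches the later steps, where screening and clustering both operate on whole schemes rather than on individual records.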

Step three, data conversion: the d2 and d3 parts of the integrated data are converted into training examples using Spark, so that a data mining algorithm can mine scheduling rules from them. The method specifically comprises the following steps:

Step 3.1: for the d2 part of the collected scheduling data set D_h, the machine m1 actually selected in a historical scheduling scheme is regarded as the most suitable machine and is compared one by one with the machines in the set of other candidate machines {m2, m3, ...} that can process the operation, forming training examples.

Step 3.2: for the d3 part of the collected scheduling data set D_h, the workpiece j1 actually selected in a historical scheduling scheme is regarded as the most suitable workpiece and is compared one by one with the workpieces in the set of other workpieces {j2, j3, ...} waiting for processing, forming training examples.
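The pairwise comparison in steps 3.1 and 3.2 might look like the following minimal sketch. The source does not specify how a pair is encoded, so two assumptions are made here: a pair is represented by concatenating the two feature lists, and each pair is also emitted in mirrored order so that both classes occur. The feature names in the example are hypothetical.

```python
def pairwise_examples(chosen, alternatives):
    """Build training examples by comparing the chosen object with each alternative.

    `chosen` and every element of `alternatives` are feature lists of equal
    length. Class 1 means "the first object of the pair is the better choice",
    class 2 means "the second object is the better choice".
    """
    examples = []
    for alt in alternatives:
        examples.append((chosen + alt, 1))  # chosen object listed first
        examples.append((alt + chosen, 2))  # mirrored pair (an assumption)
    return examples


# Hypothetical machine features: [processing_time, queue_length]
m1 = [3.0, 1]                    # machine actually selected in the history
others = [[4.5, 0], [3.5, 2]]    # other machines able to process the operation
examples = pairwise_examples(m1, others)
```

The class labels 1 and 2 are chosen to line up with the "first is proper" and "second is proper" outcomes used when the rules are applied in step 7.2.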

Step four, data screening: the historical scheduling schemes are evaluated on three indexes, namely maximum completion time, total delay time, and total machine load, and screened to obtain the scheduling-related historical data set D_b generated during the execution of well-performing scheduling schemes. The method specifically comprises the following steps:

Step 4.1: on the maximum completion time index, the maximum completion time of the scheduling scheme generated using only the SPT rule under the same conditions is used as the screening standard. Using only the SPT rule means that a workpiece selects the machine that processes it fastest and an idle machine selects the workpiece with the shortest processing time. A scheduling scheme whose maximum completion time meets the index proceeds to step 4.2; otherwise it is eliminated.

Step 4.2: on the total delay time index, the total delay time obtained by completing the scheduling task with the EDD rule combined with the SPT rule under the same conditions is used as the screening standard. The SPT + EDD rule means that a workpiece selects the machine that processes it fastest and an idle machine selects the workpiece with the earliest delivery date. A scheduling scheme whose total delay time meets the index proceeds to step 4.3; otherwise it is eliminated.

Step 4.3: on the total machine load index, the total machine load obtained by completing the scheduling task with the LMWT rule combined with the SPT rule under the same conditions is used as the screening standard. The LMWT + SPT rule means that a workpiece selects the machine with the longest idle time and an idle machine selects the workpiece with the shortest processing time. The scheduling data sets generated during the execution of scheduling schemes that meet all three indexes simultaneously are used as the input of the scheduling rule mining algorithm.
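The three-stage screening can be sketched as a simple filter. The source does not say what "meets the index" means numerically; this sketch assumes a scheme passes when its value is no worse than (less than or equal to) the benchmark produced by the corresponding rule combination. The `Metrics` type and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    makespan: float      # maximum completion time
    total_delay: float   # total delay time
    machine_load: float  # total machine load


def passes_screening(scheme, spt, edd_spt, lmwt_spt):
    """Return True if a historical scheme meets all three benchmarks in turn.

    spt, edd_spt and lmwt_spt hold the metrics of the benchmark schedules
    built under the same conditions with the respective rule combinations.
    """
    if scheme.makespan > spt.makespan:            # step 4.1
        return False
    if scheme.total_delay > edd_spt.total_delay:  # step 4.2
        return False
    return scheme.machine_load <= lmwt_spt.machine_load  # step 4.3


spt = Metrics(22.0, 9.0, 101.0)
edd_spt = Metrics(24.0, 6.0, 103.0)
lmwt_spt = Metrics(25.0, 8.0, 98.0)

good = Metrics(21.0, 5.5, 97.0)
slow = Metrics(23.0, 5.0, 96.0)   # fails the maximum completion time check
kept = [s for s in (good, slow) if passes_screening(s, spt, edd_spt, lmwt_spt)]
```

Only the schemes in `kept` would contribute their data to D_b.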

Step five, clustering based on disturbance attributes: DBSCAN is used to cluster D_b, taking a scheduling scheme as the unit (i.e., the data generated by one scheduling scheme is treated as one object), based on the system disturbance attributes (part d1) recorded when the schemes in D_b were formulated. The method specifically comprises the following steps:

Step 5.1: the d1 data is standardized. If the data of a certain disturbance attribute across the schemes are X1, X2, X3, ..., Xn, they are transformed according to formula (1).

$$Y_i = \frac{X_i - \bar{X}}{S}, \quad i = 1, 2, \ldots, n \tag{1}$$

In formula (1), $\bar{X}$ is the mean of the attribute; S is its standard deviation; Y1, Y2, Y3, ..., Yn are the standardized data.
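A minimal sketch of the standardization in step 5.1 follows, assuming it is the usual z-score transform (subtract the mean, divide by the standard deviation). Whether the sample or the population standard deviation is intended is not stated in the source; the sample version is used here as an assumption.

```python
import statistics


def standardize(values):
    """Z-score standardization of one disturbance attribute across schemes."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values)  # sample standard deviation (an assumption)
    if std == 0:
        # A constant attribute carries no information for clustering.
        return [0.0 for _ in values]
    return [(x - mean) / std for x in values]


# Hypothetical values of one disturbance attribute in three schemes
y = standardize([2.0, 4.0, 6.0])  # mean 4, sample standard deviation 2
```

Standardizing each attribute separately keeps attributes with large numeric ranges from dominating the distance computation in DBSCAN.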

Step 5.3: an unprocessed core object p is found at random (unprocessed meaning not yet assigned to a cluster or marked as noise; a core object being one whose neighborhood of radius Eps contains at least MinPts objects), a new cluster C is created, and all objects within the Eps-neighborhood of p are added to a candidate set N.

Step 5.4: an unprocessed object q is picked at random from the candidate set N. If q is a core object, the objects within the Eps-neighborhood of q that are unprocessed and not yet in N are added to N. If q does not belong to any cluster, q is added to C.

Step 5.5: step 5.4 is repeated until N is empty.

Step 5.6: steps 5.3, 5.4, and 5.5 are repeated until no new objects can be added to any cluster, at which point the process ends.
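The cluster expansion in steps 5.3 to 5.6 can be sketched in plain Python as follows. This is an illustrative implementation of standard DBSCAN, not the patent's code: it scans objects in index order rather than at random, uses Euclidean distance, and represents each scheduling scheme by a tuple of standardized disturbance attributes (the example points are hypothetical).

```python
import math


def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i], including i."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    unseen, noise = None, -1
    labels = [unseen] * len(points)
    cluster = -1
    for p in range(len(points)):
        if labels[p] is not unseen:
            continue
        neighbours = region_query(points, p, eps)
        if len(neighbours) < min_pts:       # p is not a core object
            labels[p] = noise
            continue
        cluster += 1                        # step 5.3: start a new cluster C
        labels[p] = cluster
        candidates = [q for q in neighbours if q != p]  # candidate set N
        while candidates:                   # steps 5.4 and 5.5
            q = candidates.pop()
            if labels[q] == noise:          # border object reached from a core
                labels[q] = cluster
            if labels[q] is not unseen:
                continue
            labels[q] = cluster
            q_neighbours = region_query(points, q, eps)
            if len(q_neighbours) >= min_pts:  # q is itself a core object
                candidates.extend(
                    r for r in q_neighbours
                    if labels[r] is unseen or labels[r] == noise
                )
    return labels


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

With these hypothetical points the first three objects form one cluster, the next three form a second, and the isolated last object is labeled as noise.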

Step 6.1: for each cluster, training examples are drawn with replacement from d2 (for mining random forest scheduling rule 1) and from d3 (for mining random forest scheduling rule 2) in the cluster, forming k new training example sets P1 and P2, respectively, for constructing decision trees.

Step 6.2: for P1 and P2, m feature attributes are selected at random from d2 and d3 respectively, the optimal split is computed, and k decision trees T1 and T2 are trained respectively.

The construction process of a decision tree comprises the following steps:

Step 6.2.1: a root node N is created.

Step 6.2.2: it is determined whether any training examples remain in the training example set; if not, node N is returned; if so, the next step is carried out.

Step 6.2.3: it is determined whether the scheduling decisions of the remaining training examples in the training example set all belong to the same class C; if so, node N is returned and marked as class C; otherwise, the next step is carried out.

Step 6.2.4: it is determined whether the attribute list is empty; if so, node N is marked with the class that occurs most often in the sample; otherwise, the next step is carried out.

Step 6.2.5: it is checked whether each attribute in the attribute list is continuous; for a continuous attribute, the split with the largest information gain G(D, A) is obtained by binary splitting. (A binary split divides all values of the attribute into two parts; for N values there are N-1 ways to split, and the split threshold is the average of the two adjacent values at the chosen split point. The information gain is calculated by formulas (2), (3), and (4).)

$$G(D, A) = H(D) - H(D \mid A) \tag{2}$$

$$H(D) = -\sum_{k=1}^{K} \frac{|C_k|}{|D|} \log_2 \frac{|C_k|}{|D|} \tag{3}$$

$$H(D \mid A) = \sum_{i=1}^{n} \frac{|D_i|}{|D|} H(D_i) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \sum_{k=1}^{K} \frac{|D_{ik}|}{|D_i|} \log_2 \frac{|D_{ik}|}{|D_i|} \tag{4}$$

In formula (2), G(D, A) represents the information gain of attribute A; in formula (3), H(D) represents the entropy of the class information; in formula (4), H(D|A) represents the conditional entropy. Furthermore, D represents the training example data set and |D| the number of training examples in D; D has K classes C_k, k = 1, 2, and |C_k| is the number of training examples in class C_k. Attribute A divides D into n subsets D_1, D_2, ..., D_n, and |D_i| is the number of training examples in D_i. The set of examples in D_i that belong to class C_k is D_ik, and |D_ik| is the number of training examples in D_ik.

Step 6.2.6: the attribute with the largest information gain ratio is selected to label node N, where the information gain ratio is calculated by formulas (5) and (6), and the process returns to step 6.2.2.

$$GR(D, A) = \frac{G(D, A)}{H(A)} \tag{5}$$

$$H(A) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \log_2 \frac{|D_i|}{|D|} \tag{6}$$

In formula (5), GR(D, A) represents the information gain ratio; H(A) represents the split information, given by formula (6); the other symbols have the same meanings as above.
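The quantities G(D, A), H(D), H(D|A), and GR(D, A) described above can be computed as in the following sketch for a continuous attribute with a binary split. It assumes the standard base-2 definitions of entropy and split information used by C4.5; the function names and the example data are hypothetical, and `best_threshold` assumes the attribute has at least two distinct values.

```python
import math
from collections import Counter


def entropy(labels):
    """H(D): entropy of the class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def split_stats(values, labels, threshold):
    """Return (G(D, A), GR(D, A)) for the split A <= threshold / A > threshold."""
    n = len(labels)
    parts = [
        [y for x, y in zip(values, labels) if x <= threshold],
        [y for x, y in zip(values, labels) if x > threshold],
    ]
    parts = [p for p in parts if p]                      # drop empty subsets
    cond = sum(len(p) / n * entropy(p) for p in parts)   # H(D|A)
    gain = entropy(labels) - cond                        # G(D, A)
    split_info = -sum(len(p) / n * math.log2(len(p) / n) for p in parts)  # H(A)
    ratio = gain / split_info if split_info > 0 else 0.0
    return gain, ratio


def best_threshold(values, labels):
    """Try the midpoints between adjacent distinct values; keep the best gain."""
    pts = sorted(set(values))
    mids = [(a + b) / 2 for a, b in zip(pts, pts[1:])]   # N-1 candidate splits
    return max(mids, key=lambda t: split_stats(values, labels, t)[0])


# Hypothetical attribute values and class labels (1 or 2)
values = [1.0, 2.0, 3.0, 4.0]
labels = [1, 1, 2, 2]
threshold = best_threshold(values, labels)
gain, ratio = split_stats(values, labels, threshold)
```

Here the threshold 2.5 separates the two classes perfectly, so both the information gain and the gain ratio equal 1.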

Step 6.3: the classification performance of the decision trees in T1 and T2 is tested using the training examples in d2 and d3, respectively, that were not selected.
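Sampling with replacement (step 6.1), random feature selection (step 6.2), and the unselected examples used for testing (step 6.3) can be sketched as follows. The source does not state the size of each resampled set; drawing as many examples as the cluster contains is the usual bagging convention and is an assumption here, as are the function names.

```python
import random


def bootstrap_sets(n_examples, k, seed=0):
    """Draw k samples with replacement and record the unselected indices.

    Returns a list of (in_bag, out_of_bag) index lists, one pair per tree.
    """
    rng = random.Random(seed)
    result = []
    for _ in range(k):
        in_bag = [rng.randrange(n_examples) for _ in range(n_examples)]
        out_of_bag = sorted(set(range(n_examples)) - set(in_bag))
        result.append((in_bag, out_of_bag))
    return result


def random_features(n_features, m, rng):
    """Pick m distinct feature indices for one decision tree."""
    return sorted(rng.sample(range(n_features), m))


samples = bootstrap_sets(n_examples=10, k=3)
features = random_features(n_features=6, m=2, rng=random.Random(1))
```

Each tree would be trained on its `in_bag` examples restricted to its selected features and then tested on its `out_of_bag` examples.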

Step 6.4: the similarity S between the decision trees in T1 or in T2 is calculated according to formula (7). If the similarity between two decision trees is greater than 60%, their test performance from step 6.3 is compared and the better one is retained; the retained decision trees form the random forest.

$$S(DT_1, DT_2) = \frac{K}{N_t} = \frac{1}{N_t} \sum_{n=1}^{N_t} I(r_{1n}.c,\ r_{2n}.c) \tag{7}$$

In formula (7), DT_1 and DT_2 represent the two decision trees whose similarity is calculated; K represents the number of test cases that DT_1 and DT_2 classify identically; r_1n and r_2n represent the n-th classification results of DT_1 and DT_2 respectively, and c denotes the classification result; when r_1n = r_2n, that is, when DT_1 and DT_2 give the same classification result for the same feature attributes, I(r_1n.c, r_2n.c) = 1, otherwise 0; Nt is the number of test cases.
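The similarity check of step 6.4 can be sketched as below, reading the similarity as the fraction of test cases on which two trees give the same class. The greedy order in which trees are compared is an assumption (the source only says that the better-performing tree of a similar pair is retained), and trees are represented here simply by their predictions on a shared set of test cases.

```python
def similarity(pred_a, pred_b):
    """Fraction of test cases on which two prediction lists agree (K / Nt)."""
    same = sum(1 for a, b in zip(pred_a, pred_b) if a == b)  # K
    return same / len(pred_a)


def prune_similar(trees, threshold=0.6):
    """Keep trees from best to worst, dropping any too similar to a kept one.

    `trees` is a list of (name, predictions, accuracy) tuples.
    """
    kept = []
    for tree in sorted(trees, key=lambda t: t[2], reverse=True):
        if all(similarity(tree[1], k[1]) <= threshold for k in kept):
            kept.append(tree)
    return kept


# Hypothetical test labels and per-tree predictions on the same test cases
truth = [1, 1, 2, 2, 1]
preds = {
    "t1": [1, 1, 2, 2, 2],
    "t2": [1, 1, 2, 1, 2],   # agrees with t1 on 4 of 5 cases
    "t3": [2, 1, 1, 2, 1],   # agrees with t1 on 2 of 5 cases
}
trees = [(name, p, similarity(p, truth)) for name, p in preds.items()]
forest = [name for name, _, _ in prune_similar(trees)]
```

In this example t2 is dropped because it is 80% similar to the more accurate t1, while t3 is kept.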

Step 6.5: the weights w and h of each decision tree in T1 and T2 are calculated through the Bayesian voting mechanism according to formulas (8) and (9), yielding random forest scheduling rule 1 and random forest scheduling rule 2.

In formulas (8) and (9), V represents the number of times the decision tree classified test cases correctly, and M represents the number of times it misclassified test cases.

Step 7.2.1: for the problem of selecting a machine for a workpiece, if m1 and m2 are two machines in M, the selection result of each decision tree in random forest scheduling rule 1, selected in step 7.1, is calculated; the result is either selection 1 or selection 2 (selection 1 means that m1 is more suitable, and selection 2 means that m2 is more suitable). For the problem of selecting a workpiece for an idle machine, if j1 and j2 are two workpieces in J, the selection result of each decision tree in random forest scheduling rule 2, selected in step 7.1, is calculated; the result is either selection 1 or selection 2 (selection 1 means that j1 is more suitable, and selection 2 means that j2 is more suitable).

Step 7.2.2: the weighted selection result WR of each decision tree is obtained through the Bayesian voting mechanism according to formula (10), and the average AWR of the weighted results is computed. If AWR is less than 1.5, the former (m1 or j1) is more suitable; if AWR is greater than 1.5, the latter (m2 or j2) is more suitable.

$$WR = wC + hR \tag{10}$$

In formula (10), C represents the classification result given by the decision tree; R represents the mean of the classification results given by all decision trees; w and h are calculated by formulas (8) and (9).
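The weighted vote of formula (10) and the pairwise selection over a candidate set can be sketched as follows. The weights w and h are taken as given inputs, since their formulas are not reproduced here; the handling of AWR exactly equal to 1.5 is not specified in the source and is an arbitrary choice in this sketch. The function names and example values are hypothetical.

```python
def pairwise_vote(votes):
    """Decide between the first (1) and second (2) object of a pair.

    `votes` is a list of (w, h, c) tuples, one per decision tree, where c is
    the tree's classification (1 or 2) and w, h are its voting weights.
    """
    r = sum(c for _, _, c in votes) / len(votes)   # R: mean classification
    wr = [w * c + h * r for w, h, c in votes]      # WR = wC + hR per tree
    awr = sum(wr) / len(wr)                        # AWR
    return 1 if awr < 1.5 else 2   # AWR == 1.5 is unspecified; treated as 2


def select_best(candidates, compare):
    """Pick one candidate by successive pairwise comparison.

    `compare(a, b)` returns 1 if a is more suitable and 2 if b is.
    """
    best = candidates[0]
    for challenger in candidates[1:]:
        if compare(best, challenger) == 2:
            best = challenger
    return best


# Three hypothetical trees: two vote for the first object, one for the second
decision = pairwise_vote([(0.8, 0.2, 1), (0.6, 0.4, 1), (0.7, 0.3, 2)])

# Stand-in comparison (prefers the smaller value) in place of a trained forest
chosen = select_best([5.0, 3.0, 4.0], lambda a, b: 1 if a <= b else 2)
```

In practice `compare` would evaluate every tree of the selected random forest scheduling rule on the pair's features and pass the resulting votes to `pairwise_vote`.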

Example: in a certain scheduling task, workpieces JT1, JT2, ..., JT8 are processed, 100 pieces each, i.e., 10 batches. The delivery dates of the workpieces are 20.0, 22.0, 14.0, 21.0, 19.0, 22.0, 18.0, and 23.0 processing time units respectively, and the processing time of each operation of each workpiece on each machine is given in Table 1. A machine failure occurs at time 4; after the first operation of JT1 is completed, a material shortage is found for its second operation; and at time 10, the processing times of the workpieces increase by 10% overall.

Table 1: Workpiece processing times

The scheduling scheme obtained by the flexible job shop dynamic scheduling method based on industrial big data is shown in FIG. 5, where the abscissa represents time and the ordinate represents the machine; in the numbers on the Gantt chart, the hundreds digit represents the workpiece type and the units digit represents the operation number. The maximum completion time of the final scheme is 21.8 processing time units, the total delay time is 5.3 processing time units, and the total machine load is 96.4 processing time units.

The method of this patent solves the flexible job shop dynamic scheduling problem. Because the scheduling rules mined by the method are used to guide the scheduling of the flexible job shop, the method has strong practical feasibility and high computational efficiency, does not require modeling the scheduling problem, and responds to shop disturbances in real time.

Claims

1. A flexible job shop dynamic scheduling method based on industrial big data comprises the following steps:

step one, data acquisition: collecting historical data related to scheduling from the existing information systems MES, ERP and SCADA by using data collection tools Sqoop and Flume under a Hadoop ecosystem, and storing the historical data in an HDFS file system; the collected data includes three parts D_hD1, d2, d3 }: d1 timing disturbance information of system related to disturbance for scheduling scheme; d2 is the state information of each machine in the set of machines which can process the working procedure at present when selecting the processing machine for the working procedure of the workpiece; d3 is the state information of each workpiece in the current queue when the idle machine needs to select the workpiece in the waiting queue for processing;

step two, data integration: using SQL statements through the data warehouse tool Hive, dividing the scheduling data set D_h in the HDFS file system with a scheduling scheme as the unit, namely grouping together the scheduling-related historical data generated in the execution of one scheduling scheme;

step three, data conversion: d2 and d3 parts in the integrated data are converted into a form of a training example by Spark, so that a data mining algorithm can conveniently mine the scheduling rules; the method specifically comprises the following steps:

step 3.1: for the d2 part of the collected scheduling data set D_h, regarding the machine m1 actually selected in a certain historical scheduling scheme as the most suitable machine, and comparing it one by one with the machines in the set of other candidate machines {m2, m3, ...} able to process the process, to form training examples;

step 3.2: for the d3 part of the collected scheduling data set D_h, regarding the workpiece j1 actually selected in a certain historical scheduling scheme as the most suitable workpiece, and comparing it one by one with the workpieces in the set of other workpieces {j2, j3, ...} waiting for processing, to form training examples;
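The pairwise conversion in steps 3.1 and 3.2 can be illustrated with a minimal sketch. The attribute names, the `a_`/`b_` prefixes and the function name are hypothetical. The label convention (1 = the former candidate is more suitable, 2 = the latter) follows step 7.2.1; emitting the mirrored pair so that both labels occur in the training set is an assumption, not something the claim states.

```python
from typing import Dict, List, Tuple

Features = Dict[str, float]
Example = Tuple[Features, int]


def pairwise_examples(chosen: Features, others: List[Features]) -> List[Example]:
    """Turn one historical decision into pairwise training examples.

    `chosen` holds the state attributes of the candidate actually selected
    (machine m1 or workpiece j1); `others` holds the remaining candidates.
    """
    examples: List[Example] = []
    for other in others:
        # Selected candidate placed first: the former is the suitable one.
        former = {f"a_{k}": v for k, v in chosen.items()}
        former.update({f"b_{k}": v for k, v in other.items()})
        examples.append((former, 1))
        # ASSUMPTION: also emit the mirrored pair so that label 2 occurs.
        latter = {f"a_{k}": v for k, v in other.items()}
        latter.update({f"b_{k}": v for k, v in chosen.items()})
        examples.append((latter, 2))
    return examples


# Hypothetical machine states taken from d2 for one decision.
m1 = {"processing_time": 3.0, "queue_length": 1.0}
m2 = {"processing_time": 4.0, "queue_length": 0.0}
m3 = {"processing_time": 5.0, "queue_length": 2.0}
for features, label in pairwise_examples(m1, [m2, m3]):
    print(label, features)
```

The same function applies to d3 by passing workpiece states instead of machine states.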

step four, data screening: evaluating the historical scheduling schemes on three indexes, namely maximum completion time, total delay time and total machine load, and screening to obtain the scheduling-related historical data set D_b generated in the execution of the well-performing scheduling schemes; specifically:

step 4.1: on the maximum completion time index, the maximum completion time of the scheduling scheme generated by using only the SPT rule under the same conditions is used as the screening standard; using only the SPT rule means that a workpiece selects the machine with the fastest processing and an idle machine selects the workpiece with the shortest processing time; a scheduling scheme whose maximum completion time meets the index enters step 4.2, and is otherwise eliminated;

step 4.2: on the total delay time index, the total delay time of completing the scheduling task with the combined SPT and EDD rules under the same conditions is used as the screening standard; the SPT + EDD rule means that a workpiece selects the machine with the fastest processing and an idle machine selects the workpiece with the earliest delivery date; a scheduling scheme whose total delay time meets the index enters step 4.3, and is otherwise eliminated;

step 4.3: on the total machine load index, the total machine load of completing the scheduling task with the combined LMWT and SPT rules under the same conditions is used as the screening standard; the LMWT + SPT rule means that a workpiece selects the machine with the longest idle time and an idle machine selects the workpiece with the shortest processing time; the scheduling data sets generated in the execution of the scheduling schemes that meet all three indexes are used as the input of the scheduling rule mining algorithm;
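The three-stage screening of steps 4.1 to 4.3 can be sketched as a chain of comparisons against rule-based baselines. The class and function names are hypothetical, the numbers are invented for illustration, and "meets the index" is assumed to mean "is no worse than the baseline value", which the claim does not spell out.

```python
from dataclasses import dataclass


@dataclass
class Indicators:
    makespan: float         # maximum completion time
    total_tardiness: float  # total delay time
    total_load: float       # total machine load


def passes_screening(scheme: Indicators, spt: Indicators,
                     spt_edd: Indicators, lmwt_spt: Indicators) -> bool:
    """Apply steps 4.1, 4.2 and 4.3 in order.

    Each baseline holds the indicators of the schedule produced by the
    named dispatching rule(s) under the same conditions as `scheme`.
    ASSUMPTION: a scheme meets an index when it is <= the baseline.
    """
    if scheme.makespan > spt.makespan:                   # step 4.1
        return False
    if scheme.total_tardiness > spt_edd.total_tardiness:  # step 4.2
        return False
    return scheme.total_load <= lmwt_spt.total_load      # step 4.3


# Hypothetical values, for illustration only.
historical = Indicators(makespan=20.0, total_tardiness=4.0, total_load=90.0)
spt = Indicators(makespan=22.0, total_tardiness=9.0, total_load=95.0)
spt_edd = Indicators(makespan=23.0, total_tardiness=6.0, total_load=97.0)
lmwt_spt = Indicators(makespan=24.0, total_tardiness=8.0, total_load=92.0)
print(passes_screening(historical, spt, spt_edd, lmwt_spt))
```

Only the data of schemes for which the function returns `True` would be placed in D_b.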

step five, clustering based on disturbance attributes: using DBSCAN, clustering D_b with a scheduling scheme as the unit, namely treating the data generated by one scheduling scheme as one object, according to the system disturbance attributes in D_b, namely the d1 part; specifically:

step 5.1: normalizing the d1 partial data; if the data of a certain disturbance attribute in each scheme are X1, X2, X3, ..., Xn, they are transformed as in formula (1);

$$Y_i = \frac{X_i - \bar{X}}{S}, \quad i = 1, 2, \ldots, n \tag{1}$$

in formula (1), $\bar{X}$ represents the mean value of the attribute; S represents the standard deviation; Y1, Y2, Y3, ..., Yn are the normalized data;

step 5.2: determining the parameters of the DBSCAN algorithm, namely the neighborhood radius Eps and MinPts, the minimum number of objects contained within the neighborhood radius of a core object;

step 5.3: randomly finding an unprocessed core object p, namely a core object p that has not been assigned to any cluster or marked as noise and whose neighborhood of radius Eps contains no fewer than MinPts objects; establishing a new cluster C, and adding all objects within the neighborhood radius Eps of p to a candidate set N;

step 5.4: randomly finding an unprocessed object q in the candidate set N; if q is a core object, adding to N the unprocessed objects within the neighborhood radius Eps of q that are not yet in N; if q does not belong to any cluster, adding q to C;

step 5.5: repeating step 5.4 until N is empty;
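Steps 5.1 to 5.5 can be sketched in plain Python as follows. Several details are assumptions rather than part of the claim: the population standard deviation is used for normalization, an object counts as a member of its own neighborhood, the outer loop repeats step 5.3 until no unprocessed core object remains, and objects never reached are labelled as noise. The function names and sample data are hypothetical.

```python
import math
import random
from statistics import mean, pstdev
from typing import List

NOISE = -1


def zscore_columns(rows: List[List[float]]) -> List[List[float]]:
    """Step 5.1: normalize each disturbance attribute (column)."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    stds = [pstdev(c) for c in columns]  # ASSUMPTION: population std
    return [
        [(x - mu) / sd if sd > 0 else 0.0 for x, mu, sd in zip(row, means, stds)]
        for row in rows
    ]


def dbscan(points: List[List[float]], eps: float, min_pts: int,
           seed: int = 0) -> List[int]:
    """Steps 5.2 to 5.5; returns one cluster id per object, NOISE for noise."""
    rng = random.Random(seed)
    n = len(points)
    # Neighborhood of radius eps; each object is its own neighbor.
    nbrs = [[j for j in range(n) if math.dist(points[i], points[j]) <= eps]
            for i in range(n)]
    is_core = [len(nb) >= min_pts for nb in nbrs]
    labels: List = [None] * n
    cluster_id = 0
    while True:
        seeds = [i for i in range(n) if labels[i] is None and is_core[i]]
        if not seeds:
            break
        p = rng.choice(seeds)                   # step 5.3
        candidates = set(nbrs[p])
        while candidates:                       # step 5.5
            q = rng.choice(sorted(candidates))  # step 5.4
            candidates.discard(q)
            if labels[q] is not None:
                continue
            labels[q] = cluster_id
            if is_core[q]:
                candidates.update(j for j in nbrs[q] if labels[j] is None)
        cluster_id += 1
    return [NOISE if label is None else label for label in labels]


# Hypothetical d1 vectors, one row per scheduling scheme.
d1 = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
      [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [10.0, 0.0]]
print(dbscan(zscore_columns(d1), eps=0.5, min_pts=3))
```

Because the starting core object is chosen at random, cluster ids may differ between seeds even though the grouping itself stays the same.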

step six, mining random forest scheduling rules: from each cluster obtained by clustering, using an improved random forest algorithm to mine a random forest scheduling rule 1 for solving the problem of a workpiece selecting a machine and a random forest scheduling rule 2 for solving the problem of an idle machine selecting a workpiece to process; specifically:

step 6.1: for each cluster, sampling training examples with replacement from d2 and d3 in the cluster to form k new training example sets P1 and P2 respectively, for constructing decision trees;

step 6.2: for P1 and P2, randomly selecting m feature attributes from d2 and d3 respectively, calculating the optimal split, and training k decision trees T1 and T2 respectively;

the construction process of the decision tree comprises the following steps:

step 6.2.1: creating a root node N;

step 6.2.2: judging whether the training example set has remaining training examples; if not, returning node N; if so, carrying out the next step;

step 6.2.3: judging whether the scheduling decisions of the remaining training examples in the training example set all belong to the same class C; if so, returning node N marked as class C; if not, carrying out the next step;

step 6.2.4: judging whether the production attribute list is empty; if so, marking node N with the majority class in the samples; otherwise, carrying out the next step;

step 6.2.5: checking whether each attribute in the attribute list is continuous; for a continuous attribute, the split with the maximum information gain G(D, A) is obtained by bisection: all values of the attribute are divided into two parts, for N attribute values there are N-1 ways of dividing, and the division threshold is the average of the two adjacent values at the selected split point; the information gain is calculated as in formulas (2), (3) and (4);

$$G(D, A) = H(D) - H(D \mid A) \tag{2}$$

$$H(D) = -\sum_{k=1}^{K} \frac{|C_k|}{|D|} \log_2 \frac{|C_k|}{|D|} \tag{3}$$

$$H(D \mid A) = \sum_{i=1}^{n} \frac{|D_i|}{|D|} H(D_i) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \sum_{k=1}^{K} \frac{|D_{ik}|}{|D_i|} \log_2 \frac{|D_{ik}|}{|D_i|} \tag{4}$$

in formula (2), G(D, A) represents the information gain of attribute A; in formula (3), H(D) represents the entropy of the class information; in formula (4), H(D|A) represents the conditional entropy; furthermore, D represents the training example data set and |D| the number of training examples in D; D has K classes C_k, k = 1, 2; |C_k| is the number of training examples in class C_k; attribute A divides D into n subsets D_1, D_2, ..., D_n, and |D_i| is the number of training examples in D_i; the set of examples in D_i belonging to class C_k is D_ik, and |D_ik| is the number of training examples in D_ik;

step 6.2.6: marking node N with the attribute having the largest information gain ratio, the information gain ratio being calculated as in formulas (5) and (6), and returning to step 6.2.2;

$$GR(D, A) = \frac{G(D, A)}{H(A)} \tag{5}$$

$$H(A) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \log_2 \frac{|D_i|}{|D|} \tag{6}$$

in formula (5), GR(D, A) represents the information gain ratio; H(A) represents the split information; the other symbols have the same meanings as above;
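The split selection of steps 6.2.5 and 6.2.6 can be sketched for a single continuous attribute as follows. The sketch assumes base-2 logarithms and the standard C4.5-style definitions of entropy, conditional entropy and split information; the function names and sample values are hypothetical. The threshold is chosen by information gain among the midpoints of adjacent sorted values, and the gain ratio of that split is returned so that attributes can be compared.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def entropy(labels: Sequence[int]) -> float:
    """Entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain_and_ratio(values: Sequence[float], labels: Sequence[int],
                   threshold: float) -> Tuple[float, float]:
    """Information gain and gain ratio of the binary split at `threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    h_cond = sum(len(part) / n * entropy(part) for part in (left, right) if part)
    gain = entropy(labels) - h_cond
    # Split information: entropy of the partition sizes themselves.
    split_info = entropy([0] * len(left) + [1] * len(right))
    ratio = gain / split_info if split_info > 0 else 0.0
    return gain, ratio


def best_threshold(values: Sequence[float],
                   labels: Sequence[int]) -> Tuple[float, float, float]:
    """Return (threshold, gain, gain_ratio) of the best bisection."""
    distinct = sorted(set(values))
    if len(distinct) < 2:
        raise ValueError("a split needs at least two distinct values")
    # N distinct values give N-1 candidate thresholds (adjacent midpoints).
    thresholds: List[float] = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    best = max(thresholds, key=lambda t: gain_and_ratio(values, labels, t)[0])
    gain, ratio = gain_and_ratio(values, labels, best)
    return best, gain, ratio


# Hypothetical attribute values with labels 1 (former) and 2 (latter).
print(best_threshold([1.0, 2.0, 3.0, 4.0], [1, 1, 2, 2]))
```

In a full tree builder this function would be called once per candidate attribute, and the attribute with the largest returned gain ratio would label the node.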

step 6.3: testing the classification performance of the decision trees in T1 and T2 respectively by using the training examples in d2 and d3 that were not selected;
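The sampling in steps 6.1 and 6.2 and the hold-out test in step 6.3 resemble bootstrap sampling with out-of-bag evaluation, which can be sketched as follows. The function names are hypothetical. Two details are assumptions: each bootstrap set has the same size as the original example set, and the "unselected" examples are taken per tree rather than across all trees.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def bootstrap_sets(examples: Sequence[T], k: int,
                   seed: int = 0) -> List[Tuple[List[T], List[T]]]:
    """Draw k bootstrap samples; return (in_bag, out_of_bag) for each."""
    rng = random.Random(seed)
    n = len(examples)
    result = []
    for _ in range(k):
        # Sampling with replacement (step 6.1).
        drawn = [rng.randrange(n) for _ in range(n)]
        chosen = set(drawn)
        in_bag = [examples[i] for i in drawn]
        # Examples never drawn are kept for testing (step 6.3).
        out_of_bag = [examples[i] for i in range(n) if i not in chosen]
        result.append((in_bag, out_of_bag))
    return result


def random_attributes(attributes: Sequence[str], m: int, seed: int = 0) -> List[str]:
    """Randomly pick m feature attributes without repetition (step 6.2)."""
    return random.Random(seed).sample(list(attributes), m)


# Hypothetical example ids and attribute names.
for in_bag, oob in bootstrap_sets(list(range(10)), k=3, seed=1):
    print(sorted(in_bag), oob)
print(random_attributes(["load", "queue_length", "processing_time", "idle_time"], m=2))
```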

step 6.4: calculating the similarity S between the decision trees in T1 or in T2, the calculation formula being formula (7); if the similarity between two decision trees is more than 60%, comparing their test performances from step 6.3 and keeping the better-performing decision tree, thereby forming a random forest;

in formula (7), DT_1 and DT_2 represent the two decision trees whose similarity is calculated; K represents the number of times DT_1 and DT_2 classify the test cases identically; r_1n and r_2n denote the n-th identical classification results of DT_1 and DT_2, and c denotes the classification result; when r_1n = r_2n, namely when DT_1 and DT_2 obtain the same classification result with the same feature attributes, I(r_1n.c, r_2n.c) = 1, otherwise 0; Nt is the number of test cases;
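The pruning in step 6.4 can be illustrated with a simplified sketch. It does not reproduce formula (7): similarity is assumed here to be the fraction of common test cases on which two trees give the same classification, and both trees are assumed to be evaluated on the same test cases. The greedy order (most accurate tree first) and all names are hypothetical choices.

```python
from typing import List, Sequence


def similarity(pred_a: Sequence[int], pred_b: Sequence[int]) -> float:
    """ASSUMED similarity: share of test cases classified identically."""
    agree = sum(1 for a, b in zip(pred_a, pred_b) if a == b)
    return agree / len(pred_a)


def accuracy(pred: Sequence[int], truth: Sequence[int]) -> float:
    return sum(1 for p, t in zip(pred, truth) if p == t) / len(truth)


def prune_similar(predictions: List[Sequence[int]], truth: Sequence[int],
                  limit: float = 0.6) -> List[int]:
    """Return indices of the trees kept in the forest.

    Trees are visited from most to least accurate; a tree is kept only if
    its similarity to every tree already kept does not exceed `limit`.
    """
    order = sorted(range(len(predictions)),
                   key=lambda i: accuracy(predictions[i], truth), reverse=True)
    kept: List[int] = []
    for i in order:
        if all(similarity(predictions[i], predictions[j]) <= limit for j in kept):
            kept.append(i)
    return sorted(kept)


# Hypothetical predictions of three trees on five test cases.
truth = [1, 2, 1, 2, 1]
trees = [[1, 2, 1, 2, 1], [1, 2, 1, 2, 2], [2, 1, 2, 2, 1]]
print(prune_similar(trees, truth))
```

In this example the second tree agrees with the first on four of five cases and is less accurate, so only the first and third trees remain.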

step 6.5: respectively calculating the weight w and h of each decision tree in T1 and T2 through a Bayesian voting mechanism, wherein the calculation formulas are as formulas (8) and (9), and obtaining a random forest scheduling rule 1 and a random forest scheduling rule 2;

step seven, use of the scheduling rules: guiding the dynamic scheduling of the flexible job shop with the mined random forest scheduling rules; specifically:

step 7.1: according to the problem to be solved, namely a workpiece selecting a machine or an idle machine selecting a workpiece to process, finding the random forest scheduling rule 1 or the random forest scheduling rule 2 corresponding to the cluster to which the current disturbance environment of the flexible job shop belongs;

step 7.2: according to the selected random forest scheduling rule, selecting the most suitable machine or workpiece from the candidate machine set M or the candidate workpiece set J by pairwise comparison;

step 7.2.1: for the problem of a workpiece selecting a machine, if m1 and m2 are two machines in M, calculating the selection result of each decision tree in the random forest scheduling rule 1 selected in step 7.1, the results comprising selection 1 and selection 2, where selection 1 represents that the former, m1, is suitable and selection 2 represents that the latter, m2, is suitable; for the problem of an idle machine selecting a workpiece, if j1 and j2 are two workpieces in J, calculating the selection result of each decision tree in the random forest scheduling rule 2 selected in step 7.1, the results comprising selection 1 and selection 2, where selection 1 represents that j1 is suitable and selection 2 represents that j2 is suitable;

step 7.2.2: obtaining the weighted selection result WR of each decision tree through the Bayesian voting mechanism, WR being calculated according to formula (10), and obtaining the average value AWR of the weighted results; if AWR is less than 1.5, the former, m1 or j1, is suitable, and if AWR is more than 1.5, the latter, m2 or j2, is suitable;

WR＝wC+hR (10)
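The pairwise selection of step 7.2 can be illustrated with a simplified sketch. It does not reproduce formulas (8) to (10): the tree weights are taken as given, AWR is assumed to be the weight-normalized average of the trees' selections (each 1 or 2), and the best candidate is found by a sequential tournament in which the current best is always the "former". A tie at exactly 1.5, which the claim does not cover, keeps the former. All names and data are hypothetical.

```python
from typing import Callable, Dict, List, Sequence

Candidate = Dict[str, float]
# A tree returns 1 if the former candidate is suitable, 2 if the latter is.
Tree = Callable[[Candidate, Candidate], int]


def average_weighted_result(votes: Sequence[int], weights: Sequence[float]) -> float:
    """ASSUMED form of AWR: weighted mean of the selections, in [1, 2]."""
    return sum(w * v for v, w in zip(votes, weights)) / sum(weights)


def select_best(candidates: Sequence[Candidate], trees: Sequence[Tree],
                weights: Sequence[float]) -> Candidate:
    """Pick the most suitable machine or workpiece by pairwise comparison."""
    best = candidates[0]
    for challenger in candidates[1:]:
        votes: List[int] = [tree(best, challenger) for tree in trees]
        # AWR > 1.5 means the latter (the challenger) is suitable.
        if average_weighted_result(votes, weights) > 1.5:
            best = challenger
    return best


# Two toy "trees" standing in for mined decision trees.
def prefers_fast(a: Candidate, b: Candidate) -> int:
    return 1 if a["processing_time"] <= b["processing_time"] else 2


def prefers_short_queue(a: Candidate, b: Candidate) -> int:
    return 1 if a["queue_length"] <= b["queue_length"] else 2


machines = [
    {"id": 1.0, "processing_time": 4.0, "queue_length": 0.0},
    {"id": 2.0, "processing_time": 3.0, "queue_length": 2.0},
    {"id": 3.0, "processing_time": 5.0, "queue_length": 1.0},
]
chosen = select_best(machines, [prefers_fast, prefers_short_queue], [0.7, 0.3])
print(chosen["id"])
```

With these weights the faster machine wins each comparison it takes part in, because the processing-time tree carries more than half of the total weight.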