CN114912826A - Flexible job shop scheduling method based on multilayer deep reinforcement learning - Google Patents

Flexible job shop scheduling method based on multilayer deep reinforcement learning Download PDF

Info

Publication number
CN114912826A
CN114912826A (Application CN202210603831.2A)
Authority
CN
China
Prior art keywords
graph
model
reinforcement learning
decision
deep reinforcement
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210603831.2A
Other languages
Chinese (zh)
Inventor
Li Xiaoxia (李小霞)
Zeng Zhengqi (曾正祺)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong Agricultural University
Original Assignee
Huazhong Agricultural University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong Agricultural University filed Critical Huazhong Agricultural University
Priority to CN202210603831.2A priority Critical patent/CN114912826A/en
Publication of CN114912826A publication Critical patent/CN114912826A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0631 Resource planning, allocation, distributing or scheduling for enterprises or organisations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/04 Manufacturing
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Strategic Management (AREA)
  • General Physics & Mathematics (AREA)
  • Economics (AREA)
  • General Business, Economics & Management (AREA)
  • Entrepreneurship & Innovation (AREA)
  • General Health & Medical Sciences (AREA)
  • Marketing (AREA)
  • Health & Medical Sciences (AREA)
  • Tourism & Hospitality (AREA)
  • Primary Health Care (AREA)
  • Computational Linguistics (AREA)
  • Development Economics (AREA)
  • Manufacturing & Machinery (AREA)
  • Operations Research (AREA)
  • Educational Administration (AREA)
  • Game Theory and Decision Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Quality & Reliability (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a flexible job shop scheduling method based on multilayer deep reinforcement learning, which comprises the following two parts. P1, deep reinforcement learning model part: the deep learning component adopts a graph neural network that takes the disjunctive graph as input and extracts graph features, so that an effective feature representation of the problem is obtained. The reinforcement learning component is based on a Markov decision model; a decision scheme for the flexible job shop scheduling problem is obtained through the repeated decision process of the model, and the objective is optimized by maximizing the reward value. P2, training algorithm part: the model is trained with an asynchronous advantage actor-critic algorithm; sample-collection tasks are distributed to multiple sub-threads, each of which makes decisions and generates samples independently, and each sub-thread decides several problems at the same time to generate multiple decision trajectories, so that uncorrelated high-quality samples are generated rapidly to optimize the model and the final model is obtained quickly.

Description

Flexible job shop scheduling method based on multilayer deep reinforcement learning
Technical Field
The invention relates to the field of combination optimization, in particular to a flexible job shop scheduling method based on multilayer deep reinforcement learning.
Background
The flexible job shop scheduling problem, in which the same workpiece may have multiple processing routes and the same process may be executed on any of several machines, is an important extension of the job shop scheduling problem and is regarded as NP-hard; this flexibility greatly increases the complexity of the problem. Finding the optimal solution of the flexible job shop scheduling problem in the shortest possible time is therefore of great significance in combinatorial optimization. At present, the main methods for solving the flexible job shop scheduling problem are scheduling rules and meta-heuristic algorithms. Scheduling rules prioritize processes and machines and can therefore obtain solutions quickly.
However, the scheduling results obtained with scheduling rules are usually far from optimal, and a fixed rule does not suit diverse processing environments. Compared with scheduling rules, meta-heuristic algorithms search for the optimal solution over many iterations and can obtain good results, but they require long computation times, have no generalization ability, and must be re-initialized and re-iterated whenever the problem changes. Machine learning has been applied to many fields as a new method and has achieved good results, so applying machine learning to the flexible job shop scheduling problem is a new research direction. Deep reinforcement learning is a branch of machine learning; after sufficient training, its model can be used directly for decision making, and the flexible job shop scheduling problem can likewise be expressed as a sequential decision problem. The design of the deep reinforcement learning model is therefore an important part of such a method.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a flexible job shop scheduling method based on multilayer deep reinforcement learning, aiming at the defects in the prior art. The flexible job shop scheduling problem is represented by a disjunctive graph, a graph neural network is used to extract features, the states, actions and rewards corresponding to the problem are designed to establish a Markov decision model, a hierarchical decision model is designed to divide the flexible job shop scheduling problem into the two sub-problems of process sequencing and machine selection, and the asynchronous advantage actor-critic algorithm is used to train the model quickly and effectively.
The technical scheme adopted by the invention for solving the technical problem is as follows:
the invention provides a flexible job shop scheduling method based on multilayer deep reinforcement learning, which is characterized in that a deep reinforcement learning model is established for a flexible shop scheduling problem, the deep reinforcement learning model is trained, the flexible shop scheduling problem is solved through the trained deep reinforcement learning model, and an optimal scheduling scheme is output; the method comprises the following two parts:
P1, deep reinforcement learning model part: the deep reinforcement learning model is used to make decisions for the flexible job shop scheduling problem; the problem is represented as a disjunctive graph, and solving it is treated as the process of orienting the disjunctive arcs. The deep learning component adopts a graph neural network that takes the disjunctive graph as input and extracts graph features, so that an effective feature representation of the problem is obtained. The reinforcement learning component is based on a Markov decision model in which the states, actions and rewards corresponding to the problem are designed, and the hierarchical decision model takes the corresponding action according to the state features. A decision scheme for the flexible job shop scheduling problem is obtained through the repeated decision process of the model, and the objective is optimized by maximizing the reward value;
P2, training algorithm part: the deep reinforcement learning model is trained with a multi-thread, multi-trajectory asynchronous advantage actor-critic algorithm; sample-collection tasks are distributed to multiple sub-threads, each of which makes decisions and generates samples independently, and each sub-thread decides several problems at the same time to generate multiple decision trajectories, so that uncorrelated high-quality samples are generated rapidly to optimize the model and the final model is obtained quickly. The trained model supports fast solving of the flexible job shop scheduling problem and generalizes to problems of different scales. The optimal scheduling scheme of the flexible job shop is output through the trained deep reinforcement learning model and handed to the flexible job shop for execution.
Further, a specific method for obtaining the characteristics of the disjunctive graph in the P1 deep reinforcement learning model part of the present invention is as follows:
step 1.1, obtaining the disjunctive Graph representation Graph from the flexible job shop scheduling problem;
step 1.2, determining the node information according to the disjunctive arcs in the disjunctive graph;
step 1.3, obtaining the Feature of the disjunctive graph by using the disjunctive graph as the input of the graph neural network.
Further, the disjunctive graph in step 1.1 of the present invention is defined as follows:
the disjunctive graph of the flexible job shop scheduling problem is described as a given graph G = (O, C, D), where O is the set of all process nodes o together with two virtual process nodes S and E, which represent the start and end of the schedule, respectively. C is the set of conjunctive arcs, C = {⟨v, w⟩ | v, w ∈ O}, where the two processes represented by v and w belong to the same workpiece; ⟨v, w⟩ ∈ C means there is a conjunctive arc from node v to node w, a one-way arc that guarantees the precedence constraint between processes of the same workpiece, i.e. s_tv < s_tw, where s_tv is the machining start time of the process represented by node v. D is the set of disjunctive arcs, D = {⟨v, w⟩ | v, w ∈ O}, where each disjunctive arc is a bidirectional arc indicating that the processes of the connected nodes v and w can be processed on the same machine. The final goal is to determine the directions of all disjunctive arcs while making the maximum completion time as short as possible. The number of processes of each workpiece in the flexible job shop scheduling problem may differ; when converting to the disjunctive graph, if the number of processes of a workpiece is smaller than the maximum number of processes, a "0" process node is appended at the end of that workpiece to keep the graph structure uniform; the processing time of the "0" process is not counted, and it can be processed on all machines.
Further, the method for calculating the node information in step 1.2 of the present invention specifically includes:
step 1.2.1, randomly selecting the execution time of each process on one of its executable machines as the estimated execution time of that process;
step 1.2.2, ignoring the not-yet-oriented disjunctive arc constraints, processing each process in order according to the conjunctive arc constraints and the already-oriented disjunctive arcs, and calculating the completion time of each process as its node information (a sketch of this estimate follows below).
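As a concrete illustration of the estimate in step 1.2.2, the sketch below picks a random executable machine for each not-yet-scheduled process and propagates finish times along the conjunctive arcs only; following already-oriented disjunctive arcs, as the patent also does, is omitted here for brevity, and all names are illustrative:

```python
import random

def estimate_completion_times(jobs, scheduled=None):
    """Estimate each process's completion time, ignoring unoriented disjunctive arcs.
    jobs: list of jobs, each a list of {machine: processing_time} dicts.
    scheduled: optional {(job, process): processing_time} for processes already fixed."""
    scheduled = scheduled or {}
    finish = {}
    for j, ops in enumerate(jobs):
        t = 0.0
        for i, machines in enumerate(ops):
            if (j, i) in scheduled:
                proc = scheduled[(j, i)]                         # actual processing time
            else:
                proc = machines[random.choice(list(machines))]   # estimated: random executable machine
            t += proc                                            # conjunctive-arc precedence only
            finish[(j, i)] = t
    return finish

# usage with the two-job instance from the previous sketch
jobs = [[{0: 3, 1: 5}, {1: 2, 2: 4}], [{0: 4, 2: 6}, {0: 2, 1: 3}, {2: 5}]]
print(estimate_completion_times(jobs))
```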
Further, the specific method for calculating the neural network characteristics of the graph in the step 1.3 of the present invention is as follows:
step 1.3.1, inputting the node information and the arc relations into the k-th layer of the graph neural network to calculate the node representations, with k = 1; the node representation calculation formula is as follows:
a graph isomorphism network structure is adopted, and K update iterations are executed to calculate a p-dimensional embedding of each node v ∈ V; the update at the k-th layer is expressed as
h_v^(k) = MLP^(k)( (1 + ε^(k)) · h_v^(k-1) + Σ_{u ∈ N(v)} h_u^(k-1) )
where h_v^(k) is the feature representation of node v at layer k, MLP is a multi-layer perceptron, ε^(k) is a learnable scalar, and N(v) is the set of all nodes connected to node v;
step 1.3.2, pooling the node representations to obtain the graph representation, using average pooling; k = k + 1;
step 1.3.3, executing step 1.3.1 and step 1.3.2 in a loop K times;
step 1.3.4, applying a linear transformation in the output layer to the final graph representation to obtain the output Feature.
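For reference, a minimal numpy sketch of the K-layer GIN-style update and average pooling described above; the two-layer MLP, the layer sizes and the adjacency-matrix representation are assumptions of this sketch, and pooling is applied only once after the final layer for brevity, whereas the steps above pool after every layer:

```python
import numpy as np

def mlp(x, w1, b1, w2, b2):
    return np.maximum(x @ w1 + b1, 0) @ w2 + b2       # two linear layers with a ReLU in between

def gin_graph_feature(h, adj, params, eps, K=3):
    """h: (n, p) node features; adj: (n, n) 0/1 adjacency; params: list of per-layer MLP weights."""
    for k in range(K):
        agg = (1.0 + eps[k]) * h + adj @ h            # (1 + eps) * h_v + sum over neighbours
        h = mlp(agg, *params[k])                      # layer-k node representations
    return h.mean(axis=0)                             # average pooling -> graph representation

# toy usage with random weights
rng = np.random.default_rng(0)
n, p = 5, 8
h = rng.normal(size=(n, p))
adj = (rng.random((n, n)) < 0.4).astype(float)
params = [(rng.normal(size=(p, p)) * 0.1, np.zeros(p),
           rng.normal(size=(p, p)) * 0.1, np.zeros(p)) for _ in range(3)]
feature = gin_graph_feature(h, adj, params, eps=np.zeros(3))
print(feature.shape)   # (8,)
```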
Further, the decision making process in the P1 deep reinforcement learning model part of the present invention is as follows:
step 2.1, calculating the selection probability of each process with the obtained Feature as the input of the decision model;
step 2.2, greedily selecting the process o with the maximum probability;
step 2.3, selecting the machine m most suitable for the selected process according to the scheduling rule;
step 2.4, taking the combination of the selected process o and machine m as the action (o, m) in the current state, executing the state transition to obtain a new state, updating the disjunctive graph, and saving the old state, the new state and the reward value as a sample;
step 2.5, repeatedly executing step 1.2 to step 2.4 until all processes have been selected;
step 2.6, obtaining the final decision scheme through the repeated decisions of the model; the reinforcement learning is based on a Markov decision model, defined as follows:
State: the corresponding graph structure is obtained from the input test set or training set; the machining processes of the workpieces are the nodes of the graph, and the machining precedence relations between processes are the arcs. The node information includes the completion time of each process; the arcs in the graph are directed arcs, and the arc information includes the machining order of the processes on a machine, i.e. for two process nodes connected by an arc, the second process (the one pointed to by the arc) is executed after the first process is completed in the decision scheme. The state also includes the basic problem information, including whether each workpiece can be processed on each machine and the corresponding processing time;
Action: one action is defined as determining a process o of a certain workpiece and allocating a machine m to it, represented as (o, m); the state is used as the input of the deep reinforcement learning model, its features are extracted through the graph neural network and then input into the decision model to obtain the process selection probability distribution, the process is selected greedily according to the obtained probabilities, and a suitable machine is selected for it by the scheduling rule;
State transition: the state is updated according to the selected action; the arc relations and the node information of the graph are updated according to the process and machine corresponding to the action, i.e. arcs in the directed graph are added or modified, and the completion times of the processes are updated to form the new state;
Reward: the difference between the maximum completion times (computed with the estimated processing times) of the schemes corresponding to the disjunctive graphs before and after one state transition is used as the immediate reward of the decision, and the immediate rewards of all decisions are summed to give the cumulative reward.
Further, the specific calculation method of the scheduling rule in step 2.3 of the present invention is as follows:
step 2.3.1, determining the set of executable machines S_m of the selected process o;
step 2.3.2, normalizing the time each machine in the set needs to process the selected process to obtain the value f1 as index 1;
step 2.3.3, normalizing the number of processes already processed on each machine in the set to obtain the value f2 as index 2;
step 2.3.4, adding index 1 and index 2 to obtain the final index (f1 + f2);
step 2.3.5, determining a machine from the set according to the final index; the selected machine is one with a short processing time and a small number of already processed processes, as illustrated in the sketch below.
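A compact sketch of this two-index rule; the exact normalization (dividing by the largest value in the set) and the tie-breaking are assumptions of the sketch, since the text only states that the processing time and the processed-process count are normalized and summed:

```python
def select_machine(proc_time, processed_count):
    """proc_time: {machine: time to process the selected process on that machine};
    processed_count: {machine: number of processes already processed on that machine};
    returns the machine with the smallest sum of the two normalized indices."""
    machines = list(proc_time)
    t_max = max(proc_time.values()) or 1.0
    n_max = max(processed_count.get(m, 0) for m in machines) or 1.0
    def final_index(m):
        f1 = proc_time[m] / t_max                   # index 1: normalized processing time
        f2 = processed_count.get(m, 0) / n_max      # index 2: normalized processed-process count
        return f1 + f2
    return min(machines, key=final_index)

# usage: the process can run on machines 0, 1 and 2; machine 1 is fast but heavily loaded
print(select_machine({0: 6, 1: 3, 2: 5}, {0: 1, 1: 4, 2: 2}))   # -> 0
```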
Further, the state transition process in the Markov decision model of the present invention is specifically as follows:
step 3.1, judging the processing feasibility of the selected process on the selected machine according to the state and the action;
step 3.2, determining the sequence of processes already processed on the selected machine according to the disjunctive graph;
step 3.3, determining the processing time of the selected process on the selected machine;
judging whether the selected process can be inserted into an idle period of the selected machine before its existing processes; if so, executing step 3.4, otherwise executing step 3.7;
step 3.4, calculating the earliest possible machining start time of the selected process and the idle periods of the selected machine, and determining the insertion position of the selected process in the processed process sequence;
step 3.5, modifying the arc relations of the disjunctive graph according to the insertion position, and deleting the other disjunctive arcs connected to the selected process node;
step 3.6, updating the node information and finishing the state transition;
step 3.7, determining the start time of the process and appending it to the end of the processed process sequence;
step 3.8, determining the direction of the disjunctive arc of the selected process on the selected machine, and deleting the other disjunctive arcs connected to the selected process node;
step 3.9, updating the node information and finishing the state transition.
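The insertion test that branches between step 3.4 and step 3.7 can be illustrated as follows; this is a simplified sketch using explicit interval lists, whereas in the model this bookkeeping is carried by the disjunctive graph itself:

```python
def try_insert(busy_intervals, earliest_start, proc_time):
    """busy_intervals: sorted list of (start, end) periods already occupied on the machine.
    Returns the start time of a feasible idle gap before the existing processes, or None."""
    prev_end = 0.0
    for start, end in busy_intervals:
        gap_start = max(prev_end, earliest_start)
        if gap_start + proc_time <= start:        # the process fits into this idle period
            return gap_start
        prev_end = end
    return None                                    # no gap: append after the last process instead

# usage: machine busy on [0, 3) and [6, 10); a 2-unit process ready at time 1
print(try_insert([(0, 3), (6, 10)], earliest_start=1, proc_time=2))   # -> 3
```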
Further, the calculation process of the immediate reward and the cumulative reward in the Markov decision model of the present invention is as follows:
step 4.1, calculating the maximum completion time T_s of the old state;
step 4.2, calculating the maximum completion time T_{s+1} of the new state;
step 4.3, calculating the immediate reward value T_s - T_{s+1}.
The cumulative reward is calculated as:
r_t = T_{s_t} - T_{s_{t+1}}
R = Σ_t r_t = T_{s_1} - T_{s_end}
where R is the cumulative reward value, T_{s_1} is the maximum completion time corresponding to the initial state, and T_{s_end} is the maximum completion time of the final scheme; since T_{s_1} is a fixed value determined by the problem information and does not change with the decisions, maximizing the cumulative reward is equivalent to minimizing the maximum completion time of the final scheme.
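In code, the telescoping of the immediate rewards into the cumulative reward is immediate (illustrative sketch):

```python
def immediate_reward(makespan_old, makespan_new):
    return makespan_old - makespan_new             # r_t = T_{s_t} - T_{s_{t+1}}

def cumulative_reward(makespans):
    """makespans: estimated maximum completion times of the states visited in one episode."""
    rewards = [immediate_reward(a, b) for a, b in zip(makespans, makespans[1:])]
    # telescoping sum: R = T_{s_1} - T_{s_end}
    assert abs(sum(rewards) - (makespans[0] - makespans[-1])) < 1e-9
    return sum(rewards)

print(cumulative_reward([100.0, 90.0, 85.0, 80.0]))   # -> 20.0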
Further, the training process in the P2 training algorithm part of the present invention specifically includes:
step 5.1, generating one main thread and T sub-threads, and initializing the training round counter Count = 0;
step 5.2, copying the parameters of the main thread model to each sub-thread model;
step 5.3, starting each sub-thread and initializing its number of training rounds;
step 5.4, each sub-thread generating U problems;
step 5.5, each sub-thread solving flexible job shop scheduling problems through the deep reinforcement learning model and generating samples;
step 5.6, after sample collection is completed, optimizing the main thread model parameters by gradient descent and setting Count = Count + 1; the update formulas are:
θ ← θ + α ∇_θ log π(a_t | s_t; θ) A(s_t, a_t)
θ_v ← θ_v - α_v ∇_{θ_v} (R_t - V(s_t; θ_v))^2
where π is the actor network, i.e. the graph neural network together with the decision network, θ is its parameter set, V is the critic network with parameters θ_v, and A(s_t, a_t) = R_t - V(s_t; θ_v) is the advantage function estimated from the critic output and the reward values;
step 5.7, judging whether the maximum number of training rounds T_c has been reached; if not, executing steps 5.4 to 5.6; if reached, finishing the training and saving the main thread model parameters.
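A condensed, runnable sketch of this multi-thread, multi-trajectory training scheme follows; the model is reduced to a bare parameter vector and the per-trajectory gradients are random placeholders, so only the thread structure is meant to mirror the steps above, and a real implementation would compute the actor and critic gradients given above from the collected samples:

```python
import threading
import numpy as np

class SharedModel:
    """Toy stand-in for the main-thread actor-critic parameters (illustrative only)."""
    def __init__(self, dim=8):
        self.params = np.zeros(dim)
        self.lock = threading.Lock()
        self.rounds = 0

def worker(shared, U, max_rounds, lr=0.01, seed=0):
    rng = np.random.default_rng(seed)
    while True:
        with shared.lock:
            if shared.rounds >= max_rounds:
                return
            local = shared.params.copy()          # copy the main-thread parameters
        # decide U problems at once -> U decision trajectories -> one placeholder gradient each
        grads = [rng.normal(size=local.shape) for _ in range(U)]
        grad = np.mean(grads, axis=0)             # stand-in for the actor/critic gradients above
        with shared.lock:                         # asynchronous update of the main-thread model
            if shared.rounds >= max_rounds:
                return
            shared.params -= lr * grad
            shared.rounds += 1

def train(T=4, U=4, max_rounds=100):
    shared = SharedModel()
    threads = [threading.Thread(target=worker, args=(shared, U, max_rounds, 0.01, i))
               for i in range(T)]
    for t in threads: t.start()
    for t in threads: t.join()
    return shared

print(train().rounds)   # -> 100
```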
The invention has the following beneficial effects:
1. Deep learning extracts the problem features, can take the internal structure of the problem into account, and adapts to changes in different manufacturing environments. The trained model obtains a good result in a short time and can be used to solve flexible job shop scheduling problems of different scales without retraining, so the method has strong generalization ability.
2. The flexible job shop scheduling problem is decomposed into the two sub-problems of process sequencing and machine selection, which are solved with a hierarchical structure; this reduces the complexity of the problem and the computation time, and the cooperation of the neural network model with a scheduling rule also reduces the structural complexity of the overall model.
3. The asynchronous advantage actor-critic algorithm is used for training; the multi-thread, multi-trajectory training method greatly shortens the training time, and the samples used for training come from the model's decision processes on different flexible job shop problems at the same moment, so the samples are uncorrelated and effective.
Drawings
The invention will be further described with reference to the accompanying drawings and examples, in which:
FIG. 1 is a schematic diagram of a flexible job shop scheduling method based on multi-layer deep reinforcement learning according to the present invention;
FIG. 2 is a disjunctive graph model of a 3 × 3 flexible shop scheduling problem;
FIG. 3 shows the flexible job shop scheduling benchmark instance MK01.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The framework of the invention is a training framework of a multi-thread and multi-track based hierarchical deep reinforcement learning model as shown in FIG. 1. Taking a training process as an example, the method comprises the following specific steps:
As shown in FIG. 1, the training method is based on the asynchronous advantage actor-critic algorithm; a multi-thread, multi-trajectory method is adopted to train the model. The main thread model is the model to be optimized, the model parameters of the sub-threads are copied from the main thread model, each sub-thread makes decisions on several problems at the same time to generate multiple decision trajectories, and each decision trajectory is obtained from the deep reinforcement learning model.
The specific implementation steps of the deep-learning feature extraction process of the deep reinforcement learning model in P1 are as follows:
step 1, generating a 20 × 10 flexible job shop scheduling problem according to the set scale, initializing it and recording the problem information, converting the problem into a disjunctive graph structure, calculating the node information to obtain the initial state, and setting the decision time t = 0;
when the processing order and the processing machine of the process corresponding to a node have not yet been determined, one machine is randomly selected from the machines that can process it, and the finish time of the process on the selected machine is used as the estimated processing time; once the processing order and the machine of a process have been determined, the actual processing time is used as the node information.
Step 2, inputting the disjunctive graph into the graph neural network to calculate the state Feature;
the specific implementation steps of the state feature calculation of the neural network of the graph described in P1 are as follows:
step 1, inputting the node information and the arc relations into the k-th layer of the graph neural network to calculate the node representations, with k = 1;
the node representations are calculated with the graph isomorphism network update given above:
h_v^(k) = MLP^(k)( (1 + ε^(k)) · h_v^(k-1) + Σ_{u ∈ N(v)} h_u^(k-1) )
step 2, pooling the node representations to obtain the graph representation, using average pooling; k = k + 1;
step 3, executing step 1 and step 2 in a loop K times, with K = 3;
step 4, obtaining the output state Feature from the graph representation through a multi-layer linear network transformation.
The state features are then passed through the hierarchical decision network to output an action; an action consists of a process and a machine, the process decision uses a linear neural network, and the machine is selected by the scheduling rule.
The specific implementation steps of the hierarchical decision network output action in P1 are as follows:
step 1, calculating the process selection probabilities from the obtained Feature, which is the input of the decision model;
step 2, greedily selecting the process o_t with the maximum probability;
step 3, selecting the most suitable machine m_t for the selected process according to the scheduling rule;
step 4, combining the selected process o_t and machine m_t into the action (o_t, m_t) in the current state;
In the above technical solution, the scheduling rule of the hierarchical decision network is calculated as follows:
step 1, determining the set of executable machines S_m of the selected process o;
step 2, normalizing the time P_om each machine in the set needs to process the selected process to obtain the value f1 as index 1;
step 3, normalizing the number N_m of processes already processed on each machine in the set to obtain the value f2 as index 2;
step 4, adding index 1 and index 2 to obtain the final index (f1 + f2);
step 5, determining the machine m_t from the set S_m according to the final index; the selected machine is one with a short processing time and a small number of already processed processes.
The sample collection procedure described in P2 is as follows:
step 1, executing the state transition to obtain a new state, and updating the disjunctive graph;
step 2, calculating the reward value, and saving the old state, the new state and the reward value as a sample, with t = t + 1;
step 3, judging whether the decision is finished; if t < 200, returning to step 2 of the feature extraction process; if t = 200, the decision ends;
and 4, obtaining a final scheme.
In the above technical solution, the specific steps of the state transition during sample collection are as follows:
step 1, judging the processing feasibility of the selected process on the selected machine according to the current state and the selected action;
step 2, determining the sequence M_sec of processes already processed on the selected machine according to the disjunctive graph;
step 3, determining the processing time P_om of the selected process on the selected machine and the earliest possible start time T_o of the selected process;
calculating the largest idle period MT of the machine after time T_o and judging whether the selected process can be inserted into an idle period of the selected machine before its existing processes; if MT > P_om, executing step 4; if MT < P_om, executing step 7;
step 4, determining the insertion position of the selected process in the processed process sequence M_sec according to the earliest possible machining start time of the selected process and the idle periods of the selected machine;
step 5, modifying the arc relations of the disjunctive graph according to the insertion position, and deleting the other disjunctive arcs connected to the selected process node;
step 6, updating the node information and finishing the state transition;
step 7, determining the idle time T_m of the machine, taking max(T_o, T_m) as the start time, and appending the process to the end of the processed process sequence M_sec;
step 8, determining the direction of the disjunctive arc of the selected process on the selected machine, and deleting the other disjunctive arcs connected to the selected process node;
step 9, updating the node information and finishing the state transition.
In the above technical solution, the calculation process of the immediate reward during sample collection is as follows:
step 1, calculating the maximum completion time T_s of the old state;
step 2, calculating the maximum completion time T_{s+1} of the new state;
step 3, calculating the reward value T_s - T_{s+1}.
The total training flow described in P2 is as follows:
step 1, generating one main thread and T sub-threads, with T = 4;
step 2, copying the parameters of the main thread model as the model parameters of each sub-thread;
step 3, each thread making decisions on U problems at the same time, with U = 4;
step 4, initializing the sub-thread, with the number of training rounds set to 0;
step 5, making decisions and collecting samples according to the feature extraction and sample collection flow;
step 6, optimizing the main thread model parameters with the gradient descent method according to the samples, with Count = Count + 1;
step 7, repeating steps 2 to 6 until the training counter Count reaches the maximum number of training rounds T_c, then ending the sub-thread, with T_c = 10000.
In the above technical solution, the optimization formulas of the gradient descent method are:
θ ← θ + α ∇_θ log π(a_t | s_t; θ) A(s_t, a_t)
θ_v ← θ_v - α_v ∇_{θ_v} (R_t - V(s_t; θ_v))^2
where π is the actor network, i.e. the graph neural network together with the decision network, θ is its parameter set, V is the critic network with parameters θ_v, and A(s_t, a_t) = R_t - V(s_t; θ_v) is the advantage function estimated from the critic output and the reward values.
the above technical solution describes the general framework of the present invention in a training process, and the following describes the proposed flexible workshop scheduling method based on hierarchical reinforcement learning by taking a solving process as an example, after model training is completed, through the process of solving MK 01. The method comprises the following specific steps:
step 1, loading model parameters;
step 2, converting MK01 into a disjunctive graph to obtain the initial state, with t = 0;
step 3, calculating the disjunctive graph through the graph neural network to obtain the state features;
step 4, inputting the state features into the decision network to obtain the process selection probabilities and select a process, determining the processing machine according to the scheduling rule, and combining the selected process and machine into an action;
step 5, executing the state transition to obtain a new state, with t = t + 1;
step 6, judging whether the decision is finished; if t < 60, returning to step 3; if t = 60, finishing the decision and outputting the solution.
The maximum completion time of the result obtained by the above solving process is 52, and the specific actions are selected as follows:
(0,5),(30,0),(1,1),(18,1),(24,5),(36,2),(48,0),(42,0),(54,5),(6,2),(12,5),(7,2),(8,0),(13,4),(25,5),(9,5),(26,5),(2,4),(3,0),(10,1),(11,0),(4,3),(27,2),(28,4),(5,2),(14,2),(15,2),(16,1),(17,4),(19,3),(20,1),(21,2),(22,4),(23,0),(49,3),(50,5),(51,2),(52,4),(53,5),(31,4),(32,3),(33,0),(34,2),(35,3),(37,1),(38,3),(39,3),(40,3),(41,0),(29,0),(43,5),(44,3),(45,0),(46,5),(47,5),(55,5),(56,5),(57,0),(58,2),(59,3)。
wherein the first value of each action is a selected process, the first process from the first workpiece is process 0, the second process from the first workpiece is process 1, and so on until the last process from the last workpiece is process 59; the second value is the processing machine selected for the process.
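Assuming MK01 contains 10 workpieces padded to 6 processes each (60 processes in total, consistent with the 60 decisions above), an action can be decoded as follows; the helper is purely illustrative and not part of the patent:

```python
def decode_action(action, ops_per_job=6):
    """Decode an (process index, machine) pair from the result list above.
    Assumes the processes are numbered job by job, with each job padded to ops_per_job."""
    process_index, machine = action
    workpiece, process_in_job = divmod(process_index, ops_per_job)
    return workpiece, process_in_job, machine

print(decode_action((30, 0)))   # -> (5, 0, 0): process 0 of workpiece 5 on machine 0
```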
According to this implementation case, the training algorithm in the technical scheme can train the model quickly and efficiently and obtain a model suitable for solving the flexible job shop scheduling problem; the hierarchical deep reinforcement learning model in the technical scheme can solve the flexible job shop scheduling problem quickly and obtain good optimization results; and the trained model can be used directly to solve flexible job shop scheduling problems of different scales and has good generalization performance.
It will be understood that modifications and variations can be made by persons skilled in the art in light of the above teachings and all such modifications and variations are intended to be included within the scope of the invention as defined in the appended claims.

Claims (10)

1. A flexible job shop scheduling method based on multilayer deep reinforcement learning is characterized in that for a flexible shop scheduling problem, a deep reinforcement learning model is established and trained, the flexible shop scheduling problem is solved through the trained deep reinforcement learning model, and an optimal scheduling scheme is output; the method comprises the following two parts:
P1, deep reinforcement learning model part: the deep reinforcement learning model is used to make decisions for the flexible job shop scheduling problem; the problem is represented as a disjunctive graph, and solving it is treated as the process of orienting the disjunctive arcs. The deep learning component adopts a graph neural network that takes the disjunctive graph as input and extracts graph features, so that an effective feature representation of the problem is obtained. The reinforcement learning component is based on a Markov decision model in which the states, actions and rewards corresponding to the problem are designed, and the hierarchical decision model takes the corresponding action according to the state features. A decision scheme for the flexible job shop scheduling problem is obtained through the repeated decision process of the model, and the objective is optimized by maximizing the reward value;
P2, training algorithm part: the deep reinforcement learning model is trained with a multi-thread, multi-trajectory asynchronous advantage actor-critic algorithm; sample-collection tasks are distributed to multiple sub-threads, each of which makes decisions and generates samples independently, and each sub-thread decides several problems at the same time to generate multiple decision trajectories, so that uncorrelated high-quality samples are generated rapidly to optimize the model and the final model is obtained quickly. The trained model supports fast solving of the flexible job shop scheduling problem and generalizes to problems of different scales. The optimal scheduling scheme of the flexible job shop is output through the trained deep reinforcement learning model and handed to the flexible job shop for execution.
2. The flexible job shop scheduling method based on multi-layer deep reinforcement learning according to claim 1, wherein the specific method for obtaining the disjunctive graph features in the P1 deep reinforcement learning model part is as follows:
step 1.1, obtaining the disjunctive Graph representation Graph from the flexible job shop scheduling problem;
step 1.2, determining the node information according to the disjunctive arcs in the disjunctive graph;
step 1.3, obtaining the Feature of the disjunctive graph by using the disjunctive graph as the input of the graph neural network.
3. The flexible job shop scheduling method based on multi-layer deep reinforcement learning according to claim 2, wherein the disjunctive graph in step 1.1 is defined as follows:
the disjunctive graph of the flexible job shop scheduling problem is described as a given graph G = (O, C, D), where O is the set of all process nodes o together with two virtual process nodes S and E, which represent the start and end of the schedule, respectively. C is the set of conjunctive arcs, C = {⟨v, w⟩ | v, w ∈ O}, where the two processes represented by v and w belong to the same workpiece; ⟨v, w⟩ ∈ C means there is a conjunctive arc from node v to node w, a one-way arc that guarantees the precedence constraint between processes of the same workpiece, i.e. s_tv < s_tw, where s_tv is the machining start time of the process represented by node v. D is the set of disjunctive arcs, D = {⟨v, w⟩ | v, w ∈ O}, where each disjunctive arc is a bidirectional arc indicating that the processes of the connected nodes v and w can be processed on the same machine. The final goal is to determine the directions of all disjunctive arcs while making the maximum completion time as short as possible. The number of processes of each workpiece in the flexible job shop scheduling problem may differ; when converting to the disjunctive graph, if the number of processes of a workpiece is smaller than the maximum number of processes, a "0" process node is appended at the end of that workpiece to keep the graph structure uniform; the processing time of the "0" process is not counted, and it can be processed on all machines.
4. The flexible job shop scheduling method based on multi-layer deep reinforcement learning according to claim 2, wherein the calculation method of the node information in the step 1.2 is specifically as follows:
step 1.2.1, randomly selecting the execution time of each process on one of its executable machines as the estimated execution time of that process;
step 1.2.2, ignoring the not-yet-oriented disjunctive arc constraints, processing each process in order according to the conjunctive arc constraints and the already-oriented disjunctive arcs, and calculating the completion time of each process as its node information.
5. The flexible job shop scheduling method based on multilayer deep reinforcement learning according to claim 2, wherein the specific method for calculating the neural network characteristics of the graph in the step 1.3 is as follows:
step 1.3.1, inputting the node information and the arc relations into the k-th layer of the graph neural network to calculate the node representations, with k = 1; the node representation calculation formula is as follows:
a graph isomorphism network structure is adopted, and K update iterations are executed to calculate a p-dimensional embedding of each node v ∈ V; the update at the k-th layer is expressed as
h_v^(k) = MLP^(k)( (1 + ε^(k)) · h_v^(k-1) + Σ_{u ∈ N(v)} h_u^(k-1) )
where h_v^(k) is the feature representation of node v at layer k, MLP is a multi-layer perceptron, ε^(k) is a learnable scalar, and N(v) is the set of all nodes connected to node v;
step 1.3.2, pooling the node representations to obtain the graph representation, using average pooling; k = k + 1;
step 1.3.3, executing step 1.3.1 and step 1.3.2 in a loop K times;
step 1.3.4, applying a linear transformation in the output layer to the final graph representation to obtain the output Feature.
6. The flexible job shop scheduling method based on multi-layer deep reinforcement learning according to claim 2, wherein the decision making process in the P1 deep reinforcement learning model part is as follows:
step 2.1, calculating the selection probability of each process with the obtained Feature as the input of the decision model;
step 2.2, greedily selecting the process o with the maximum probability;
step 2.3, selecting the machine m most suitable for the selected process according to the scheduling rule;
step 2.4, taking the combination of the selected process o and machine m as the action (o, m) in the current state, executing the state transition to obtain a new state, updating the disjunctive graph, and saving the old state, the new state and the reward value as a sample;
step 2.5, repeatedly executing step 1.2 to step 2.4 until all processes have been selected;
step 2.6, obtaining the final decision scheme through the repeated decisions of the model; the reinforcement learning is based on a Markov decision model, defined as follows:
state: the method comprises the steps of obtaining a corresponding graph structure through an input test set or a training set, using machining processes of workpieces as nodes of a graph, wherein the machining sequence relation of the processes is an arc, node information comprises the completion time of the processes, the arc in the graph is a directed arc, the arc information comprises the machining sequence of the processes on a machine, namely, two process nodes connected by the arc execute a second process pointed by the arc after a first process is completed in a decision scheme. The status also includes basic problem information, including whether each workpiece can be processed on different machines and the time corresponding to the processing;
and (4) Action: defining a primary action as a process o of determining a certain workpiece and allocating a machine m for the workpiece, wherein the process o is represented as (o, m), the state is used as the input of a deep reinforcement learning model, the characteristics of the process are extracted through a graph neural network and then input into a decision model to obtain process selection probability distribution, the workpiece is selected by the obtained probability greedy and the process is determined, and a proper machine is selected for the workpiece by a relevant scheduling rule;
and (3) state conversion: updating the state according to the selected action, updating the arc relation and the node information of the graph according to the working procedure and the machine corresponding to the action, namely adding or modifying the arc in the directed graph, and updating the completion time of the working procedure to be used as a new state;
reward: and taking the difference of the maximum completion time of the corresponding schemes of the analysis graphs before and after one state conversion as the timely reward of the decision, and summing the instant rewards of each decision as the accumulated reward according to the estimated processing time.
7. The flexible job shop scheduling method based on multi-layer deep reinforcement learning according to claim 6, wherein the specific calculation method of the scheduling rule in the step 2.3 is as follows:
step 2.3.1, determining the set of executable machines S_m of the selected process o;
step 2.3.2, normalizing the time each machine in the set needs to process the selected process to obtain the value f1 as index 1;
step 2.3.3, normalizing the number of processes already processed on each machine in the set to obtain the value f2 as index 2;
step 2.3.4, adding index 1 and index 2 to obtain the final index (f1 + f2);
step 2.3.5, determining a machine from the set according to the final index; the selected machine is one with a short processing time and a small number of already processed processes.
8. The flexible job shop scheduling method based on multi-layer deep reinforcement learning according to claim 6, wherein the state transition process in the Markov decision model is specifically:
step 3.1, judging the processing feasibility of the selected process on the selected machine according to the state and the action;
step 3.2, determining the sequence of processes already processed on the selected machine according to the disjunctive graph;
step 3.3, determining the processing time of the selected process on the selected machine;
judging whether the selected process can be inserted into an idle period of the selected machine before its existing processes; if so, executing step 3.4, otherwise executing step 3.7;
step 3.4, calculating the earliest possible machining start time of the selected process and the idle periods of the selected machine, and determining the insertion position of the selected process in the processed process sequence;
step 3.5, modifying the arc relations of the disjunctive graph according to the insertion position, and deleting the other disjunctive arcs connected to the selected process node;
step 3.6, updating the node information and finishing the state transition;
step 3.7, determining the start time of the process and appending it to the end of the processed process sequence;
step 3.8, determining the direction of the disjunctive arc of the selected process on the selected machine, and deleting the other disjunctive arcs connected to the selected process node;
step 3.9, updating the node information and finishing the state transition.
9. The flexible job shop scheduling method based on multi-layer deep reinforcement learning according to claim 6, wherein the computation process of the instantaneous reward and the cumulative reward in the Markov decision model is as follows:
step 4.1, calculating the maximum completion time T_s of the old state;
step 4.2, calculating the maximum completion time T_{s+1} of the new state;
step 4.3, calculating the immediate reward value T_s - T_{s+1}.
The cumulative reward is calculated as:
r_t = T_{s_t} - T_{s_{t+1}}
R = Σ_t r_t = T_{s_1} - T_{s_end}
where R is the cumulative reward value, T_{s_1} is the maximum completion time corresponding to the initial state, and T_{s_end} is the maximum completion time of the final scheme; since T_{s_1} is a fixed value determined by the problem information and does not change with the decisions, maximizing the cumulative reward is equivalent to minimizing the maximum completion time of the final scheme.
10. The flexible job shop scheduling method based on multi-layer deep reinforcement learning according to claim 1, wherein the training process in the P2 training algorithm part is specifically as follows:
step 5.1, generating one main thread and T sub-threads, and initializing the training round counter Count = 0;
step 5.2, copying the parameters of the main thread model to each sub-thread model;
step 5.3, starting each sub-thread and initializing its number of training rounds;
step 5.4, each sub-thread generating U problems;
step 5.5, each sub-thread solving flexible job shop scheduling problems through the deep reinforcement learning model and generating samples;
step 5.6, after sample collection is completed, optimizing the main thread model parameters by gradient descent and setting Count = Count + 1; the update formulas are:
θ ← θ + α ∇_θ log π(a_t | s_t; θ) A(s_t, a_t)
θ_v ← θ_v - α_v ∇_{θ_v} (R_t - V(s_t; θ_v))^2
where π is the actor network, i.e. the graph neural network together with the decision network, θ is its parameter set, V is the critic network with parameters θ_v, and A(s_t, a_t) = R_t - V(s_t; θ_v) is the advantage function estimated from the critic output and the reward values;
step 5.7, judging whether the maximum number of training rounds T_c has been reached; if not, executing steps 5.4 to 5.6; if reached, finishing the training and saving the main thread model parameters.
CN202210603831.2A 2022-05-30 2022-05-30 Flexible job shop scheduling method based on multilayer deep reinforcement learning Pending CN114912826A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210603831.2A CN114912826A (en) 2022-05-30 2022-05-30 Flexible job shop scheduling method based on multilayer deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210603831.2A CN114912826A (en) 2022-05-30 2022-05-30 Flexible job shop scheduling method based on multilayer deep reinforcement learning

Publications (1)

Publication Number Publication Date
CN114912826A true CN114912826A (en) 2022-08-16

Family

ID=82771105

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210603831.2A Pending CN114912826A (en) 2022-05-30 2022-05-30 Flexible job shop scheduling method based on multilayer deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN114912826A (en)



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200026264A1 (en) * 2018-02-07 2020-01-23 Jiangnan University Flexible job-shop scheduling method based on limited stable matching strategy
US20210081787A1 (en) * 2019-09-12 2021-03-18 Beijing University Of Posts And Telecommunications Method and apparatus for task scheduling based on deep reinforcement learning, and device
CN112631214A (en) * 2020-11-27 2021-04-09 西南交通大学 Flexible job shop batch scheduling method based on improved invasive weed optimization algorithm
CN113792924A (en) * 2021-09-16 2021-12-14 郑州轻工业大学 Single-piece job shop scheduling method based on Deep reinforcement learning of Deep Q-network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
MENG Binbin; WU Yan: "Research on Distributed Machine Learning Task Scheduling Algorithms for Cloud Computing", Journal of Xi'an University of Arts and Science (Natural Science Edition), no. 01, 15 January 2020 (2020-01-15) *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115293623A (en) * 2022-08-17 2022-11-04 海尔数字科技(青岛)有限公司 Training method and device for production scheduling model, electronic equipment and medium
CN116414093A (en) * 2023-04-13 2023-07-11 暨南大学 Workshop production method based on Internet of things system and reinforcement learning
CN116414093B (en) * 2023-04-13 2024-01-16 暨南大学 Workshop production method based on Internet of things system and reinforcement learning
CN116993028A (en) * 2023-09-27 2023-11-03 美云智数科技有限公司 Workshop scheduling method and device, storage medium and electronic equipment
CN116993028B (en) * 2023-09-27 2024-01-23 美云智数科技有限公司 Workshop scheduling method and device, storage medium and electronic equipment
CN117973635A (en) * 2024-03-28 2024-05-03 中科先进(深圳)集成技术有限公司 Decision prediction method, electronic device, and computer-readable storage medium
CN117973635B (en) * 2024-03-28 2024-06-07 中科先进(深圳)集成技术有限公司 Decision prediction method, electronic device, and computer-readable storage medium

Similar Documents

Publication Publication Date Title
CN114912826A (en) Flexible job shop scheduling method based on multilayer deep reinforcement learning
Shen et al. Mathematical modeling and multi-objective evolutionary algorithms applied to dynamic flexible job shop scheduling problems
CN112734172A (en) Hybrid flow shop scheduling method based on time sequence difference
CN110458326B (en) Mixed group intelligent optimization method for distributed blocking type pipeline scheduling
CN114707881A (en) Job shop adaptive scheduling method based on deep reinforcement learning
CN115454005A (en) Manufacturing workshop dynamic intelligent scheduling method and device oriented to limited transportation resource scene
CN105786610B (en) The method that computation-intensive task is unloaded into Cloud Server
CN115130789A (en) Distributed manufacturing intelligent scheduling method based on improved wolf optimization algorithm
CN114565247A (en) Workshop scheduling method, device and system based on deep reinforcement learning
CN116466659A (en) Distributed assembly flow shop scheduling method based on deep reinforcement learning
CN111353646A (en) Steel-making flexible scheduling optimization method with switching time, system, medium and equipment
CN111061565B (en) Two-section pipeline task scheduling method and system in Spark environment
CN116500986A (en) Method and system for generating priority scheduling rule of distributed job shop
CN115640898A (en) Large-scale flexible job shop scheduling method based on DDQN algorithm
CN117314055A (en) Intelligent manufacturing workshop production-transportation joint scheduling method based on reinforcement learning
CN116774657A (en) Dynamic scheduling method for remanufacturing workshop based on robust optimization
CN117057528A (en) Distributed job shop scheduling method based on end-to-end deep reinforcement learning
CN116562584A (en) Dynamic workshop scheduling method based on Conv-lasting and generalization characterization
CN110705844A (en) Robust optimization method of job shop scheduling scheme based on non-forced idle time
CN115034615A (en) Method for improving feature selection efficiency in genetic programming scheduling rule for job shop scheduling
CN115016405A (en) Process route multi-objective optimization method based on deep reinforcement learning
CN114219274A (en) Workshop scheduling method adapting to machine state based on deep reinforcement learning
CN117872999A (en) Method for solving scheduling problem of flexible job shop based on hybrid reinforcement learning
CN116500994B (en) Dynamic multi-target scheduling method for low-carbon distributed flexible job shop
CN117519030B (en) Distributed assembly blocking flow shop scheduling method based on hyper-heuristic reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination