CN116796964A - Method for solving job shop scheduling problem based on generative adversarial imitation learning - Google Patents

Method for solving job shop scheduling problem based on generative adversarial imitation learning

Info

Publication number
CN116796964A
CN116796964A (application CN202310678078.8A)
Authority
CN
China
Prior art keywords
network
job shop
track
shop scheduling
solving
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310678078.8A
Other languages
Chinese (zh)
Inventor
李�浩
胡志坤
康雁
陈亦敏
李信衍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yunnan University YNU
Original Assignee
Yunnan University YNU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yunnan University YNU filed Critical Yunnan University YNU
Priority to CN202310678078.8A priority Critical patent/CN116796964A/en
Publication of CN116796964A publication Critical patent/CN116796964A/en
Pending legal-status Critical Current

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application discloses a method for solving the job shop scheduling problem based on generative adversarial imitation learning, which comprises the following steps: S1, acquiring a data set; S2, formalizing the environment as a Markov decision process; S3, designing the neural network structure, including the policy network, value network and reward network; S4, constructing an expert trajectory set; S5, training the agent: the GAIL algorithm updates the reward network, the rewards in each trajectory are completed by the reward network, and the agent network is then updated with the completed trajectories by the PPO algorithm; S6, deploying the agent to generate a schedule in which the maximum completion time (makespan) of the workshop workpieces is as small as possible. The application designs the agent to be independent of the scale of the job shop scheduling problem, so that after training on small-scale job shop scheduling instances it can be tested directly on larger-scale instances.

Description

Method for solving job shop scheduling problem based on generative adversarial imitation learning
Technical Field
The application belongs to the technical field of job shop scheduling, and particularly relates to a method for solving the job shop scheduling problem based on generative adversarial imitation learning.
Background
The job shop scheduling problem (Job Shop Scheduling Problem, JSP) is a classical combinatorial optimization problem in computer science and operations research. It plays a vital role in modern manufacturing and is widely used in manufacturing processes such as semiconductor, automotive and textile manufacturing. In a JSP, a workpiece consists of a sequence of processes, each of which must be assigned to the machine specified for that process, so as to optimize one or more objectives such as the maximum completion time (makespan) or the maximum flow time.
Current solutions to this type of NP-hard combinatorial optimization problem can be broadly divided into two categories: exact and approximate. Exact methods, such as mathematical programming, search the entire solution space for an optimal solution, so solving large-scale scheduling problems in reasonable time is challenging. Because of this complexity, more and more approximation methods, including heuristics and meta-heuristics, have been developed to handle real-world problem instances; in general, approximation methods achieve a good balance between solving time and solution quality. Compared with exact mathematical programming, meta-heuristic methods can solve the problem in reasonable time, but they are not feasible in a real-time scheduling environment, because they may still suffer from unpredictably long computation times when the underlying algorithm requires a large number of iterations to obtain a satisfactory solution. Priority dispatching rules (Priority Dispatching Rule, PDR), as representative heuristics, are widely used in real-time scheduling systems; they generally have lower computational complexity and are easier to implement than mathematical programming and meta-heuristics, but efficient dispatching rules usually require domain expertise and repeated trials, and do not guarantee even local optimality.
Deep reinforcement learning (Deep Reinforcement Learning, DRL) may be an ideal technique for solving such problems. Owing to the strong generalization ability of neural networks, once the model has been trained on a large amount of data it can infer a satisfactory solution for unseen samples in a very short time at deployment. In addition, because of the NP-hard nature of most combinatorial optimization problems, optimal solutions are unavailable for instances of even moderately large scale, so training a model by ordinary supervised learning with optimal solutions as labels is not feasible. However, solutions are easy to rank by quality, and the scheduling process determines the processing order of processes step by step over time, which indicates that the problem is well suited to formalization as a Markov decision process (Markov Decision Process, MDP) and to solution by a DRL algorithm.
Among existing RL approaches to the JSP, patent CN202210255402.0 needs to initialize a separate network for each problem instance of a different scale, which is clearly disadvantageous for further application of the method to realistic scenarios; patent CN202210406935.4, although using a graph-form state representation and a graph neural network (Graph Neural Network, GNN) for feature extraction, still leaves a noticeable gap between the solutions found by the agent and those found by a planner.
Disclosure of Invention
The aim of the application is as follows: in view of the shortcomings of the prior art, an imitation learning technique is provided that imitates the decisions of a planner so as to improve the performance of the agent on the maximum completion time criterion of the job shop scheduling problem. Considering that the planner can obtain optimal solutions in reasonable time on small-scale problem instances, these are used to build an expert trajectory set, the Generative Adversarial Imitation Learning (GAIL) algorithm is used to learn the reward function automatically, and the agent is trained with the PPO (Proximal Policy Optimization) algorithm. The state of the environment is represented as graph data, and a graph neural network is used for feature extraction so that each process representation contains information about both the workpiece and the machine. Because network parameters are shared locally, the agent is independent of the problem scale and can be tested directly on larger-scale instances after training on small-scale job shop scheduling instances.
The technical scheme of the application is as follows:
the application discloses a method for solving the job shop scheduling problem based on generative adversarial imitation learning, characterized by comprising the following steps: S1, acquiring the processing machine and required processing time of each process in the production workshop to form a training data set; S2, formalizing the environment as a Markov decision process, designing the state space, the action space and the state transition function for the data set environment; S3, designing the neural network structure, including the policy network, the value network and the reward network; S4, constructing an expert trajectory set: a planner is used to solve the problem instances, the solving result guides the action selection of the agent during interaction, and after interacting with the environment a number of state-action trajectories, namely the expert trajectory set, are obtained; S5, training the agent: the GAIL algorithm updates the reward network, the rewards in each trajectory are completed by the reward network, and the agent network is then updated with the completed trajectories by the PPO algorithm; S6, deploying the agent: data consisting of the processing machine and required processing time of each process in the production workshop are acquired to initialize the environment, which interacts with the agent trained in the previous step to obtain the start processing times of all processes, and a workshop schedule is then generated so that the maximum completion time is as small as possible.
Further, the specific method for designing the state space in S2 is as follows:
a directed acyclic graph is used to represent the state of the job shop scheduling problem, where G = (V, E) denotes the directed acyclic graph, the node set V contains all the processes, and within each workpiece the processes are connected by directed edges in their processing order to form the edge set E;
for any process node O_ij ∈ V there is a node feature vector with two types of dynamic features: a completion indicator and a completion-time lower bound C_LB(O_ij). The completion indicator shows whether the current process has been completed, taking the value 1 when it has; C_LB(O_ij) is the lower bound of the completion time of the process: when O_ij has been completed, C_LB(O_ij) equals x_ij + p_ij, otherwise C_LB(O_ij) equals C_LB(O_{i,j-1}) + p_ij, where O_{i,j-1} is the preceding process of the same workpiece.
Further, the specific method for designing the action space in S2 is as follows: all processes must be handled in the order given by the process list of each workpiece, at each step of the scheduling procedure the feasible actions are the first unprocessed process of every workpiece, and the state carries an indicator vector of all infeasible processes.
Further, the specific method for designing the state transition function in S2 is as follows: the edge set and the node features of the graph are updated; when a process is accepted as an action, a directed edge is added from the preceding process on the same machine, the completion indicator of the process is set to 1, and C_LB is recalculated.
Further, the specific steps of constructing the expert trajectory set in S4 are as follows:
S41, the planner is used to solve the problem instance optimally, obtaining the start processing time of every process;
S42, from the planner's solution and the problem instance, a DAG is built according to the two kinds of constraints on workpieces and machines, and a topological sort of the graph yields an action trajectory;
S43, interacting with the environment according to the action of each step in the action trajectory, the state of each step is obtained, giving a state-action trajectory.
Further, the agent network in S5 is divided into three modules according to the PPO algorithm, namely a Feature-Extract module, an Actor-MLP module and a Critic-MLP module.
Further, the Feature-Extract module uses a graph neural network to extract features from the state, updating the node representations k times by aggregating information from neighbouring nodes.
For any process O_ij, performing this update twice yields its node representation, and all node representations are then averaged to obtain the graph representation h_G.
Further, the Actor-MLP module outputs an action: for a given process, its node representation and the graph representation are concatenated and fed into a multi-layer perceptron to obtain the logit for selecting that process;
according to the indicator vector of currently infeasible processes in the state, the corresponding logits are set to minus infinity, the complete logits are then used to parameterize a multinomial distribution, and the process to be selected is sampled from this distribution.
Further, the Critic-MLP needs to output the state value V, h G As the input of the multi-layer perceptron, the scalar value of the output is taken as the state value, and the calculation formula is as follows:
V = MLP(h_G).
Further, the reward network receives a state s and outputs a reward r, calculated as follows:
r = MLP(s).
further, the process of training the agent in S5 includes the following steps:
S51, the parameters of the agent network and the reward network are initialized, the environment is initialized with a problem instance, and the expert trajectory set is loaded;
S52, the agent interacts with the environment to obtain several complete state-action trajectories, namely the generated trajectories;
S53, the reward network completes the reward of each step in the generated trajectories;
S54, the PPO algorithm updates the agent network using the generated trajectories with completed rewards;
S55, expert trajectories of the same scale are sampled from the expert trajectory set, and the reward of each step is completed using the reward network;
S56, the GAIL algorithm updates the reward network using the generated trajectories and the expert trajectories;
S57, if the maximum number of iterations has been reached the procedure exits; otherwise it returns to step S52.
Compared with the prior art, the application has the following beneficial effects:
1. For the job shop scheduling problem, the application provides an imitation learning technique that imitates the planner's decisions to improve the performance of the agent. The model is trained with an expert trajectory set obtained from smaller-scale job shop scheduling instances, and the agent is designed to be independent of the problem scale, so after training on small-scale instances it can be tested directly on larger-scale instances; a suitable imitation learning algorithm is selected to train the agent and is adapted appropriately to the job shop scheduling problem.
Drawings
FIG. 1 is a schematic flow chart of a method for constructing expert trajectories according to the present application;
FIG. 2 is a schematic flow chart of a method for training an agent according to the present application;
FIG. 3 is a 3×3 problem instance dataset of the application;
FIG. 4 shows the experimental results of the application.
Detailed Description
It is noted that relational terms such as "first" and "second", and the like, are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The technical scheme of the application is further described in detail below with reference to the examples.
The application discloses a method for solving the job shop scheduling problem based on generative adversarial imitation learning, which comprises the following steps:
S1, acquiring the processing machine and required processing time of each process in the production workshop to form a training data set; S2, formalizing the environment as a Markov decision process, designing the state space, the action space and the state transition function for the data set environment; S3, designing the neural network structure, including the policy network, the value network and the reward network; S4, constructing an expert trajectory set: a planner is used to solve the problem instances, the solving result guides the action selection of the agent during interaction, and after interacting with the environment a number of state-action trajectories, namely the expert trajectory set, are obtained; S5, training the agent: the GAIL algorithm updates the reward network, the rewards in each trajectory are completed by the reward network, and the agent network is then updated with the completed trajectories by the PPO algorithm; S6, deploying the agent: data consisting of the processing machine and required processing time of each process in the production workshop are acquired to initialize the environment, which interacts with the agent trained in the previous step to obtain the start processing times of all processes, and a workshop schedule is then generated so that the maximum completion time is as small as possible.
JSP datasets typically contain problems of different sizes, and there may be multiple problem instances of the same size. A series of benchmark datasets for the job shop scheduling problem can be found on the OR-Library official website. A problem instance of size n×m describes n workpieces and m machines by two n×m matrices PM and MM: PM is the processing-time matrix, whose element in row i and column j is the processing time p_ij of process O_ij, and MM is the processing-machine matrix, whose element in row i and column j is the processing machine m_ij of process O_ij. In the 3×3 example shown in FIG. 3 there are 3 workpieces, each consisting of 3 processes; each row represents one workpiece and each position in a row represents one process, with the left matrix giving the processing times and the right matrix giving the processing machines.
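For illustration, such an instance can be stored as two matrices; the numbers below are hypothetical and do not reproduce the values of FIG. 3:

```python
import numpy as np

# Hypothetical 3x3 instance: 3 workpieces (rows) x 3 processes (columns).
# PM[i][j] = processing time p_ij of process O_ij (illustrative values only).
PM = np.array([[3, 2, 4],
               [2, 5, 1],
               [4, 3, 2]])

# MM[i][j] = index of the machine that must execute process O_ij.
MM = np.array([[0, 1, 2],
               [1, 2, 0],
               [2, 0, 1]])

n_jobs, n_machines = PM.shape  # 3 workpieces, 3 machines
```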
As a technical optimization scheme of the application, the JSP scheduling environment is formalized as a Markov decision process and a deep reinforcement learning (Deep Reinforcement Learning, DRL) algorithm is applied to solve it.
In this embodiment, reinforcement learning (Reinforcement Learning, RL) deals with sequential decision problems, and the Markov decision process (Markov Decision Process, MDP) is the standard formalization of such problems. The problem is typically formalized as a dynamic environment, while the decision maker that builds a solution during interaction with the environment is treated as the agent. Through the following formal design of the JSP scheduling environment, the solving process is converted into a process-sequencing problem, a form better suited to being solved with RL.
An MDP consists of a five-tuple (S, A, P, R, γ), whose elements are called the state space, the action space, the transition function, the reward function and the discount factor respectively. Although RL provides a powerful tool and a generic framework for sequential decision problems, applying it to practical problems runs into difficulties in reward function design. First, in some application domains the reward function is hard to define at all, for example when training a mobile robot to learn socially compliant obstacle avoidance in an environment crowded with pedestrians; second, the sensitivity of DRL algorithms to reward coefficients and reward scale can make it difficult to design a good reward function. Imitation learning is a family of methods that avoids reward function design by learning a policy directly from expert trajectories, which in certain scenarios are relatively easy to obtain even though the reward function is difficult to design, for example driving data from experienced drivers in autonomous driving tasks. Here an imitation learning method is used, so the design of the reward function is omitted and a scheduling policy is learned directly from expert trajectories. Therefore only the state space, the action space and the state transition function of the environment need to be designed.
State space design: a directed acyclic graph G = (V, E) is used to represent the state of the JSP, where the node set V contains all the processes and directed edges connecting consecutive processes of each workpiece form the edge set E. For any process node O_ij ∈ V there is a node feature vector with three dynamic features: a feasibility indicator, a completion indicator and a completion-time lower bound C_LB(O_ij). The feasibility indicator shows whether the process can currently be taken as an action (1 if it can, 0 otherwise); the completion indicator shows whether the process has been completed (1 if it has, 0 otherwise); C_LB(O_ij) is the lower bound of the completion time of the process. When O_ij has been completed, C_LB(O_ij) equals x_ij + p_ij, where x_ij is the start processing time of the process, namely the maximum of the completion time of the preceding process of the workpiece and the completion time of the preceding process on the machine; otherwise C_LB(O_ij) equals C_LB(O_{i,j-1}) + p_ij, where C_LB(O_{i,j-1}) is the completion-time lower bound of the preceding process of the same workpiece. In implementation the graph G is stored as two matrices: a |V|×3 node feature matrix X, each row of which holds the node features of one process, and a |V|×|V| adjacency matrix Y representing the edge connections; since the graph may contain few edges, Y is further stored as a sparse matrix to save space.
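A minimal sketch of how these node features could be computed (the function and variable names are illustrative assumptions, not the application's code):

```python
import numpy as np

def build_node_features(PM, MM, scheduled, start_time):
    """Compute the |V| x 3 node feature matrix described above.

    PM, MM     : n x m processing-time / machine matrices
    scheduled  : n x m boolean matrix, True if O_ij has been dispatched
    start_time : n x m matrix of start times for dispatched processes
    """
    n, m = PM.shape
    X = np.zeros((n * m, 3))
    for i in range(n):
        c_lb_prev = 0.0  # completion-time lower bound of O_{i,j-1}
        for j in range(m):
            k = i * m + j
            # feature 1: feasible = first undispatched process of the workpiece
            feasible = (not scheduled[i, j]) and (j == 0 or scheduled[i, j - 1])
            # feature 2: completion indicator
            done = scheduled[i, j]
            # feature 3: completion-time lower bound C_LB(O_ij)
            if done:
                c_lb = start_time[i, j] + PM[i, j]
            else:
                c_lb = c_lb_prev + PM[i, j]
            X[k] = (float(feasible), float(done), c_lb)
            c_lb_prev = c_lb
    return X
```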
Action space design: the action space consists of all processes. To satisfy the constraint that the processes of a workpiece must be handled in their given order, at each scheduling step the set of feasible actions is the first unprocessed process of every workpiece. Since the set of feasible processes changes at every step of the interaction, the state returned by the environment additionally contains an indicator of all feasible processes, carried on the node features.
State transition function design: since the state is graph-structured data, after an action is accepted the edge set and the node features must be updated to obtain the new state. As the agent keeps selecting processes during decision making, the order of the processes on each machine is determined, and the corresponding directed edges are added to reflect these precedence constraints, so that during node representation learning with the graph neural network each process can absorb information about both its neighbouring processes on the workpiece and its neighbouring processes on the machine. The completion indicator of the process corresponding to the action is then set to 1, and the feasibility indicators and C_LB values are recalculated.
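A simplified sketch of this transition, continuing the assumptions of the previous snippet (the environment attributes and helper names are hypothetical, not the application's code):

```python
def step(env, action):
    """Apply one dispatching action, where `action` is the flat index i*m + j of O_ij."""
    i, j = divmod(action, env.n_machines)
    machine = env.MM[i, j]
    # start time = max(completion of the previous process of the workpiece,
    #                  completion of the previous process on the machine)
    prev_job_end = env.end_time[i, j - 1] if j > 0 else 0
    start = max(prev_job_end, env.machine_free_at[machine])
    env.start_time[i, j] = start
    env.end_time[i, j] = start + env.PM[i, j]
    env.machine_free_at[machine] = env.end_time[i, j]
    # add a directed edge from the previous process dispatched on this machine, if any
    if env.last_on_machine[machine] is not None:
        env.edges.append((env.last_on_machine[machine], action))
    env.last_on_machine[machine] = action
    env.scheduled[i, j] = True
    # node features (feasibility, completion, C_LB) are then recomputed
    return build_node_features(env.PM, env.MM, env.scheduled, env.start_time)
```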
As a technical optimization scheme of the application, the PPO algorithm is used for updating the policy network.
In this embodiment, the PPO algorithm is a widely used reinforcement learning algorithm that contains both a policy network (Actor), which outputs actions directly, and a value network (Critic), which provides state values that influence the policy network's updates. The two networks usually share a feature extraction part, which reduces the network size and speeds up computation. According to the PPO algorithm, the whole agent network is therefore divided into three modules: a Feature-Extract module, an Actor-MLP module and a Critic-MLP module. In the Feature-Extract module, since the JSP environment represents states as directed acyclic graphs, a graph neural network is used for feature extraction, and the node representation of each process is updated k times by aggregating information from its neighbouring processes.
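As an illustrative reconstruction, a standard GIN-style aggregation consistent with this description (an assumption; the exact formula used may differ) is:

h_ij^(k) = MLP^(k)( (1 + ε^(k))·h_ij^(k-1) + Σ_{u ∈ N(O_ij)} h_u^(k-1) ),

where N(O_ij) is the set of neighbouring processes of O_ij in the graph.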
For any process O_ij, performing such an update twice yields its node representation h_ij; all node representations are then averaged to obtain the graph representation h_G, i.e. h_G = (1/|V|)·Σ_{O_ij ∈ V} h_ij.
an input operation is required in the Actor-MLP module, and for a certain process, the obtained node representation and graph representation are spliced and then input into a multi-layer perceptron (Multilayer Perceptron, MLP) to obtain a logic for selecting the process, wherein the calculation formula is as follows:
the corresponding logits are assigned to minus infinity according to the infeasible procedures indicated in the current state, then the complete logits are initialized into a polynomial distribution, and the procedures to be selected, namely actions, are obtained from the distribution sampling.
The Critic-MLP needs to output the state value V: h_G is used as the input of a multi-layer perceptron with parameters ψ, and the scalar output is taken as the state value, calculated as follows:
V = MLP(h_G).
As a technical optimization scheme of the application, the GAIL algorithm is used to update the reward network.
In general, the input of the GAIL reward network is a state-action pair. Considering that in the application scenario the agent is tested on job shop scheduling instances it has not seen, and even on larger-scale instances, the input of the reward network here is restricted to the environment state only; this makes the reward network more robust and also reduces the number of network parameters and speeds up inference. The reward r is calculated as follows:
r = MLP(s).
As shown in FIG. 1, as a technical optimization scheme of the application, the expert trajectory set is constructed as follows: a planner is used to solve the problem instances, the solving result is then used to guide the action selection of the agent during interaction, and after interacting with the environment a number of state-action trajectories, namely the expert trajectory set, are obtained. The specific steps are:
S41, the planner is used to solve the problem instance optimally, obtaining the start processing time of every process;
S42, from the planner's solution and the problem instance, a DAG is built according to the two kinds of constraints on workpieces and machines, and a topological sort of the graph yields an action trajectory;
S43, interacting with the environment according to the action of each step in the action trajectory, the state of each step is obtained, giving a state-action trajectory.
In this embodiment, the planner's solution must be obtained first. The corresponding mathematical programming form of the JSP is as follows:
min C_max
s.t. x_ij ≥ 0,  for i ∈ 1,...,n and j ∈ 1,...,m
x_ik ≥ x_ij + p_ij,  for i ∈ 1,...,n and j ∈ 1,...,m-1 and k = j+1
x_ij ≥ x_lj + p_lj or x_lj ≥ x_ij + p_ij,  for i,l ∈ 1,...,n and i ≠ l and j ∈ 1,...,m
C_max ≥ x_ij + p_ij,  for i ∈ 1,...,n and j ∈ 1,...,m
Calling the OR-Tools solver library and writing code according to the above formulation, an optimal solution and several good solutions for a small-scale problem instance can be obtained in acceptable time. From the solution the start time of each process, i.e. of each decision variable, is known; however, the agent's decisions are the processing order of the processes, so to construct the expert trajectory set the start times of all processes must be converted into a processing order. This conversion must respect two kinds of constraints: the processes of a workpiece must follow their given order, and the processes on the same machine must follow the order of their determined start times. According to these two kinds of constraints, whenever one process must be processed before another, a directed edge is added between them, forming a DAG. A topological sort of this graph gives a node traversal order, which is exactly the required process processing order, i.e. the action trajectory. Interacting with the environment according to the action of each step in the trajectory yields the state of each step, and thus a complete state-action trajectory. Moreover, one problem instance can yield several good solutions, and a DAG can admit more than one topological order, so several state-action trajectories are obtained from the traversal orders under the good solutions of each instance to build the expert trajectory set.
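As an illustration of step S41, a planner call of this kind could be sketched with the OR-Tools CP-SAT solver as below (an illustrative sketch of the standard constraint-programming formulation, not the application's actual solver code):

```python
from ortools.sat.python import cp_model

def solve_jsp(PM, MM):
    """PM/MM: n x m lists of processing times and machine indices; returns start times."""
    n, m = len(PM), len(PM[0])
    horizon = sum(sum(row) for row in PM)
    model = cp_model.CpModel()
    start, end, by_machine = {}, {}, {}
    for i in range(n):
        for j in range(m):
            s = model.NewIntVar(0, horizon, f'start_{i}_{j}')
            e = model.NewIntVar(0, horizon, f'end_{i}_{j}')
            iv = model.NewIntervalVar(s, PM[i][j], e, f'iv_{i}_{j}')
            start[i, j], end[i, j] = s, e
            by_machine.setdefault(MM[i][j], []).append(iv)
            if j > 0:                      # precedence within a workpiece
                model.Add(s >= end[i, j - 1])
    for ivs in by_machine.values():        # one process at a time per machine
        model.AddNoOverlap(ivs)
    makespan = model.NewIntVar(0, horizon, 'makespan')
    model.AddMaxEquality(makespan, [end[i, m - 1] for i in range(n)])
    model.Minimize(makespan)
    solver = cp_model.CpSolver()
    solver.Solve(model)
    return {(i, j): solver.Value(start[i, j]) for i in range(n) for j in range(m)}
```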
As shown in FIG. 2, as a technical optimization scheme of the application, the agent is trained as follows: the GAIL algorithm updates the reward network, the rewards in each trajectory are completed by the reward network, and the agent network is then updated with the completed trajectories by the PPO algorithm. The specific steps are as follows, and an illustrative sketch of the loop is given after step S57:
S51, the parameters of the agent network and the reward network are initialized, the environment is initialized with a problem instance, and the expert trajectory set is loaded;
S52, the agent interacts with the environment to obtain several complete state-action trajectories, namely the generated trajectories;
S53, the reward network completes the reward of each step in the generated trajectories;
S54, the PPO algorithm updates the agent network using the generated trajectories with completed rewards;
S55, expert trajectories of the same scale are sampled from the expert trajectory set, and the reward of each step is completed using the reward network;
S56, the GAIL algorithm updates the reward network using the generated trajectories and the expert trajectories;
S57, if the maximum number of iterations has been reached the procedure exits; otherwise it returns to step S52.
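An illustrative sketch of this outer loop (the callables collect_trajectories, ppo_update, gail_update and sample_expert are hypothetical placeholders for steps S52-S56, not the application's code):

```python
def train(agent, reward_net, env, expert_set, max_iters,
          collect_trajectories, ppo_update, gail_update, sample_expert):
    """Outer GAIL/PPO loop corresponding to steps S51-S57 (illustrative only)."""
    for it in range(max_iters):                        # S57: stop at max iterations
        gen_trajs = collect_trajectories(agent, env)   # S52: interact with the environment
        for traj in gen_trajs:                         # S53: reward net fills in rewards
            traj.rewards = [reward_net(s) for s in traj.states]
        ppo_update(agent, gen_trajs)                   # S54: PPO update of the agent
        exp_trajs = sample_expert(expert_set, len(gen_trajs))   # S55: same-scale experts
        for traj in exp_trajs:
            traj.rewards = [reward_net(s) for s in traj.states]
        gail_update(reward_net, gen_trajs, exp_trajs)  # S56: discriminator update
```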
The detailed calculation flow of the PPO algorithm in S54 is as follows:
S541, the returns are calculated from the rewards and the discount factor in the trajectory;
S542, the states in the trajectory are input into the Critic network to compute the state values;
S543, the advantages are calculated from the rewards and state values in the trajectory;
S544, the Actor network loss is calculated from the advantages in the trajectory;
S545, the Critic network loss is calculated from the returns in the trajectory;
S546, the entropy regularization loss is calculated;
S547, the three losses are combined with different coefficients into the total loss, which is then used to update the whole agent network (a condensed sketch of these steps is given below).
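A condensed PyTorch-style sketch of this flow (tensor names, coefficient values and helper structure are illustrative assumptions, not the application's code; the advantages are assumed to come from GAE as described below):

```python
import torch

def discounted_returns(rewards, gamma=0.99):
    """S541: compute returns G_t from the rewards and the discount factor."""
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return torch.tensor(list(reversed(out)))

def ppo_loss(new_logp, old_logp, values, returns, advantages, entropy,
             clip_eps=0.2, c_pi=1.0, c_v=0.5, c_h=0.01):
    """S544-S547: clipped surrogate, value loss, entropy term and weighted total."""
    ratio = torch.exp(new_logp - old_logp)                 # importance weights
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    actor_loss = -torch.min(ratio * advantages,            # S544: clipped surrogate
                            clipped * advantages).mean()
    critic_loss = (returns - values).pow(2).mean()         # S545: mean squared error
    entropy_loss = entropy.mean()                          # S546: entropy regularization
    # S547: combine with coefficients; minimizing this maximizes the PPO objective
    return c_pi * actor_loss + c_v * critic_loss - c_h * entropy_loss
```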
In this embodiment, generative adversarial imitation learning is an imitation learning algorithm that borrows the idea of GANs; compared with conventional imitation learning algorithms it is more computationally efficient and the learned policy is more robust.
In GAIL the input is the state-action pairs in the trajectories (in this application, as noted above, only the states); the generator is the policy network and the discriminator is the reward network, whose loss function takes the standard GAIL form.
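An illustrative reconstruction of this loss for a state-only discriminator D_ω (the exact formula may differ) is:

J_D(ω) = -E_{s~τ_E}[ log D_ω(s) ] - E_{s~τ_θ}[ log(1 - D_ω(s)) ],

where τ_E denotes the expert trajectories and τ_θ the trajectories generated by the current policy.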
the network parameters of the generator are updated by the PPO algorithm, which differs from the conventional RL flow in that rewards in the interaction track are not generated by the environment but by the discriminator. PPO improves sample efficiency by collecting data from multiple environments and updating the collected data multiple times simultaneously, as compared to other algorithms. The calculation formula of the strategy gradient, namely the Actor network loss function is as follows:
Here r_t(θ) = π_θ(a_t|s_t) / π_θold(a_t|s_t) is the importance weight, θ_old refers to the parameters before the current round of updates, ε is the clipping coefficient, and Â_t is an advantage estimate obtained with the Generalized Advantage Estimator (GAE). Over multiple updates, importance sampling makes it possible to use the trajectory distribution under the old policy π_θold to update the current policy parameters θ, while the clipping mechanism ensures that the policies before and after an update do not differ too much, so that importance sampling remains valid. GAE unifies the various ways of approximating the advantage and provides a general calculation that trades off between high variance and high bias for a given trajectory.
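An illustrative reconstruction of the GAE advantage in its standard form (the exact formula may differ) is:

Â_t = Σ_{l=0}^{T-t-1} (γλ)^l · δ_{t+l},  with  δ_t = r_t + γ·V_θ(s_{t+1}) - V_θ(s_t).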
Here T is the length of the trajectory, the hyper-parameter λ controls the trade-off between bias and variance, and V_θ is the state value function with parameters θ. Since V_θ(s) should approximate the expected return from state s, the objective can be achieved by constructing a mean squared error loss between the return G_t and V_θ(s_t); this is the Critic network loss function.
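An illustrative reconstruction in the standard mean-squared-error form (the exact formula may differ) is:

J_V(θ) = E_t[ (G_t - V_θ(s_t))² ].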
because the policy network and the value network share parameters, two weight coefficients c are needed π And c V To relate the objective functions of the two networks, ensuring that the gradient of the entire Actor-Critc network can be updated at one time during back propagation. Besides, an entropy regularization term related to strategies is added on the overall objective function, so that the final model performance is improved. The calculation formula of the entropy regularization loss and the total loss is as follows:
J(θ) = c_π J_π(θ) - c_V J_V(θ) + c_H J_H(θ)
where J_H is the expectation, over the time steps of the trajectory, of the entropy of the distribution output by the policy network, and c_H is the weight coefficient controlling this term.
As shown in FIG. 4, for ease of comparison the experiments use the same randomly generated datasets as the paper "ZHANG C, SONG W, CAO Z, et al. Learning to Dispatch for Job Shop Scheduling via Deep Reinforcement Learning [C]. Conference on Neural Information Processing Systems, 2020". The random datasets contain seven different scales: 6×6, 10×10, 15×15, 20×20, 30×20, 50×20 and 100×20, where 6×6 means 6 workpieces and 6 machines, and there are 100 problem instances at each scale.
Four well-known dispatching rules are chosen for comparison: Shortest Processing Time (SPT), Most Work Remaining (MWKR), minimum ratio of Flow Due Date to Most Work Remaining (FDD/MWKR) and Most Operations Remaining (MOPNR). The results of the method in the above paper are denoted RL, and the results of the method of the application are denoted Ours.
A common optimization criterion for the JSP is to minimize the maximum completion time (makespan) C_max = max_{i,j} C_ij, where C_ij = x_ij + p_ij denotes the completion time of process O_ij and x_ij is the start time of the process. The average maximum completion time C_max at each scale is shown in the following table, where the optimal results are indicated in bold.
Simulation conclusion: for the job shop scheduling problem, an imitation learning technique is provided that imitates the planner's decisions to improve the performance of the agent. The experimental results show that the method of the application performs best not only on job shop scheduling instances of the same size as used in training, but also on unseen instances of larger sizes, which clearly demonstrates the superiority of the agent trained by the method of the application and the effectiveness and generality of imitating the planner's decisions through imitation learning.
The job shop real-time scheduling system based on the method: the goal is not only a solution that works on small-scale job shop scheduling instances, but one that performs equally well on large-scale instances. The application trains the model with an expert trajectory set obtained from smaller-scale job shop scheduling instances and designs the agent to be independent of the problem scale, so that after training on smaller-scale problems it can be tested directly on larger-scale problems; a suitable imitation learning algorithm is selected to train the agent and is adapted appropriately to the job shop scheduling problem. Because testing uses larger-scale instances unseen during training, for which the environment's state transition function may change, it is more appropriate for the input of the reward network to contain only the state.
To obtain the minimum makespan for a production workshop scheduling problem, the environment is initialized with the data consisting of the processing machine and required processing time of each process in the production workshop; it interacts with the agent trained in step S5 to obtain the start processing times of all processes, which satisfy the condition that the maximum completion time is minimized.
The description of the specific embodiments is intended to be illustrative, and should not be taken as limiting the scope of the application. It should be noted that it is possible for a person skilled in the art to make several variants and modifications without departing from the technical idea of the application, which fall within the scope of protection of the application.

Claims (10)

1. A method for solving the job shop scheduling problem based on generative adversarial imitation learning, comprising the following steps:
S1, acquiring the processing machine and required processing time of each process in the production workshop to form a training data set;
S2, formalizing the environment as a Markov decision process, and designing the state space, the action space and the state transition function for the data set environment;
S3, designing the neural network structure, including the policy network, the value network and the reward network;
S4, constructing an expert trajectory set: a planner is used to solve the problem instances, the solving result guides the action selection of the agent during interaction, and after interacting with the environment a number of state-action trajectories, namely the expert trajectory set, are obtained;
S5, training the agent: the GAIL algorithm updates the reward network, the rewards in each trajectory are completed by the reward network, and the agent network is then updated with the completed trajectories by the PPO algorithm;
S6, deploying the agent: data consisting of the processing machine and required processing time of each process in the production workshop are acquired to initialize the environment, which interacts with the agent trained in step S5 to obtain the start processing times of all processes; a workshop processing schedule is then generated so that the maximum completion time is as small as possible.
2. The method for solving the job shop scheduling problem based on generative adversarial imitation learning according to claim 1, wherein the specific method of state space design in S2 is:
a directed acyclic graph is used to represent the state of the job shop scheduling problem, where G = (V, E) denotes the directed acyclic graph, the node set V contains all the processes, and within each workpiece the processes are connected by directed edges in their processing order to form the edge set E;
for any process node O_ij ∈ V there is a node feature vector with two types of dynamic features: a completion indicator and a completion-time lower bound C_LB(O_ij); the completion indicator shows whether the current process has been completed, taking the value 1 when it has; C_LB(O_ij) is the lower bound of the completion time of the process: when O_ij has been completed, C_LB(O_ij) equals x_ij + p_ij, otherwise C_LB(O_ij) equals C_LB(O_{i,j-1}) + p_ij, where O_{i,j-1} is the preceding process of the same workpiece.
3. The method for solving the job shop scheduling problem based on generative adversarial imitation learning according to claim 1, wherein the specific method of action space design in S2 is: all processes must be handled in the order given by the process list of each workpiece, at each step of the scheduling procedure the feasible actions are the first unprocessed process of every workpiece, and the state carries an indicator vector of all infeasible processes.
4. The method for solving the job shop scheduling problem based on generative adversarial imitation learning according to claim 1, wherein the specific method of state transition function design in S2 is: the edge set and the node features of the graph are updated; when a process is accepted as an action, a directed edge is added from the preceding process on the same machine, the completion indicator of the process is set to 1, and C_LB is recalculated.
5. The method for solving the job shop scheduling problem based on generative adversarial imitation learning according to claim 1, wherein the specific steps of constructing the expert trajectory set in S4 are:
S41, the planner is used to solve the problem instance optimally, obtaining the start processing time of every process;
S42, from the planner's solution and the problem instance, a DAG is built according to the two kinds of constraints on workpieces and machines, and a topological sort of the graph yields an action trajectory;
S43, interacting with the environment according to the action of each step in the action trajectory, the state of each step is obtained, giving a state-action trajectory.
6. The method for solving the job shop scheduling problem based on generative adversarial imitation learning according to claim 1, wherein the agent network in S5 is divided into three modules according to the PPO algorithm, namely a Feature-Extract module, an Actor-MLP module and a Critic-MLP module.
7. The method for solving the job shop scheduling problem based on generative adversarial imitation learning according to claim 6, wherein the Feature-Extract module uses a graph neural network to extract features from the state, updating the node representations k times by aggregating information from neighbouring nodes;
for any process O_ij, performing this update twice yields its node representation, and all node representations are then averaged to obtain the graph representation h_G.
8. The method for solving the job shop scheduling problem based on generative adversarial imitation learning according to claim 6, wherein the Actor-MLP module outputs an action: for a given process, its node representation and the graph representation are concatenated and fed into a multi-layer perceptron to obtain the logit for selecting that process;
according to the indicator vector of currently infeasible processes in the state, the corresponding logits are set to minus infinity, the complete logits are then used to parameterize a multinomial distribution, and the process to be selected is sampled from this distribution.
9. The method for solving the job shop scheduling problem based on generative adversarial imitation learning according to claim 6, wherein the Critic-MLP needs to output the state value V: h_G is taken as the input of a multi-layer perceptron and the scalar output is the state value, calculated as follows:
V = MLP(h_G).
10. The method for solving the job shop scheduling problem based on generative adversarial imitation learning according to claim 1, wherein the process of training the agent in S5 comprises the following steps:
S51, the parameters of the agent network and the reward network are initialized, the environment is initialized with a problem instance, and the expert trajectory set is loaded;
S52, the agent interacts with the environment to obtain several complete state-action trajectories, namely the generated trajectories;
S53, the reward network completes the reward of each step in the generated trajectories;
S54, the PPO algorithm updates the agent network using the generated trajectories with completed rewards;
S55, expert trajectories of the same scale are sampled from the expert trajectory set, and the reward of each step is completed using the reward network;
S56, the GAIL algorithm updates the reward network using the generated trajectories and the expert trajectories;
S57, if the maximum number of iterations has been reached the procedure exits; otherwise it returns to step S52.
CN202310678078.8A 2023-06-08 2023-06-08 Method for solving job shop scheduling problem based on generation countermeasure imitation study Pending CN116796964A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310678078.8A CN116796964A (en) 2023-06-08 2023-06-08 Method for solving job shop scheduling problem based on generation countermeasure imitation study

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310678078.8A CN116796964A (en) 2023-06-08 2023-06-08 Method for solving job shop scheduling problem based on generation countermeasure imitation study

Publications (1)

Publication Number Publication Date
CN116796964A true CN116796964A (en) 2023-09-22

Family

ID=88035687

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310678078.8A Pending CN116796964A (en) 2023-06-08 2023-06-08 Method for solving job shop scheduling problem based on generation countermeasure imitation study

Country Status (1)

Country Link
CN (1) CN116796964A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117422206A (en) * 2023-12-18 2024-01-19 中国科学技术大学 Method, equipment and storage medium for improving engineering problem decision and scheduling efficiency
CN117422206B (en) * 2023-12-18 2024-03-29 中国科学技术大学 Method, equipment and storage medium for improving engineering problem decision and scheduling efficiency


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination