CN111985672A - Single-piece job shop scheduling method for multi-Agent deep reinforcement learning - Google Patents

Single-piece job shop scheduling method for multi-Agent deep reinforcement learning Download PDF

Info

Publication number
CN111985672A
Authority
CN
China
Prior art keywords
action
job shop
probability
scheduling
state
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010380488.0A
Other languages
Chinese (zh)
Other versions
CN111985672B (en)
Inventor
张洁
赵树煊
汪俊亮
贺俊杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Donghua University
National Dong Hwa University
Original Assignee
Donghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Donghua University filed Critical Donghua University
Priority to CN202010380488.0A priority Critical patent/CN111985672B/en
Publication of CN111985672A publication Critical patent/CN111985672A/en
Application granted granted Critical
Publication of CN111985672B publication Critical patent/CN111985672B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0631 Resource planning, allocation, distributing or scheduling for enterprises or organisations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/067 Enterprise or organisation modelling
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/04 Manufacturing
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Physics & Mathematics (AREA)
  • Strategic Management (AREA)
  • Theoretical Computer Science (AREA)
  • Economics (AREA)
  • General Physics & Mathematics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • General Business, Economics & Management (AREA)
  • Tourism & Hospitality (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Health & Medical Sciences (AREA)
  • Game Theory and Decision Science (AREA)
  • Development Economics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Educational Administration (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Manufacturing & Machinery (AREA)
  • Primary Health Care (AREA)
  • General Factory Administration (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Multi-Process Working Machines And Systems (AREA)

Abstract

The invention provides a single-piece job-shop scheduling method based on multi-Agent deep reinforcement learning, addressing the complex constraints and large solution space of the single-piece job-shop scheduling problem and the inability of traditional mathematical programming and meta-heuristic algorithms to solve large-scale instances quickly. First, a communication mechanism among the agents is designed, and the job-shop scheduling problem is modeled as a multi-Agent reinforcement learning problem. Second, a deep neural network is constructed to extract the shop state, and an action-selection mechanism is designed on this basis, realizing the interaction between the workpieces being processed and the shop environment. Third, a reward function is designed to evaluate the whole scheduling decision, and the scheduling decision is updated with a policy gradient algorithm to obtain better scheduling results. Finally, the performance of the algorithm is evaluated and verified on a standard benchmark data set. The method solves the job-shop scheduling problem and enriches the methodology for this class of problems.

Description

Single-piece job shop scheduling method for multi-Agent deep reinforcement learning
Technical Field
The invention relates to the field of shop scheduling and addresses the single-piece job-shop scheduling problem, the most common scheduling problem in production.
Background
The manufacturing industry is a pillar industry of China. Modern manufacturing enterprises involve many production links and complex cooperation relationships, so reasonable production scheduling is of great significance for improving production efficiency, reducing costs and shortening the production cycle. The job-shop scheduling problem (JSP) is the most common shop scheduling problem; it reflects the mapping between manufacturing tasks and resource allocation under constraints such as shop materials and processes.
The JSP is difficult to solve and is a typical NP-hard problem. At present, the common methods for solving it are exact optimization methods and meta-heuristic algorithms. Optimization methods model and solve the job-shop scheduling problem with mathematical programming and, depending on the specific problem, can be formulated as integer programming, mixed-integer programming or dynamic programming. Meta-heuristic algorithms obtain a near-optimal solution in a short time through continuous iterative optimization and, according to the optimization strategy, include local search, tabu search, simulated annealing, genetic algorithms, particle swarm optimization, artificial neural networks and others.
In recent years, the rise of deep reinforcement learning has provided a new idea for solving the JSP. In 2017, Irwan used deep reinforcement learning to train a model with decision-making capability and successfully solved the travelling salesman problem (TSP), a classic NP-hard problem. Hanjun Dai used deep reinforcement learning to solve combinatorial optimization problems. In 2019, Xiao et al. used deep reinforcement learning to solve the flow-shop scheduling problem. At present, research on applying deep reinforcement learning to the JSP is still lacking in China.
Disclosure of Invention
The purpose of the invention is to realize distributed reinforcement learning modeling of the single-piece job-shop scheduling problem, scheduling decisions based on a neural network, and scheduling decision optimization based on the policy gradient algorithm.
In order to achieve this aim, the technical scheme of the invention provides a single-piece job-shop scheduling method based on multi-Agent deep reinforcement learning, characterized by comprising the following steps:
Step 1: perform distributed modeling of the job-shop scheduling environment with a multi-Agent method. In the multi-Agent reinforcement learning scheduling process, the global state S is factorized into the local states S_i of the m agents Ag_i, which are input into the multi-Agent reinforcement learning system in turn; the system outputs the action a_i currently executed by Ag_i, which changes the global state S and yields a reward R. This process is repeated until all Ag_i have completed their processing tasks. Here Ag_i corresponds to the i-th machine tool, i = 1, ..., m, where m is the total number of machine tools; S_i is the local state of Ag_i, S = {S_1, ..., S_i, ..., S_m}, and A_i is the local action set of Ag_i;
Step 2: construct a neural network model and extract the shop state.
The global state S is input into the neural network model, which outputs the probability P of each workpiece being selected for processing. When the neural network model outputs this probability, a probability function P = f(a, S_i, θ_i) oriented to the job-shop scheduling process is adopted; it denotes the probability P of executing action a in shop state S_i, where θ_i denotes the weights corresponding to the actions in state S_i and is used to force the selection probability of workpieces that cannot be processed yet or are already finished to zero:

$$P(a \mid S_i, \theta_i) = \frac{e^{\theta_a^{S_i}}}{\sum_{x \in S_i} e^{\theta_x^{S_i}}}$$

where θ_a^{S_i} denotes the weight corresponding to action a in state S_i, θ_x^{S_i} denotes the weight corresponding to action x in state S_i, and x ∈ S_i denotes all actions that can be executed in state S_i;
Step 3: select the workpiece to be processed according to the shop state extracted by the neural network model.
When an action is selected according to the probability P, the action-selection mechanism combines selecting the action with the maximum probability, a = argmax(P), with selecting an action according to the probability distribution, a = random(P), thereby adding uncertainty to the current optimal decision. The mechanism uses a manually set hyper-parameter c and a randomly generated number d ∈ (0, 1): when d > c, the workpiece with the maximum probability is selected for processing; when d < c, the workpiece is selected according to the probability distribution, that is:

$$a = \begin{cases} \arg\max(P), & d > c \\ \operatorname{random}(P), & d < c \end{cases}$$
Step 4: design a multi-Agent interaction mechanism of the job shop to realize the interaction between the workpieces being processed and the shop environment.
When Ag_i processes operation O_{a,b}, a ∈ A_i, then after O_{a,b} is completed, the local action set A_i of Ag_i becomes A_i := A_i − a, and the local action set of Ag_{i′} (i′ = γ(O_{a,b+1})) is expanded to A_{i′} := A_{i′} + a. The action transfer function σ_i is defined as:

$$\sigma_i(A_k) = \begin{cases} A_k - a, & k = i \\ A_k + a, & k = i' = \gamma(O_{a,b+1}) \\ A_k, & \text{otherwise} \end{cases}$$

where a denotes the workpiece corresponding to operation O_{a,b}, b denotes the machine tool corresponding to operation O_{a,b}, γ(O_{a,b}) denotes the processing time corresponding to operation O_{a,b}, and k ranges over all machine tools in the job-shop scheduling problem;
Step 5: design a reward function to evaluate the whole scheduling decision, and update the scheduling decision by updating the weight parameters of the neural network with a policy gradient algorithm.
Preferably, in step 1, the reward R is written R(S, a, S′), denoting the reward value obtained when executing action a in state S leads to state S′.
Preferably, in step 1, the local state S_i is represented by the local action set A_i, i.e. the workpieces waiting to be machined on the machine tool of Ag_i, together with their corresponding processing times; that is, S_i is the union of A_i and the set of processing times of the actions in A_i.
Preferably, in step 2, the neural network model is composed of an input layer, a hidden layer, and an output layer, where:
Input layer: converts the job-shop state S_i into vector form and outputs S_i to the first hidden layer h_1. From the input layer to the hidden layer h_1, the tanh activation function is adopted,

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$

W_1 and b_1 denote the weights and biases of the first hidden layer h_1, so that:

h_1 = tanh(W_1 S_i + b_1)

Hidden layers: the number of nodes of each hidden layer is set to 20, and no activation function is used from the last hidden layer to the output layer; thus:

h_N = tanh(W_N h_{N-1} + b_N)

where h_N denotes the N-th hidden layer, and W_N and b_N denote the weights and biases of the N-th hidden layer h_N;

Output layer: each node θ_a of the output layer corresponds to the probability that action a of Ag_i is selected, and the number of output-layer nodes is set to n.
Preferably, in step 5, the policy gradient algorithm updates the policy according to the return function J(θ) after the policy has been executed to completion:

$$J(\theta) = E_\pi\left[G_t\right], \qquad G_t = \sum_{t=0}^{T} \gamma^{t}\, r(t)$$

The function J(θ) denotes the weighted reward obtained when the final state S_f is reached after T steps; the weighting factor γ^t depends on the time step and the discount factor γ, G_t denotes the weighted reward accumulated over the T steps, and E_π[G_t] denotes its weighted average (expectation) under the policy. Owing to the delayed-reward characteristic of the JSP, r(t) is always 0 during the scheduling process; only when scheduling is completed does the JSP objective min(C_max) assign the reward value −C_max. Taking γ = 1 in the formula above, then:

$$G_t = -C_{\max}$$

$$J(\theta) = E_\pi\left[-C_{\max}\right]$$
Differentiating the return function J(θ) with respect to the action probability parameter θ_a gives the function gradient g_a:

$$g_a = \frac{\partial J(\theta)}{\partial \theta_a}$$

where ∂J(θ)/∂θ_a denotes the partial derivative with respect to the probability parameter θ_a of action a;
after the function gradient g_a is obtained, the action probability parameter θ_a of Ag_i is updated as:

θ_a := θ_a + μ_N g_a

where μ_N ∈ R denotes the update rate and N denotes the number of updates;
after the probability parameters θ_a have been updated, the Adadelta optimizer is invoked, using back-propagation, to update the neural-network weight parameters W, which completes the update of the whole policy.
The invention provides a single-piece job-shop scheduling method based on multi-Agent deep reinforcement learning, addressing the complex constraints and large solution space of the single-piece job-shop scheduling problem and the inability of traditional mathematical programming and meta-heuristic algorithms to solve large-scale instances quickly. First, a communication mechanism among the agents is designed, and the job-shop scheduling problem is modeled as a multi-Agent reinforcement learning problem. Second, a deep neural network is constructed to extract the shop state, and an action-selection mechanism is designed on this basis, realizing the interaction between the workpieces being processed and the shop environment. Third, a reward function is designed to evaluate the whole scheduling decision, and the scheduling decision is updated with a policy gradient algorithm to obtain better scheduling results. Finally, the performance of the algorithm is evaluated and verified on a standard benchmark data set. The method solves the single-piece job-shop scheduling problem and enriches the methodology for this class of problems.
Drawings
FIG. 1 is a multi-Agent reinforcement learning model;
FIG. 2 is a deep neural network model structure;
FIG. 3 is a Policy Gradient flow chart;
FIG. 4 is a graph of FT06 objective function;
FIG. 5 is an optimal solution to the FT06 problem;
FIG. 6 is a flow chart of the present invention.
Detailed Description
The invention will be further illustrated with reference to the following specific examples. It should be understood that these examples are for illustrative purposes only and are not intended to limit the scope of the present invention. Further, it should be understood that various changes or modifications of the present invention may be made by those skilled in the art after reading the teaching of the present invention, and such equivalents may fall within the scope of the present invention as defined in the appended claims.
The invention provides a single job shop scheduling method for multi-Agent deep reinforcement learning, which comprises the following steps:
Step 1: perform distributed modeling of the single-piece job-shop scheduling environment with a multi-Agent method.
As shown in fig. 1, the multi-Agent reinforcement learning model of the present invention includes the following contents:
Ag = {Ag_1, …, Ag_i, …, Ag_m}, where Ag_i is the i-th machine tool, i = 1, …, m.
S = {S_1, …, S_i, …, S_m} is the global state composed of the local states S_i of all Ag_i.
A = {A_1, …, A_i, …, A_m} is the global action set composed of the local action sets A_i of all Ag_i; a_i denotes the action performed by Ag_i at the current time.
P is the state transition probability matrix; the transition function P(S′ | S, a) denotes the probability that executing action a in state S leads to state S′.
R is the reward function; R(S, a, S′) denotes the reward value obtained when executing action a in state S leads to state S′.
γ is the discount factor, γ ∈ [0, 1].
In the multi-Agent reinforcement learning job-shop scheduling process, the global state S is factorized into the local states S_i of the m agents Ag_i, which are input into the multi-Agent reinforcement learning system in turn; the system outputs the action a_i currently executed by Ag_i, which changes the global state S and yields a reward R. This process is repeated until all Ag_i have completed their processing tasks. The local state S_i is represented by the local action set A_i, i.e. the workpieces waiting to be machined on the machine tool of Ag_i, together with their corresponding processing times; that is, S_i is the union of A_i and the set of processing times of the actions in A_i.
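The following minimal Python sketch illustrates one way to hold the factorized states described above; it is an illustration only, and the class and field names are assumptions rather than part of the patent. Each agent owns a machine, a local action set of waiting workpieces, and the corresponding processing times.

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    """Ag_i: one machine tool with its local action set A_i and processing times."""
    machine_id: int
    actions: set = field(default_factory=set)       # workpieces currently waiting on this machine
    proc_time: dict = field(default_factory=dict)   # workpiece -> processing time of its next operation

    def local_state(self):
        """S_i: the waiting workpieces together with their processing times."""
        return {job: self.proc_time[job] for job in self.actions}


class JobShopEnv:
    """Global state S is the collection of all local states S_1 .. S_m."""
    def __init__(self, num_machines):
        self.agents = [Agent(i) for i in range(num_machines)]

    def global_state(self):
        return [ag.local_state() for ag in self.agents]
```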
Step 2: construct a neural network model and extract the shop state.
In this embodiment, the neural network model is composed of an input layer, a hidden layer, and an output layer.
Input layer: the number of nodes of the input layer is set to 10; the layer converts the job-shop state S_i into vector form and outputs S_i to the first hidden layer h_1. From the input layer to the hidden layer h_1, the tanh activation function is adopted,

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$

W_1 and b_1 denote the weights and biases of the first hidden layer h_1, so that:

h_1 = tanh(W_1 S_i + b_1)

Hidden layers: the number of nodes of each hidden layer is set to 20, and no activation function is used from the last hidden layer to the output layer; thus:

h_N = tanh(W_N h_{N-1} + b_N)

where h_N denotes the N-th hidden layer, and W_N and b_N denote the weights and biases of the N-th hidden layer h_N.

Output layer: each node θ_a of the output layer corresponds to the probability that action a of Ag_i is selected, and the number of output-layer nodes is set to n. Here θ_a^{S_i} denotes the weight corresponding to action a in state S_i, θ_x^{S_i} denotes the weight corresponding to action x in state S_i, and x ∈ S_i denotes all actions that can be executed in state S_i.
In this embodiment, the global state S is input into the neural network model, and the model outputs the probability P of each workpiece being selected for processing; the tanh activation function given above is used between the input layer and the hidden layer. When the neural network model outputs the probabilities, workpieces that are already finished and workpieces that cannot currently be machined must not be selected during job-shop scheduling, and the probabilities must sum to one:

$$\sum_{a \in S_i} P(a \mid S_i, \theta_i) = 1$$
To this end, the invention designs, on the basis of the Softmax function, a probability function P = f(a, S_i, θ_i) oriented to the job-shop scheduling process; it denotes the probability P of executing action a in shop state S_i, where θ_i denotes the weights corresponding to the actions in state S_i and is used to make the selection probability of workpieces that cannot be processed yet or are already finished zero:

$$P(a \mid S_i, \theta_i) = \frac{e^{\theta_a^{S_i}}}{\sum_{x \in S_i} e^{\theta_x^{S_i}}}$$

where θ_a^{S_i} denotes the weight corresponding to action a in state S_i, θ_x^{S_i} denotes the weight corresponding to action x in state S_i, and x ∈ S_i denotes all actions that can be executed in state S_i.
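As a rough sketch of this state-extraction network and the masked Softmax output, the snippet below follows the layer sizes of the embodiment (10 input nodes, 20 hidden nodes); the NumPy implementation, the function names and the masking approach are illustrative assumptions, not the patented implementation.

```python
import numpy as np

def tanh_layer(x, W, b):
    # h = tanh(W x + b), used from the input layer through the hidden layers
    return np.tanh(W @ x + b)

def masked_softmax(theta, available):
    """P(a | S_i, theta_i): Softmax over the action weights theta, with the probability
    of workpieces that are finished or not yet available forced to zero."""
    mask = np.zeros_like(theta, dtype=bool)
    mask[list(available)] = True
    exp = np.exp(theta - theta[mask].max())   # shift for numerical stability
    exp[~mask] = 0.0                          # zero probability for non-selectable workpieces
    return exp / exp.sum()

def forward(state_vec, W1, b1, W2, b2, available):
    h1 = tanh_layer(state_vec, W1, b1)        # input (10) -> hidden (20), tanh activation
    theta = W2 @ h1 + b2                      # hidden (20) -> output (n), no activation
    return masked_softmax(theta, available)
```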
Step 3: select the workpiece to be processed according to the shop state extracted by the neural network model.
When an action is selected according to the probability P, in order to ensure that the scheduling policy can converge while retaining the ability to jump out of local optima, the action-selection mechanism combines selecting the action with the maximum probability, a = argmax(P), with selecting an action according to the probability distribution, a = random(P), thereby adding uncertainty to the current optimal decision. The mechanism uses a manually set hyper-parameter c and a randomly generated number d ∈ (0, 1): when d > c, the workpiece with the maximum probability is selected for processing; when d < c, the workpiece is selected according to the probability distribution, that is:

$$a = \begin{cases} \arg\max(P), & d > c \\ \operatorname{random}(P), & d < c \end{cases}$$
Step 4: design a multi-Agent interaction mechanism of the job shop to realize the interaction between the workpieces being processed and the shop environment.
In job-shop scheduling, the state S_i changes with the local action set A_i, so the local state S_i follows the local action set A_i. The invention therefore establishes a communication mechanism between agents by defining an action transfer function. When Ag_i processes operation O_{a,b}, a ∈ A_i, then after O_{a,b} is completed, the local action set A_i of Ag_i becomes A_i := A_i − a, and the local action set of Ag_{i′} (i′ = γ(O_{a,b+1})) is expanded to A_{i′} := A_{i′} + a. The invention accordingly defines the action transfer function σ_i:

$$\sigma_i(A_k) = \begin{cases} A_k - a, & k = i \\ A_k + a, & k = i' = \gamma(O_{a,b+1}) \\ A_k, & \text{otherwise} \end{cases}$$

where a denotes the workpiece corresponding to operation O_{a,b}, b denotes the machine tool corresponding to operation O_{a,b}, γ(O_{a,b}) denotes the processing time corresponding to operation O_{a,b}, and k ranges over all machine tools in the job-shop scheduling problem.
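The action-transfer step can be sketched as follows, under the simplifying assumption that each workpiece's route is stored as a list of machine indices; the helper names are not from the patent.

```python
def transfer_action(agents, job, routes, step):
    """After machine i finishes operation O_{job, step}, remove the workpiece from A_i and,
    if a next operation exists, add it to the action set of that operation's machine."""
    i = routes[job][step]                 # machine that just finished O_{job, step}
    agents[i].actions.discard(job)        # A_i := A_i - a
    if step + 1 < len(routes[job]):
        nxt = routes[job][step + 1]       # machine of the next operation O_{job, step+1}
        agents[nxt].actions.add(job)      # A_i' := A_i' + a
```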
Step 5: design a reward function to evaluate the whole scheduling decision, and update the scheduling decision by updating the weight parameters of the neural network with a policy gradient algorithm. As shown in fig. 3, the core of the policy gradient algorithm is to update the policy according to the return function J(θ) after the policy has been executed to completion:

$$J(\theta) = E_\pi\left[G_t\right], \qquad G_t = \sum_{t=0}^{T} \gamma^{t}\, r(t)$$

The function J(θ) denotes the weighted reward obtained when the final state S_f is reached after T steps; the weighting factor γ^t depends on the time step and the discount factor γ, G_t denotes the weighted reward accumulated over the T steps, and E_π[G_t] denotes its weighted average (expectation) under the policy. Owing to the delayed-reward characteristic of the JSP, r(t) is always 0 during the scheduling process; only when scheduling is completed does the JSP objective min(C_max) assign the reward value −C_max. Taking γ = 1, then:

$$G_t = -C_{\max}$$

$$J(\theta) = E_\pi\left[-C_{\max}\right]$$
The policy update follows the gradient of the expected return, so the return function J(θ) is differentiated with respect to the action probability parameter θ_a to obtain the function gradient g_a:

$$g_a = \frac{\partial J(\theta)}{\partial \theta_a}$$

where ∂J(θ)/∂θ_a denotes the partial derivative with respect to the probability parameter θ_a of action a. After the function gradient g_a is obtained, the action probability parameter θ_a of Ag_i is updated as:

θ_a := θ_a + μ_N g_a

where μ_N ∈ R denotes the update rate and N denotes the number of updates.
After the probability parameters θ_a have been updated, the Adadelta optimizer is invoked, using back-propagation, to update the neural-network weight parameters W, which completes the update of the whole policy.
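A simplified REINFORCE-style sketch of this episode-level update is given below; it uses a tabular softmax policy in place of the full neural network and omits the Adadelta/back-propagation step, and all names, shapes and the `episode` format are assumptions made for illustration. Rewards are zero until the schedule completes, the terminal reward is −C_max, and γ = 1.

```python
import numpy as np

def policy_gradient_update(theta, episode, c_max, mu):
    """One policy-gradient update with terminal reward -C_max and gamma = 1.
    `episode` is a list of (state_index, action, prob_vector) tuples collected while
    scheduling; `theta` holds the per-state action weights of a softmax policy."""
    G = -c_max                            # G_t = -C_max; all intermediate r(t) are 0
    for s, a, probs in episode:
        grad_log = -probs                 # gradient of log softmax w.r.t. theta ...
        grad_log[a] += 1.0                # ... equals (one-hot of a) - probs
        theta[s] += mu * G * grad_log     # theta_a := theta_a + mu_N * g_a (gradient ascent)
    return theta
```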

Claims (5)

1. A single job shop scheduling method for multi-Agent deep reinforcement learning is characterized by comprising the following steps:
step 1, performing distributed modeling on the job-shop scheduling environment by adopting a multi-Agent method;
in the multi-Agent reinforcement learning job-shop scheduling process, the global state S is factorized into the local states S_i of the m agents Ag_i, which are input into the multi-Agent reinforcement learning system in turn; the system outputs the action a_i currently executed by Ag_i, which changes the global state S and yields a reward R; this process is repeated until all Ag_i have completed their processing tasks, wherein Ag_i corresponds to the i-th machine tool, i = 1, …, m, m is the total number of machine tools, S_i is the local state of Ag_i, S = {S_1, …, S_i, …, S_m}, and A_i is the local action set of Ag_i;
step 2, constructing a neural network model and extracting the shop state;
the global state S is input into the neural network model, which outputs the probability P of each workpiece being selected for processing; when the neural network model outputs this probability, a probability function P = f(a, S_i, θ_i) oriented to the job-shop scheduling process is adopted, denoting the probability P of executing action a in shop state S_i, wherein θ_i denotes the weights corresponding to the actions in state S_i and is used to make the selection probability of workpieces that cannot be processed yet or are already finished zero:

$$P(a \mid S_i, \theta_i) = \frac{e^{\theta_a^{S_i}}}{\sum_{x \in S_i} e^{\theta_x^{S_i}}}$$

wherein θ_a^{S_i} denotes the weight corresponding to action a in state S_i, θ_x^{S_i} denotes the weight corresponding to action x in state S_i, and x ∈ S_i denotes all actions that can be executed in state S_i;
step 3, selecting the workpiece to be processed according to the shop state extracted by the neural network model:
when an action is selected according to the probability P, the action-selection mechanism combines selecting the action with the maximum probability, a = argmax(P), with selecting an action according to the probability distribution, a = random(P), thereby adding uncertainty to the current optimal decision; the mechanism has a manually set hyper-parameter c and a randomly generated number d ∈ (0, 1); when d > c, the workpiece with the maximum probability is selected for processing, and when d < c, the workpiece is selected according to the probability distribution, namely:

$$a = \begin{cases} \arg\max(P), & d > c \\ \operatorname{random}(P), & d < c \end{cases}$$
step 4, designing a multi-Agent interaction mechanism of the job shop to realize the interaction between the workpieces being processed and the shop environment:
when Ag_i processes operation O_{a,b}, a ∈ A_i, then after O_{a,b} is completed, the local action set A_i of Ag_i becomes A_i := A_i − a, and the local action set of Ag_{i′} (i′ = γ(O_{a,b+1})) is expanded to A_{i′} := A_{i′} + a; the action transfer function σ_i is defined as:

$$\sigma_i(A_k) = \begin{cases} A_k - a, & k = i \\ A_k + a, & k = i' = \gamma(O_{a,b+1}) \\ A_k, & \text{otherwise} \end{cases}$$

wherein a denotes the workpiece corresponding to operation O_{a,b}, b denotes the machine tool corresponding to operation O_{a,b}, γ(O_{a,b}) denotes the processing time corresponding to operation O_{a,b}, and k ranges over all machine tools in the job-shop scheduling problem;
and step 5, designing a reward function to evaluate the whole scheduling decision, and updating the scheduling decision by updating the weight parameters of the neural network with a policy gradient algorithm.
2. The single-piece job-shop scheduling method for multi-Agent deep reinforcement learning according to claim 1, wherein in step 1, the reward R is written R(S, a, S′), denoting the reward value obtained when executing action a in state S leads to state S′.
3. The single-piece job-shop scheduling method for multi-Agent deep reinforcement learning according to claim 1, wherein in step 1, the local state S_i is represented by the local action set A_i, i.e. the workpieces waiting to be machined on the machine tool of Ag_i, together with their corresponding processing times; that is, S_i is the union of A_i and the set of processing times of the actions in A_i.
4. The single-piece job shop scheduling method of multi-Agent deep reinforcement learning according to claim 1, wherein in step 2, the neural network model is composed of an input layer, a hidden layer and an output layer, wherein:
an input layer: converting the job-shop state S_i into vector form and outputting S_i to the first hidden layer h_1; from the input layer to the hidden layer h_1, the tanh activation function is adopted,

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$

W_1 and b_1 respectively denote the weights and biases of the first hidden layer h_1, so that:

h_1 = tanh(W_1 S_i + b_1)

hidden layers: the number of nodes of each hidden layer is set to 20, and no activation function is used from the last hidden layer to the output layer, so that:

h_N = tanh(W_N h_{N-1} + b_N)

wherein h_N denotes the N-th hidden layer, and W_N and b_N respectively denote the weights and biases of the N-th hidden layer h_N;

an output layer: each node θ_a of the output layer corresponds to the probability that action a of Ag_i is selected, and the number of output-layer nodes is set to n.
5. The single-piece job-shop scheduling method for multi-Agent deep reinforcement learning according to claim 4, wherein in step 5, the policy gradient algorithm updates the policy according to the return function J(θ) after the policy has been executed to completion:

$$J(\theta) = E_\pi\left[G_t\right], \qquad G_t = \sum_{t=0}^{T} \gamma^{t}\, r(t)$$

the function J(θ) denotes the weighted reward obtained when the final state S_f is reached after T steps, the weighting factor γ^t depends on the time step and the discount factor γ, G_t denotes the weighted reward accumulated over the T steps, and E_π[G_t] denotes its weighted average (expectation) under the policy; owing to the delayed-reward characteristic of the JSP, r(t) is always 0 during the scheduling process, and only when scheduling is completed does the JSP objective min(C_max) assign the reward value −C_max; taking γ = 1, then:

$$G_t = -C_{\max}$$

$$J(\theta) = E_\pi\left[-C_{\max}\right]$$
the return function J(θ) is differentiated with respect to the action probability parameter θ_a to obtain the function gradient g_a:

$$g_a = \frac{\partial J(\theta)}{\partial \theta_a}$$

wherein ∂J(θ)/∂θ_a denotes the partial derivative with respect to the probability parameter θ_a of action a; after the function gradient g_a is obtained, the action probability parameter θ_a of Ag_i is updated as:
θ_a := θ_a + μ_N g_a

wherein μ_N ∈ R denotes the update rate and N denotes the number of updates;
after the probability parameters θ_a have been updated, the Adadelta optimizer is invoked, using back-propagation, to update the neural-network weight parameters W, which completes the update of the whole policy.
CN202010380488.0A 2020-05-08 2020-05-08 Single-piece job shop scheduling method for multi-Agent deep reinforcement learning Active CN111985672B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010380488.0A CN111985672B (en) 2020-05-08 2020-05-08 Single-piece job shop scheduling method for multi-Agent deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010380488.0A CN111985672B (en) 2020-05-08 2020-05-08 Single-piece job shop scheduling method for multi-Agent deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN111985672A true CN111985672A (en) 2020-11-24
CN111985672B CN111985672B (en) 2021-08-27

Family

ID=73441772

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010380488.0A Active CN111985672B (en) 2020-05-08 2020-05-08 Single-piece job shop scheduling method for multi-Agent deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN111985672B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112150088A (en) * 2020-11-26 2020-12-29 深圳市万邑通信息科技有限公司 Huff-puff flexible intelligent assembly logistics path planning method and system
CN112598309A (en) * 2020-12-29 2021-04-02 浙江工业大学 Job shop scheduling method based on Keras
CN112700099A (en) * 2020-12-24 2021-04-23 亿景智联(北京)科技有限公司 Resource scheduling planning method based on reinforcement learning and operation research
CN112884239A (en) * 2021-03-12 2021-06-01 重庆大学 Aerospace detonator production scheduling method based on deep reinforcement learning
CN113093673A (en) * 2021-03-31 2021-07-09 南京大学 Method for optimizing workshop operation schedule by using mean field action value learning
CN113222253A (en) * 2021-05-13 2021-08-06 珠海埃克斯智能科技有限公司 Scheduling optimization method, device and equipment and computer readable storage medium
CN113361915A (en) * 2021-06-04 2021-09-07 聪明工厂有限公司 Flexible job shop scheduling method based on deep reinforcement learning and multi-agent graph

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102571570A (en) * 2011-12-27 2012-07-11 广东电网公司电力科学研究院 Network flow load balancing control method based on reinforcement learning
CN103248693A (en) * 2013-05-03 2013-08-14 东南大学 Large-scale self-adaptive composite service optimization method based on multi-agent reinforced learning
CN108282587A (en) * 2018-01-19 2018-07-13 重庆邮电大学 Mobile customer service dialogue management method under being oriented to strategy based on status tracking
CN108573303A (en) * 2018-04-25 2018-09-25 北京航空航天大学 It is a kind of that recovery policy is improved based on the complex network local failure for improving intensified learning certainly
CN110084375A (en) * 2019-04-26 2019-08-02 东南大学 A kind of hierarchy division frame based on deeply study
CN110648049A (en) * 2019-08-21 2020-01-03 北京大学 Multi-agent-based resource allocation method and system
CN110691422A (en) * 2019-10-06 2020-01-14 湖北工业大学 Multi-channel intelligent access method based on deep reinforcement learning
CN110991972A (en) * 2019-12-14 2020-04-10 中国科学院深圳先进技术研究院 Cargo transportation system based on multi-agent reinforcement learning

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102571570A (en) * 2011-12-27 2012-07-11 广东电网公司电力科学研究院 Network flow load balancing control method based on reinforcement learning
CN103248693A (en) * 2013-05-03 2013-08-14 东南大学 Large-scale self-adaptive composite service optimization method based on multi-agent reinforced learning
CN108282587A (en) * 2018-01-19 2018-07-13 重庆邮电大学 Mobile customer service dialogue management method under being oriented to strategy based on status tracking
CN108573303A (en) * 2018-04-25 2018-09-25 北京航空航天大学 It is a kind of that recovery policy is improved based on the complex network local failure for improving intensified learning certainly
CN110084375A (en) * 2019-04-26 2019-08-02 东南大学 A kind of hierarchy division frame based on deeply study
CN110648049A (en) * 2019-08-21 2020-01-03 北京大学 Multi-agent-based resource allocation method and system
CN110691422A (en) * 2019-10-06 2020-01-14 湖北工业大学 Multi-channel intelligent access method based on deep reinforcement learning
CN110991972A (en) * 2019-12-14 2020-04-10 中国科学院深圳先进技术研究院 Cargo transportation system based on multi-agent reinforcement learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
吉靖 (Ji Jing): "Modeling and Optimization of Shop Scheduling under Local Perception" (局部感知情形下的车间调度建模与优化), China Master's Theses Full-text Database, Economics and Management Sciences *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112150088A (en) * 2020-11-26 2020-12-29 深圳市万邑通信息科技有限公司 Huff-puff flexible intelligent assembly logistics path planning method and system
CN112700099A (en) * 2020-12-24 2021-04-23 亿景智联(北京)科技有限公司 Resource scheduling planning method based on reinforcement learning and operation research
CN112598309A (en) * 2020-12-29 2021-04-02 浙江工业大学 Job shop scheduling method based on Keras
CN112598309B (en) * 2020-12-29 2022-04-19 浙江工业大学 Job shop scheduling method based on Keras
CN112884239A (en) * 2021-03-12 2021-06-01 重庆大学 Aerospace detonator production scheduling method based on deep reinforcement learning
CN112884239B (en) * 2021-03-12 2023-12-19 重庆大学 Space detonator production scheduling method based on deep reinforcement learning
CN113093673A (en) * 2021-03-31 2021-07-09 南京大学 Method for optimizing workshop operation schedule by using mean field action value learning
CN113222253A (en) * 2021-05-13 2021-08-06 珠海埃克斯智能科技有限公司 Scheduling optimization method, device and equipment and computer readable storage medium
CN113222253B (en) * 2021-05-13 2022-09-30 珠海埃克斯智能科技有限公司 Scheduling optimization method, device, equipment and computer readable storage medium
CN113361915A (en) * 2021-06-04 2021-09-07 聪明工厂有限公司 Flexible job shop scheduling method based on deep reinforcement learning and multi-agent graph

Also Published As

Publication number Publication date
CN111985672B (en) 2021-08-27

Similar Documents

Publication Publication Date Title
CN111985672B (en) Single-piece job shop scheduling method for multi-Agent deep reinforcement learning
CN104268722B (en) Dynamic flexible job-shop scheduling method based on multi-objective Evolutionary Algorithm
CN108846570B (en) Method for solving resource-limited project scheduling problem
CN113792924A (en) Single-piece job shop scheduling method based on Deep reinforcement learning of Deep Q-network
CN113011612B (en) Production and maintenance scheduling method and system based on improved wolf algorithm
Ueda et al. An emergent synthesis approach to simultaneous process planning and scheduling
CN115454005A (en) Manufacturing workshop dynamic intelligent scheduling method and device oriented to limited transportation resource scene
CN115130789A (en) Distributed manufacturing intelligent scheduling method based on improved wolf optimization algorithm
CN112348314A (en) Distributed flexible workshop scheduling method and system with crane
Cao et al. An adaptive multi-strategy artificial bee colony algorithm for integrated process planning and scheduling
Xue et al. Estimation of distribution evolution memetic algorithm for the unrelated parallel-machine green scheduling problem
Bekker Applying the cross-entropy method in multi-objective optimisation of dynamic stochastic systems
CN113139747A (en) Method for reordering coating of work returning vehicle based on deep reinforcement learning
CN113406939A (en) Unrelated parallel machine dynamic hybrid flow shop scheduling method based on deep Q network
Li et al. An improved whale optimisation algorithm for distributed assembly flow shop with crane transportation
CN112488543B (en) Intelligent work site intelligent scheduling method and system based on machine learning
Yan et al. A job shop scheduling approach based on simulation optimization
Kim Permutation-based elitist genetic algorithm using serial scheme for large-sized resource-constrained project scheduling
Nugraheni et al. Hybrid Metaheuristics for Job Shop Scheduling Problems.
Han et al. Research on optimization method of routing buffer linkage based on Q-learning
CN112734286B (en) Workshop scheduling method based on multi-strategy deep reinforcement learning
Fujii et al. Integration of process planning and scheduling using multi-agent learning
CN117950379A (en) Intelligent workshop real-time rescheduling method based on deep circulation Q network
CN117215275B (en) Large-scale dynamic double-effect scheduling method for flexible workshop based on genetic programming
CN116384602A (en) Multi-target vehicle path optimization method, system, electronic equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant