CN111985672A - Single-piece job shop scheduling method for multi-Agent deep reinforcement learning - Google Patents

Single-piece job shop scheduling method for multi-Agent deep reinforcement learning Download PDF

Info

Publication number
CN111985672A
Authority
CN
China
Prior art keywords
action
job shop
probability
scheduling
state
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010380488.0A
Other languages
Chinese (zh)
Other versions
CN111985672B (en)
Inventor
张洁
赵树煊
汪俊亮
贺俊杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Donghua University
National Dong Hwa University
Original Assignee
Donghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Donghua University filed Critical Donghua University
Priority to CN202010380488.0A priority Critical patent/CN111985672B/en
Publication of CN111985672A publication Critical patent/CN111985672A/en
Application granted granted Critical
Publication of CN111985672B publication Critical patent/CN111985672B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0631 Resource planning, allocation, distributing or scheduling for enterprises or organisations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/067 Enterprise or organisation modelling
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/04 Manufacturing
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Physics & Mathematics (AREA)
  • Strategic Management (AREA)
  • Theoretical Computer Science (AREA)
  • Economics (AREA)
  • General Physics & Mathematics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • General Business, Economics & Management (AREA)
  • Tourism & Hospitality (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Health & Medical Sciences (AREA)
  • Game Theory and Decision Science (AREA)
  • Development Economics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Educational Administration (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Manufacturing & Machinery (AREA)
  • Primary Health Care (AREA)
  • General Factory Administration (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Multi-Process Working Machines And Systems (AREA)

Abstract

The invention provides a single-piece job-shop scheduling method based on multi-Agent deep reinforcement learning, addressing the complex constraints and large solution space of the single-piece job-shop scheduling problem and the inability of traditional mathematical programming and meta-heuristic algorithms to solve large-scale instances quickly. First, a communication mechanism among the agents is designed, and the job-shop scheduling problem is modeled as a multi-Agent reinforcement learning problem. Second, a deep neural network is constructed to extract the shop state, and an action-selection mechanism is designed on this basis, realizing the interaction between the workpieces being processed and the shop environment. Third, a reward function is designed to evaluate the whole scheduling decision, and the scheduling decision is updated with a policy gradient algorithm to obtain better scheduling results. Finally, the performance of the algorithm is evaluated and verified on a standard benchmark data set. The method solves the job-shop scheduling problem and enriches the methodology for this class of problems.

Description

Single-piece job shop scheduling method for multi-Agent deep reinforcement learning
Technical Field
The invention relates to the field of shop scheduling and addresses the single-piece job-shop scheduling problem, the most common scheduling problem in production.
Background
The manufacturing industry is a pillar industry of China. Modern manufacturing enterprises involve many production links and complex cooperation relationships, so reasonable production scheduling is of great significance for improving production efficiency, reducing costs and shortening the production cycle. The job-shop scheduling problem (JSP) is the most common shop scheduling problem; it reflects the mapping between manufacturing tasks and resource allocation under constraints such as shop materials and processes.
The JSP is difficult to solve and is a typical NP-hard problem. At present, the common methods for solving it are exact optimization methods and meta-heuristic algorithms. Optimization methods model and solve the job-shop scheduling problem with mathematical programming and, depending on the specific problem, can be formulated as integer programming, mixed-integer programming or dynamic programming. Meta-heuristic algorithms obtain a near-optimal solution in a short time through continuous iterative optimization and, according to the optimization strategy, include local search, tabu search, simulated annealing, genetic algorithms, particle swarm optimization, artificial neural networks and others.
In recent years, the rise of deep reinforcement learning has provided a new idea for solving the JSP. In 2017, Irwan used deep reinforcement learning to train a model with decision-making capability and successfully solved the travelling salesman problem (TSP), a classic NP-hard problem. Hanjun Dai used deep reinforcement learning to solve combinatorial optimization problems. In 2019, Xiao et al. used deep reinforcement learning to solve the flow-shop scheduling problem. At present, research on applying deep reinforcement learning to the JSP is still lacking in China.
Disclosure of Invention
The purpose of the invention is to realize distributed reinforcement learning modeling of the single-piece job-shop scheduling problem, scheduling decisions based on a neural network, and scheduling decision optimization based on the policy gradient algorithm.
In order to achieve this aim, the technical scheme of the invention provides a single-piece job-shop scheduling method based on multi-Agent deep reinforcement learning, characterized by comprising the following steps:
Step 1: perform distributed modeling of the job-shop scheduling environment with a multi-Agent method. In the multi-Agent reinforcement learning scheduling process, the global state S is factorized into the local states S_i of the m agents Ag_i, which are input into the multi-Agent reinforcement learning system in turn; the system outputs the action a_i currently executed by Ag_i, which changes the global state S and yields a reward R. This process is repeated until all Ag_i have completed their processing tasks. Here Ag_i corresponds to the i-th machine tool, i = 1, ..., m, where m is the total number of machine tools; S_i is the local state of Ag_i, S = {S_1, ..., S_i, ..., S_m}, and A_i is the local action set of Ag_i;
Step 2: construct a neural network model and extract the shop state.
The global state S is input into the neural network model, which outputs the probability P of each workpiece being selected for processing. When the neural network model outputs this probability, a probability function P = f(a, S_i, θ_i) oriented to the job-shop scheduling process is adopted; it denotes the probability P of executing action a in shop state S_i, where θ_i denotes the weights corresponding to the actions in state S_i and is used to force the selection probability of workpieces that cannot be processed yet or are already finished to zero:

$$P(a \mid S_i, \theta_i) = \frac{e^{\theta_a^{S_i}}}{\sum_{x \in S_i} e^{\theta_x^{S_i}}}$$

where θ_a^{S_i} denotes the weight corresponding to action a in state S_i, θ_x^{S_i} denotes the weight corresponding to action x in state S_i, and x ∈ S_i denotes all actions that can be executed in state S_i;
Step 3: select the workpiece to be processed according to the shop state extracted by the neural network model.
When an action is selected according to the probability P, the action-selection mechanism combines selecting the action with the maximum probability, a = argmax(P), with selecting an action according to the probability distribution, a = random(P), thereby adding uncertainty to the current optimal decision. The mechanism uses a manually set hyper-parameter c and a randomly generated number d ∈ (0, 1): when d > c, the workpiece with the maximum probability is selected for processing; when d < c, the workpiece is selected according to the probability distribution, that is:

$$a = \begin{cases} \arg\max(P), & d > c \\ \operatorname{random}(P), & d < c \end{cases}$$
Step 4: design a multi-Agent interaction mechanism of the job shop to realize the interaction between the workpieces being processed and the shop environment.
When Ag_i processes operation O_{a,b}, a ∈ A_i, then after O_{a,b} is completed, the local action set A_i of Ag_i becomes A_i := A_i − a, and the local action set of Ag_{i′} (i′ = γ(O_{a,b+1})) is expanded to A_{i′} := A_{i′} + a. The action transfer function σ_i is defined as:

$$\sigma_i(A_k) = \begin{cases} A_k - a, & k = i \\ A_k + a, & k = i' = \gamma(O_{a,b+1}) \\ A_k, & \text{otherwise} \end{cases}$$

where a denotes the workpiece corresponding to operation O_{a,b}, b denotes the machine tool corresponding to operation O_{a,b}, γ(O_{a,b}) denotes the processing time corresponding to operation O_{a,b}, and k ranges over all machine tools in the job-shop scheduling problem;
Step 5: design a reward function to evaluate the whole scheduling decision, and update the scheduling decision by updating the weight parameters of the neural network with a policy gradient algorithm.
Preferably, in step 1, the reward R is written R(S, a, S′), denoting the reward value obtained when executing action a in state S leads to state S′.
Preferably, in step 1, the local state S_i is represented by the local action set A_i, i.e. the workpieces waiting to be machined on the machine tool of Ag_i, together with their corresponding processing times; that is, S_i is the union of A_i and the set of processing times of the actions in A_i.
Preferably, in step 2, the neural network model is composed of an input layer, a hidden layer, and an output layer, where:
Input layer: converts the job-shop state S_i into vector form and outputs S_i to the first hidden layer h_1. From the input layer to the hidden layer h_1, the tanh activation function is adopted,

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$

W_1 and b_1 denote the weights and biases of the first hidden layer h_1, so that:

h_1 = tanh(W_1 S_i + b_1)

Hidden layers: the number of nodes of each hidden layer is set to 20, and no activation function is used from the last hidden layer to the output layer; thus:

h_N = tanh(W_N h_{N-1} + b_N)

where h_N denotes the N-th hidden layer, and W_N and b_N denote the weights and biases of the N-th hidden layer h_N;

Output layer: each node θ_a of the output layer corresponds to the probability that action a of Ag_i is selected, and the number of output-layer nodes is set to n.
Preferably, in step 5, the policy gradient algorithm updates the policy according to the return function J(θ) after the policy has been executed to completion:

$$J(\theta) = E_\pi\left[G_t\right], \qquad G_t = \sum_{t=0}^{T} \gamma^{t}\, r(t)$$

The function J(θ) denotes the weighted reward obtained when the final state S_f is reached after T steps; the weighting factor γ^t depends on the time step and the discount factor γ, G_t denotes the weighted reward accumulated over the T steps, and E_π[G_t] denotes its weighted average (expectation) under the policy. Owing to the delayed-reward characteristic of the JSP, r(t) is always 0 during the scheduling process; only when scheduling is completed does the JSP objective min(C_max) assign the reward value −C_max. Taking γ = 1 in the formula above, then:

$$G_t = -C_{\max}$$

$$J(\theta) = E_\pi\left[-C_{\max}\right]$$
Differentiating the return function J(θ) with respect to the action probability parameter θ_a gives the function gradient g_a:

$$g_a = \frac{\partial J(\theta)}{\partial \theta_a}$$

where ∂J(θ)/∂θ_a denotes the partial derivative with respect to the probability parameter θ_a of action a;
after the function gradient g_a is obtained, the action probability parameter θ_a of Ag_i is updated as:

θ_a := θ_a + μ_N g_a

where μ_N ∈ R denotes the update rate and N denotes the number of updates;
after the probability parameters θ_a have been updated, the Adadelta optimizer is invoked, using back-propagation, to update the neural-network weight parameters W, which completes the update of the whole policy.
The invention provides a single-piece job-shop scheduling method based on multi-Agent deep reinforcement learning, addressing the complex constraints and large solution space of the single-piece job-shop scheduling problem and the inability of traditional mathematical programming and meta-heuristic algorithms to solve large-scale instances quickly. First, a communication mechanism among the agents is designed, and the job-shop scheduling problem is modeled as a multi-Agent reinforcement learning problem. Second, a deep neural network is constructed to extract the shop state, and an action-selection mechanism is designed on this basis, realizing the interaction between the workpieces being processed and the shop environment. Third, a reward function is designed to evaluate the whole scheduling decision, and the scheduling decision is updated with a policy gradient algorithm to obtain better scheduling results. Finally, the performance of the algorithm is evaluated and verified on a standard benchmark data set. The method solves the single-piece job-shop scheduling problem and enriches the methodology for this class of problems.
Drawings
FIG. 1 is a multi-Agent reinforcement learning model;
FIG. 2 is a deep neural network model structure;
FIG. 3 is a Policy Gradient flow chart;
FIG. 4 is a graph of FT06 objective function;
FIG. 5 is an optimal solution to the FT06 problem;
FIG. 6 is a flow chart of the present invention.
Detailed Description
The invention will be further illustrated with reference to the following specific examples. It should be understood that these examples are for illustrative purposes only and are not intended to limit the scope of the present invention. Further, it should be understood that various changes or modifications of the present invention may be made by those skilled in the art after reading the teaching of the present invention, and such equivalents may fall within the scope of the present invention as defined in the appended claims.
The invention provides a single job shop scheduling method for multi-Agent deep reinforcement learning, which comprises the following steps:
Step 1: perform distributed modeling of the single-piece job-shop scheduling environment with a multi-Agent method.
As shown in fig. 1, the multi-Agent reinforcement learning model of the present invention includes the following contents:
Ag = {Ag_1, …, Ag_i, …, Ag_m}, where Ag_i is the i-th machine tool, i = 1, …, m.
S = {S_1, …, S_i, …, S_m} is the global state composed of the local states S_i of all Ag_i.
A = {A_1, …, A_i, …, A_m} is the global action set composed of the local action sets A_i of all Ag_i; a_i denotes the action performed by Ag_i at the current time.
P is the state transition probability matrix; the transition function P(S′ | S, a) denotes the probability that executing action a in state S leads to state S′.
R is the reward function; R(S, a, S′) denotes the reward value obtained when executing action a in state S leads to state S′.
γ is the discount factor, γ ∈ [0, 1].
In the multi-Agent reinforcement learning job-shop scheduling process, the global state S is factorized into the local states S_i of the m agents Ag_i, which are input into the multi-Agent reinforcement learning system in turn; the system outputs the action a_i currently executed by Ag_i, which changes the global state S and yields a reward R. This process is repeated until all Ag_i have completed their processing tasks. The local state S_i is represented by the local action set A_i, i.e. the workpieces waiting to be machined on the machine tool of Ag_i, together with their corresponding processing times; that is, S_i is the union of A_i and the set of processing times of the actions in A_i.
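The following minimal Python sketch illustrates one way to hold the factorized states described above; it is an illustration only, and the class and field names are assumptions rather than part of the patent. Each agent owns a machine, a local action set of waiting workpieces, and the corresponding processing times.

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    """Ag_i: one machine tool with its local action set A_i and processing times."""
    machine_id: int
    actions: set = field(default_factory=set)       # workpieces currently waiting on this machine
    proc_time: dict = field(default_factory=dict)   # workpiece -> processing time of its next operation

    def local_state(self):
        """S_i: the waiting workpieces together with their processing times."""
        return {job: self.proc_time[job] for job in self.actions}


class JobShopEnv:
    """Global state S is the collection of all local states S_1 .. S_m."""
    def __init__(self, num_machines):
        self.agents = [Agent(i) for i in range(num_machines)]

    def global_state(self):
        return [ag.local_state() for ag in self.agents]
```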
Step 2: construct a neural network model and extract the shop state.
In this embodiment, the neural network model is composed of an input layer, a hidden layer, and an output layer.
Input layer: the number of nodes of the input layer is set to 10; the layer converts the job-shop state S_i into vector form and outputs S_i to the first hidden layer h_1. From the input layer to the hidden layer h_1, the tanh activation function is adopted,

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$

W_1 and b_1 denote the weights and biases of the first hidden layer h_1, so that:

h_1 = tanh(W_1 S_i + b_1)

Hidden layers: the number of nodes of each hidden layer is set to 20, and no activation function is used from the last hidden layer to the output layer; thus:

h_N = tanh(W_N h_{N-1} + b_N)

where h_N denotes the N-th hidden layer, and W_N and b_N denote the weights and biases of the N-th hidden layer h_N.

Output layer: each node θ_a of the output layer corresponds to the probability that action a of Ag_i is selected, and the number of output-layer nodes is set to n. Here θ_a^{S_i} denotes the weight corresponding to action a in state S_i, θ_x^{S_i} denotes the weight corresponding to action x in state S_i, and x ∈ S_i denotes all actions that can be executed in state S_i.
In this embodiment, the global state S is input into the neural network model, and the model outputs the probability P of each workpiece being selected for processing; the tanh activation function given above is used between the input layer and the hidden layer. When the neural network model outputs the probabilities, workpieces that are already finished and workpieces that cannot currently be machined must not be selected during job-shop scheduling, and the probabilities must sum to one:

$$\sum_{a \in S_i} P(a \mid S_i, \theta_i) = 1$$
To this end, the invention designs, on the basis of the Softmax function, a probability function P = f(a, S_i, θ_i) oriented to the job-shop scheduling process; it denotes the probability P of executing action a in shop state S_i, where θ_i denotes the weights corresponding to the actions in state S_i and is used to make the selection probability of workpieces that cannot be processed yet or are already finished zero:

$$P(a \mid S_i, \theta_i) = \frac{e^{\theta_a^{S_i}}}{\sum_{x \in S_i} e^{\theta_x^{S_i}}}$$

where θ_a^{S_i} denotes the weight corresponding to action a in state S_i, θ_x^{S_i} denotes the weight corresponding to action x in state S_i, and x ∈ S_i denotes all actions that can be executed in state S_i.
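As a rough sketch of this state-extraction network and the masked Softmax output, the snippet below follows the layer sizes of the embodiment (10 input nodes, 20 hidden nodes); the NumPy implementation, the function names and the masking approach are illustrative assumptions, not the patented implementation.

```python
import numpy as np

def tanh_layer(x, W, b):
    # h = tanh(W x + b), used from the input layer through the hidden layers
    return np.tanh(W @ x + b)

def masked_softmax(theta, available):
    """P(a | S_i, theta_i): Softmax over the action weights theta, with the probability
    of workpieces that are finished or not yet available forced to zero."""
    mask = np.zeros_like(theta, dtype=bool)
    mask[list(available)] = True
    exp = np.exp(theta - theta[mask].max())   # shift for numerical stability
    exp[~mask] = 0.0                          # zero probability for non-selectable workpieces
    return exp / exp.sum()

def forward(state_vec, W1, b1, W2, b2, available):
    h1 = tanh_layer(state_vec, W1, b1)        # input (10) -> hidden (20), tanh activation
    theta = W2 @ h1 + b2                      # hidden (20) -> output (n), no activation
    return masked_softmax(theta, available)
```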
Step 3: select the workpiece to be processed according to the shop state extracted by the neural network model.
When an action is selected according to the probability P, in order to ensure that the scheduling policy can converge while retaining the ability to jump out of local optima, the action-selection mechanism combines selecting the action with the maximum probability, a = argmax(P), with selecting an action according to the probability distribution, a = random(P), thereby adding uncertainty to the current optimal decision. The mechanism uses a manually set hyper-parameter c and a randomly generated number d ∈ (0, 1): when d > c, the workpiece with the maximum probability is selected for processing; when d < c, the workpiece is selected according to the probability distribution, that is:

$$a = \begin{cases} \arg\max(P), & d > c \\ \operatorname{random}(P), & d < c \end{cases}$$
Step 4: design a multi-Agent interaction mechanism of the job shop to realize the interaction between the workpieces being processed and the shop environment.
In job-shop scheduling, the state S_i changes with the local action set A_i, so the local state S_i follows the local action set A_i. The invention therefore establishes a communication mechanism between agents by defining an action transfer function. When Ag_i processes operation O_{a,b}, a ∈ A_i, then after O_{a,b} is completed, the local action set A_i of Ag_i becomes A_i := A_i − a, and the local action set of Ag_{i′} (i′ = γ(O_{a,b+1})) is expanded to A_{i′} := A_{i′} + a. The invention accordingly defines the action transfer function σ_i:

$$\sigma_i(A_k) = \begin{cases} A_k - a, & k = i \\ A_k + a, & k = i' = \gamma(O_{a,b+1}) \\ A_k, & \text{otherwise} \end{cases}$$

where a denotes the workpiece corresponding to operation O_{a,b}, b denotes the machine tool corresponding to operation O_{a,b}, γ(O_{a,b}) denotes the processing time corresponding to operation O_{a,b}, and k ranges over all machine tools in the job-shop scheduling problem.
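The action-transfer step can be sketched as follows, under the simplifying assumption that each workpiece's route is stored as a list of machine indices; the helper names are not from the patent.

```python
def transfer_action(agents, job, routes, step):
    """After machine i finishes operation O_{job, step}, remove the workpiece from A_i and,
    if a next operation exists, add it to the action set of that operation's machine."""
    i = routes[job][step]                 # machine that just finished O_{job, step}
    agents[i].actions.discard(job)        # A_i := A_i - a
    if step + 1 < len(routes[job]):
        nxt = routes[job][step + 1]       # machine of the next operation O_{job, step+1}
        agents[nxt].actions.add(job)      # A_i' := A_i' + a
```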
Step 5: design a reward function to evaluate the whole scheduling decision, and update the scheduling decision by updating the weight parameters of the neural network with a policy gradient algorithm. As shown in fig. 3, the core of the policy gradient algorithm is to update the policy according to the return function J(θ) after the policy has been executed to completion:

$$J(\theta) = E_\pi\left[G_t\right], \qquad G_t = \sum_{t=0}^{T} \gamma^{t}\, r(t)$$

The function J(θ) denotes the weighted reward obtained when the final state S_f is reached after T steps; the weighting factor γ^t depends on the time step and the discount factor γ, G_t denotes the weighted reward accumulated over the T steps, and E_π[G_t] denotes its weighted average (expectation) under the policy. Owing to the delayed-reward characteristic of the JSP, r(t) is always 0 during the scheduling process; only when scheduling is completed does the JSP objective min(C_max) assign the reward value −C_max. Taking γ = 1, then:

$$G_t = -C_{\max}$$

$$J(\theta) = E_\pi\left[-C_{\max}\right]$$
The policy update follows the gradient of the expected return, so the return function J(θ) is differentiated with respect to the action probability parameter θ_a to obtain the function gradient g_a:

$$g_a = \frac{\partial J(\theta)}{\partial \theta_a}$$

where ∂J(θ)/∂θ_a denotes the partial derivative with respect to the probability parameter θ_a of action a. After the function gradient g_a is obtained, the action probability parameter θ_a of Ag_i is updated as:

θ_a := θ_a + μ_N g_a

where μ_N ∈ R denotes the update rate and N denotes the number of updates.
After the probability parameters θ_a have been updated, the Adadelta optimizer is invoked, using back-propagation, to update the neural-network weight parameters W, which completes the update of the whole policy.
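A simplified REINFORCE-style sketch of this episode-level update is given below; it uses a tabular softmax policy in place of the full neural network and omits the Adadelta/back-propagation step, and all names, shapes and the `episode` format are assumptions made for illustration. Rewards are zero until the schedule completes, the terminal reward is −C_max, and γ = 1.

```python
import numpy as np

def policy_gradient_update(theta, episode, c_max, mu):
    """One policy-gradient update with terminal reward -C_max and gamma = 1.
    `episode` is a list of (state_index, action, prob_vector) tuples collected while
    scheduling; `theta` holds the per-state action weights of a softmax policy."""
    G = -c_max                            # G_t = -C_max; all intermediate r(t) are 0
    for s, a, probs in episode:
        grad_log = -probs                 # gradient of log softmax w.r.t. theta ...
        grad_log[a] += 1.0                # ... equals (one-hot of a) - probs
        theta[s] += mu * G * grad_log     # theta_a := theta_a + mu_N * g_a (gradient ascent)
    return theta
```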

Claims (5)

1. A single job shop scheduling method for multi-Agent deep reinforcement learning is characterized by comprising the following steps:
step 1, performing distributed modeling on the job-shop scheduling environment by adopting a multi-Agent method;
in the multi-Agent reinforcement learning job-shop scheduling process, the global state S is factorized into the local states S_i of the m agents Ag_i, which are input into the multi-Agent reinforcement learning system in turn; the system outputs the action a_i currently executed by Ag_i, which changes the global state S and yields a reward R; this process is repeated until all Ag_i have completed their processing tasks, wherein Ag_i corresponds to the i-th machine tool, i = 1, …, m, m is the total number of machine tools, S_i is the local state of Ag_i, S = {S_1, …, S_i, …, S_m}, and A_i is the local action set of Ag_i;
step 2, constructing a neural network model and extracting the shop state;
the global state S is input into the neural network model, which outputs the probability P of each workpiece being selected for processing; when the neural network model outputs this probability, a probability function P = f(a, S_i, θ_i) oriented to the job-shop scheduling process is adopted, denoting the probability P of executing action a in shop state S_i, wherein θ_i denotes the weights corresponding to the actions in state S_i and is used to make the selection probability of workpieces that cannot be processed yet or are already finished zero:

$$P(a \mid S_i, \theta_i) = \frac{e^{\theta_a^{S_i}}}{\sum_{x \in S_i} e^{\theta_x^{S_i}}}$$

wherein θ_a^{S_i} denotes the weight corresponding to action a in state S_i, θ_x^{S_i} denotes the weight corresponding to action x in state S_i, and x ∈ S_i denotes all actions that can be executed in state S_i;
step 3, selecting the workpiece to be processed according to the shop state extracted by the neural network model:
when an action is selected according to the probability P, the action-selection mechanism combines selecting the action with the maximum probability, a = argmax(P), with selecting an action according to the probability distribution, a = random(P), thereby adding uncertainty to the current optimal decision; the mechanism has a manually set hyper-parameter c and a randomly generated number d ∈ (0, 1); when d > c, the workpiece with the maximum probability is selected for processing, and when d < c, the workpiece is selected according to the probability distribution, namely:

$$a = \begin{cases} \arg\max(P), & d > c \\ \operatorname{random}(P), & d < c \end{cases}$$
step 4, designing a multi-Agent interaction mechanism of the job shop to realize the interaction between the workpieces being processed and the shop environment:
when Ag_i processes operation O_{a,b}, a ∈ A_i, then after O_{a,b} is completed, the local action set A_i of Ag_i becomes A_i := A_i − a, and the local action set of Ag_{i′} (i′ = γ(O_{a,b+1})) is expanded to A_{i′} := A_{i′} + a; the action transfer function σ_i is defined as:

$$\sigma_i(A_k) = \begin{cases} A_k - a, & k = i \\ A_k + a, & k = i' = \gamma(O_{a,b+1}) \\ A_k, & \text{otherwise} \end{cases}$$

wherein a denotes the workpiece corresponding to operation O_{a,b}, b denotes the machine tool corresponding to operation O_{a,b}, γ(O_{a,b}) denotes the processing time corresponding to operation O_{a,b}, and k ranges over all machine tools in the job-shop scheduling problem;
and step 5, designing a reward function to evaluate the whole scheduling decision, and updating the scheduling decision by updating the weight parameters of the neural network with a policy gradient algorithm.
2. The single-piece job-shop scheduling method for multi-Agent deep reinforcement learning according to claim 1, wherein in step 1, the reward R is written R(S, a, S′), denoting the reward value obtained when executing action a in state S leads to state S′.
3. The single-piece job-shop scheduling method for multi-Agent deep reinforcement learning according to claim 1, wherein in step 1, the local state S_i is represented by the local action set A_i, i.e. the workpieces waiting to be machined on the machine tool of Ag_i, together with their corresponding processing times; that is, S_i is the union of A_i and the set of processing times of the actions in A_i.
4. The single-piece job shop scheduling method of multi-Agent deep reinforcement learning according to claim 1, wherein in step 2, the neural network model is composed of an input layer, a hidden layer and an output layer, wherein:
an input layer: converting the job-shop state S_i into vector form and outputting S_i to the first hidden layer h_1; from the input layer to the hidden layer h_1, the tanh activation function is adopted,

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$

W_1 and b_1 respectively denote the weights and biases of the first hidden layer h_1, so that:

h_1 = tanh(W_1 S_i + b_1)

hidden layers: the number of nodes of each hidden layer is set to 20, and no activation function is used from the last hidden layer to the output layer, so that:

h_N = tanh(W_N h_{N-1} + b_N)

wherein h_N denotes the N-th hidden layer, and W_N and b_N respectively denote the weights and biases of the N-th hidden layer h_N;

an output layer: each node θ_a of the output layer corresponds to the probability that action a of Ag_i is selected, and the number of output-layer nodes is set to n.
5. The single-piece job-shop scheduling method for multi-Agent deep reinforcement learning according to claim 4, wherein in step 5, the policy gradient algorithm updates the policy according to the return function J(θ) after the policy has been executed to completion:

$$J(\theta) = E_\pi\left[G_t\right], \qquad G_t = \sum_{t=0}^{T} \gamma^{t}\, r(t)$$

the function J(θ) denotes the weighted reward obtained when the final state S_f is reached after T steps, the weighting factor γ^t depends on the time step and the discount factor γ, G_t denotes the weighted reward accumulated over the T steps, and E_π[G_t] denotes its weighted average (expectation) under the policy; owing to the delayed-reward characteristic of the JSP, r(t) is always 0 during the scheduling process, and only when scheduling is completed does the JSP objective min(C_max) assign the reward value −C_max; taking γ = 1, then:

$$G_t = -C_{\max}$$

$$J(\theta) = E_\pi\left[-C_{\max}\right]$$
the return function J(θ) is differentiated with respect to the action probability parameter θ_a to obtain the function gradient g_a:

$$g_a = \frac{\partial J(\theta)}{\partial \theta_a}$$

wherein ∂J(θ)/∂θ_a denotes the partial derivative with respect to the probability parameter θ_a of action a; after the function gradient g_a is obtained, the action probability parameter θ_a of Ag_i is updated as:
θ_a := θ_a + μ_N g_a

wherein μ_N ∈ R denotes the update rate and N denotes the number of updates;
after the probability parameters θ_a have been updated, the Adadelta optimizer is invoked, using back-propagation, to update the neural-network weight parameters W, which completes the update of the whole policy.
CN202010380488.0A 2020-05-08 2020-05-08 Single-piece job shop scheduling method for multi-Agent deep reinforcement learning Active CN111985672B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010380488.0A CN111985672B (en) 2020-05-08 2020-05-08 Single-piece job shop scheduling method for multi-Agent deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010380488.0A CN111985672B (en) 2020-05-08 2020-05-08 Single-piece job shop scheduling method for multi-Agent deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN111985672A true CN111985672A (en) 2020-11-24
CN111985672B CN111985672B (en) 2021-08-27

Family

ID=73441772

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010380488.0A Active CN111985672B (en) 2020-05-08 2020-05-08 Single-piece job shop scheduling method for multi-Agent deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN111985672B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112150088A (en) * 2020-11-26 2020-12-29 深圳市万邑通信息科技有限公司 Huff-puff flexible intelligent assembly logistics path planning method and system
CN112598309A (en) * 2020-12-29 2021-04-02 浙江工业大学 Job shop scheduling method based on Keras
CN112700099A (en) * 2020-12-24 2021-04-23 亿景智联(北京)科技有限公司 Resource scheduling planning method based on reinforcement learning and operation research
CN112884239A (en) * 2021-03-12 2021-06-01 重庆大学 Aerospace detonator production scheduling method based on deep reinforcement learning
CN113093673A (en) * 2021-03-31 2021-07-09 南京大学 Method for optimizing workshop operation schedule by using mean field action value learning
CN113222253A (en) * 2021-05-13 2021-08-06 珠海埃克斯智能科技有限公司 Scheduling optimization method, device and equipment and computer readable storage medium
CN113361915A (en) * 2021-06-04 2021-09-07 聪明工厂有限公司 Flexible job shop scheduling method based on deep reinforcement learning and multi-agent graph

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102571570A (en) * 2011-12-27 2012-07-11 广东电网公司电力科学研究院 Network flow load balancing control method based on reinforcement learning
CN103248693A (en) * 2013-05-03 2013-08-14 东南大学 Large-scale self-adaptive composite service optimization method based on multi-agent reinforced learning
CN108282587A (en) * 2018-01-19 2018-07-13 重庆邮电大学 Mobile customer service dialogue management method under being oriented to strategy based on status tracking
CN108573303A (en) * 2018-04-25 2018-09-25 北京航空航天大学 It is a kind of that recovery policy is improved based on the complex network local failure for improving intensified learning certainly
CN110084375A (en) * 2019-04-26 2019-08-02 东南大学 A kind of hierarchy division frame based on deeply study
CN110648049A (en) * 2019-08-21 2020-01-03 北京大学 Multi-agent-based resource allocation method and system
CN110691422A (en) * 2019-10-06 2020-01-14 湖北工业大学 Multi-channel intelligent access method based on deep reinforcement learning
CN110991972A (en) * 2019-12-14 2020-04-10 中国科学院深圳先进技术研究院 Cargo transportation system based on multi-agent reinforcement learning

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102571570A (en) * 2011-12-27 2012-07-11 广东电网公司电力科学研究院 Network flow load balancing control method based on reinforcement learning
CN103248693A (en) * 2013-05-03 2013-08-14 东南大学 Large-scale self-adaptive composite service optimization method based on multi-agent reinforced learning
CN108282587A (en) * 2018-01-19 2018-07-13 重庆邮电大学 Mobile customer service dialogue management method under being oriented to strategy based on status tracking
CN108573303A (en) * 2018-04-25 2018-09-25 北京航空航天大学 It is a kind of that recovery policy is improved based on the complex network local failure for improving intensified learning certainly
CN110084375A (en) * 2019-04-26 2019-08-02 东南大学 A kind of hierarchy division frame based on deeply study
CN110648049A (en) * 2019-08-21 2020-01-03 北京大学 Multi-agent-based resource allocation method and system
CN110691422A (en) * 2019-10-06 2020-01-14 湖北工业大学 Multi-channel intelligent access method based on deep reinforcement learning
CN110991972A (en) * 2019-12-14 2020-04-10 中国科学院深圳先进技术研究院 Cargo transportation system based on multi-agent reinforcement learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
吉靖 (Ji Jing): "Modeling and Optimization of Shop Scheduling under Local Perception" (局部感知情形下的车间调度建模与优化), China Master's Theses Full-text Database, Economics and Management Sciences *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112150088A (en) * 2020-11-26 2020-12-29 深圳市万邑通信息科技有限公司 Huff-puff flexible intelligent assembly logistics path planning method and system
CN112700099A (en) * 2020-12-24 2021-04-23 亿景智联(北京)科技有限公司 Resource scheduling planning method based on reinforcement learning and operation research
CN112598309A (en) * 2020-12-29 2021-04-02 浙江工业大学 Job shop scheduling method based on Keras
CN112598309B (en) * 2020-12-29 2022-04-19 浙江工业大学 Job shop scheduling method based on Keras
CN112884239A (en) * 2021-03-12 2021-06-01 重庆大学 Aerospace detonator production scheduling method based on deep reinforcement learning
CN112884239B (en) * 2021-03-12 2023-12-19 重庆大学 Space detonator production scheduling method based on deep reinforcement learning
CN113093673A (en) * 2021-03-31 2021-07-09 南京大学 Method for optimizing workshop operation schedule by using mean field action value learning
CN113222253A (en) * 2021-05-13 2021-08-06 珠海埃克斯智能科技有限公司 Scheduling optimization method, device and equipment and computer readable storage medium
CN113222253B (en) * 2021-05-13 2022-09-30 珠海埃克斯智能科技有限公司 Scheduling optimization method, device, equipment and computer readable storage medium
CN113361915A (en) * 2021-06-04 2021-09-07 聪明工厂有限公司 Flexible job shop scheduling method based on deep reinforcement learning and multi-agent graph

Also Published As

Publication number Publication date
CN111985672B (en) 2021-08-27

Similar Documents

Publication Publication Date Title
CN111985672B (en) Single-piece job shop scheduling method for multi-Agent deep reinforcement learning
CN104268722B (en) Dynamic flexible job-shop scheduling method based on multi-objective Evolutionary Algorithm
CN108846570B (en) Method for solving resource-limited project scheduling problem
CN113792924A (en) Single-piece job shop scheduling method based on Deep reinforcement learning of Deep Q-network
CN113011612B (en) Production and maintenance scheduling method and system based on improved wolf algorithm
Ueda et al. An emergent synthesis approach to simultaneous process planning and scheduling
CN115454005A (en) Manufacturing workshop dynamic intelligent scheduling method and device oriented to limited transportation resource scene
CN115130789A (en) Distributed manufacturing intelligent scheduling method based on improved wolf optimization algorithm
CN112348314A (en) Distributed flexible workshop scheduling method and system with crane
Cao et al. An adaptive multi-strategy artificial bee colony algorithm for integrated process planning and scheduling
Xue et al. Estimation of distribution evolution memetic algorithm for the unrelated parallel-machine green scheduling problem
Bekker Applying the cross-entropy method in multi-objective optimisation of dynamic stochastic systems
CN113139747A (en) Method for reordering coating of work returning vehicle based on deep reinforcement learning
CN113406939A (en) Unrelated parallel machine dynamic hybrid flow shop scheduling method based on deep Q network
Li et al. An improved whale optimisation algorithm for distributed assembly flow shop with crane transportation
CN112488543B (en) Intelligent work site intelligent scheduling method and system based on machine learning
Yan et al. A job shop scheduling approach based on simulation optimization
Kim Permutation-based elitist genetic algorithm using serial scheme for large-sized resource-constrained project scheduling
Nugraheni et al. Hybrid Metaheuristics for Job Shop Scheduling Problems.
Han et al. Research on optimization method of routing buffer linkage based on Q-learning
CN112734286B (en) Workshop scheduling method based on multi-strategy deep reinforcement learning
Fujii et al. Integration of process planning and scheduling using multi-agent learning
CN117950379A (en) Intelligent workshop real-time rescheduling method based on deep circulation Q network
CN117215275B (en) Large-scale dynamic double-effect scheduling method for flexible workshop based on genetic programming
CN116384602A (en) Multi-target vehicle path optimization method, system, electronic equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant