CN113988443A - Automatic wharf cooperative scheduling method based on deep reinforcement learning - Google Patents

Automatic wharf cooperative scheduling method based on deep reinforcement learning Download PDF

Info

Publication number
CN113988443A
CN113988443A (application number CN202111299059.1A)
Authority
CN
China
Prior art keywords
time
network
stage
bridge
scheduling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111299059.1A
Other languages
Chinese (zh)
Other versions
CN113988443B (en)
Inventor
张煜
尹星
田宏伟
郑倩倩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University of Technology WUT
Original Assignee
Wuhan University of Technology WUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University of Technology WUT filed Critical Wuhan University of Technology WUT
Priority to CN202111299059.1A priority Critical patent/CN113988443B/en
Publication of CN113988443A publication Critical patent/CN113988443A/en
Application granted granted Critical
Publication of CN113988443B publication Critical patent/CN113988443B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/048 Activation functions
    • G06N 3/08 Learning methods
    • G06N 7/00 Computing arrangements based on specific mathematical models
    • G06N 7/01 Probabilistic graphical models, e.g. probabilistic networks
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 10/00 Administration; Management
    • G06Q 10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G06Q 10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q 10/063 Operations research, analysis or management
    • G06Q 10/0631 Resource planning, allocation, distributing or scheduling for enterprises or organisations
    • G06Q 10/06313 Resource planning in a project environment
    • G06Q 10/06316 Sequencing of tasks or work
    • G06Q 10/08 Logistics, e.g. warehousing, loading or distribution; Inventory or stock management

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Economics (AREA)
  • Strategic Management (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Mathematical Physics (AREA)
  • Tourism & Hospitality (AREA)
  • Quality & Reliability (AREA)
  • General Business, Economics & Management (AREA)
  • Operations Research (AREA)
  • Marketing (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Development Economics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Game Theory and Decision Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Educational Administration (AREA)
  • Algebra (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computational Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses an automatic wharf cooperative scheduling method based on deep reinforcement learning, aimed at the field of automated port scheduling, where traditional exact and approximate algorithms can neither solve large-scale scheduling problems quickly nor adapt well to dynamic production environments. The invention treats each device in the container loading and unloading process as an agent, thereby avoiding the need to re-plan the schedule when a machine fails and making scheduling more flexible.

Description

Automatic wharf cooperative scheduling method based on deep reinforcement learning
Technical Field
The invention relates to the field of port scheduling, and in particular to an automatic wharf cooperative scheduling method based on deep reinforcement learning.
Background
With the rapid development of port trade in recent years, automated container terminals have gradually become important transportation hubs. Shore bridges, ARTs and field bridges work together to complete container loading and unloading tasks, and the handling efficiency of an automated container terminal is mainly determined by the joint scheduling of these 3 types of equipment, so research on optimizing their coordinated scheduling is of great significance.
The operation process of a container among the shore bridge, the ART and the field bridge can be regarded as a hybrid flow shop scheduling problem. Flow shop scheduling is a typical NP-hard problem, and the traditional solution methods comprise exact algorithms and approximate algorithms, the latter divided into heuristic and meta-heuristic algorithms. Traditional solution methods have many limitations: exact algorithms are only suitable for solving small-scale problems and have poor practicality; heuristic and meta-heuristic methods can obtain a near-optimal solution in a short time, but the resulting schedule targets a static production environment and cannot adapt well to a real dynamic production environment with emergencies such as machine failures.
In recent years, the rise of machine learning and neural networks has brought new ideas for solving the hybrid flow shop scheduling problem. At present, the most widely used reinforcement learning algorithm for shop scheduling is Q-Learning; however, Q-Learning is a tabular method that needs a large amount of space to store the value function, and for large-scale shop scheduling problems it carries the hidden danger of dimension explosion.
Disclosure of Invention
The technical problem to be solved by the invention is to provide an automatic wharf cooperative scheduling method based on deep reinforcement learning aiming at the defects in the prior art.
The technical scheme adopted by the invention for solving the technical problems is as follows:
the invention provides an automatic wharf cooperative scheduling method based on deep reinforcement learning, which comprises the following steps:
step 1, modeling the cooperative operation process of the automated container terminal shore bridge, ART and field bridge as a three-stage hybrid flow shop scheduling model with unrelated parallel machines;
step 2, modeling the mixed flow shop scheduling model into a multi-agent Markov decision process;
step 3, initializing the experience pool D and its capacity N, initializing the critic network parameters θ_c and the actor network parameters θ_a, initializing the environment to obtain an initial state s_0, and initializing the action space;
step 4, according to the ε-greedy strategy, selecting an action a_t (randomly with probability ε), executing the corresponding scheduling operation to obtain the current state s_t, updating the corresponding state matrix, observing the real-time reward r_t, and perceiving the next state s_{t+1};
step 5, storing the tuple (s_t, a_t, r_t, s_{t+1}) into the experience pool D as the data set for training the current networks;
step 6, training the critic current network and the actor current network, updating the current-network parameters θ_c and θ_a, and updating the parameters θ_c' and θ_a' of the critic target network and the actor target network by the soft method, until the set number of iterations is reached;
step 7, after all the operations are finished, calculating the maximum completion time C_max and generating the optimal scheduling plan.
Further, the hybrid flow shop scheduling model in step 1 of the present invention specifically is:
the container is regarded as a workpiece, and the processing operation sequentially comprises 3 stages: the first stage is a wharf frontier stage, and the parallel machine is a shore bridge; the second stage is a horizontal transportation stage, and the parallel machine is ART; the third stage is a storage yard unloading stage, and the parallel machines are field bridges.
Further, the Markov decision process in step 2 of the present invention specifically includes:
the intelligent agent: each device is taken as an intelligent agent, the devices comprise a shore bridge, an ART and a field bridge, and m devices correspond to m intelligent agents;
state space: representing the state characteristics as three matrixes, namely a processing time matrix T consisting of the processing time of each procedure, an equipment operation matrix M consisting of the current handling equipment operation state and an operation completion matrix consisting of each operation completion state;
an action space: the action space is divided into 2 categories according to stage characteristics; the first category is the heuristic Johnson rule, corresponding to the parallel machines in the shore-bridge unloading stage and the yard unloading stage; the second category corresponds to the ART transportation stage and comprises 5 heuristic priority rules: FIFO (first come, first processed), SPT (preferentially selecting the workpiece with the shortest processing time), LWKR (preferentially selecting the workpiece with the least remaining processing time), SPT/TWK (preferentially selecting the workpiece with the smallest ratio of operation time to total processing time) and SRM (preferentially selecting the workpiece with the least remaining processing time excluding the current operation);
the reward function: the scheduling objective is to minimize the maximum completion time, and since the production cycle is the sum of the processing times, the instant reward r is defined as r_k = λt_p, wherein λ is a constant in [0,1] and t_p is the processing time of each machine; the long-distance reward is set as a function of γ, C_opt and C_max (the formula is rendered as an image in the original document), wherein γ is a number in [0,1], C_opt is the optimal scheduling result, and C_max is the predicted maximum completion time.
Further, the three-stage hybrid flow shop scheduling model with unrelated parallel machines in step 1 of the present invention is specifically:
step 11, simplifying the actual scene and making the following assumptions for the model: (1) operations of the shore bridge and the field bridge are performed in units of bays, and a shore bridge/field bridge cannot move to the next bay before unloading of the current bay is completed; (2) the preparation time of the equipment at each stage, including shore bridge/field bridge moving time and ART turning and avoidance time, is significant, so the equipment preparation time is considered when calculating the total completion time; (3) equipment failures and container rehandling time are not considered;
step 12, the mathematical model is as follows:
min C_max  (1)

s.t. constraints (2)-(13) (the constraint formulas are rendered as images in the original document);
wherein formula (1) is the objective function, minimizing the maximum completion time; formula (2) states that each bay is served by exactly one device at each stage; formula (3) guarantees that each bay being serviced has exactly one immediate predecessor and one immediate successor; formula (4) states that jobs in stage 1 are constrained by the priority set Φ, job i completing after job j; formula (5) gives the time at which bay j starts to be serviced in stages 2 and 3; formula (6) gives the start time of each task in stage 1; formula (7) gives the end time of each task in stage 1; formula (8) gives the earliest time each shore bridge can start working; formula (9) states that any bay must be processed by the previous-stage equipment before the next-stage processing can begin; formula (10) states that the time a bay starts to be serviced is less than or equal to the time it finishes being serviced; formulas (11)-(13) give the value ranges of the three decision variables: the binary assignment variable, s_ik and e_ik;
the model parameters are as follows:
i/j denotes the bay/container serial number; k denotes the stage serial number; M_k denotes the device serial number at stage k; m_k denotes the total number of devices at stage k; n denotes the total number of containers; n_i denotes the number of containers in each bay; f_mk denotes the earliest time at which the device can start operating at stage k; p_ik denotes the operation time of bay i at stage k; the setup-time parameter (rendered as an image in the original) denotes the preparation time of device m in two adjacent stages; a_i denotes the first container number in bay i; b_i denotes the last container number in bay i; s_ik denotes the start time of job i at stage k; e_ik denotes the time job i is completed at stage k; the binary assignment variable (rendered as an image in the original) equals 1 when container i/j is operated by device m at stage k and 0 otherwise; Ω_k denotes the set of bay numbers contained in stage k; Φ denotes the set of precedence relationships under which bays are served; and N denotes a sufficiently large positive number.
Further, the Johnson rule of the present invention specifically is:
the task with long operation time of the bridge and short operation time of the shore bridge is preferably selected, so that ART can transport containers to the bridge as soon as possible, and meanwhile, the bridge preferentially processes the containers with longer operation time, so that the idle time of the bridge is shortest.
Further, the critic network of the present invention adopts a CNN architecture comprising an input layer, a convolutional layer and a fully connected layer, wherein:
the input layer: the number of input channels is 3, corresponding to the processing time matrix T, the machine operation matrix M and the job completion matrix J, and the convolution kernel size is 1×1;
the convolutional layer: the number of nodes is set according to the size of the input state matrix, and a ReLU activation function is used between the convolutional layer and the output layer;
the output layer: the output layer has only one node, which outputs the value.
Further, the actor network of the present invention adopts the same CNN architecture as the critic network, except that its output is a specific action.
Further, the specific formula of the ε-greedy strategy in step 4 is:

a_t = argmax_a Q(s_t, a) with probability 1 - ε; a uniformly random action with probability ε

wherein the action with the largest current Q value is selected with probability 1 - ε, and an action is selected uniformly at random among all actions with the fixed probability ε.
Further, in the critic network adopting the CNN neural network architecture of the present invention:
the parameter update rule of the neural network CNN is based on a loss function, which for the critic network is:

L_c(θ_c) = (r + γ max_a' Q(s', a', θ_c) - Q(s, a, θ_c))^2

wherein θ_c are the parameters of the critic network, Q(s, a, θ_c) is the estimated value, and r + γ max_a' Q(s', a', θ_c) is the target value;
in addition, the loss function of the actor network is:

L_a(θ_a) = -Σ log π(a|s, θ_a) · L_c(θ_c)

wherein θ_a are the parameters of the actor network; the smaller the probability of taking an action a that obtains a large return, the larger the value of L_a(θ_a) becomes.
Further, the specific method in step 4 of the present invention is:
in order to improve the stability of the algorithm, a method of reducing the correlation between the current Q value and the target Q value is adopted: the actor network is divided into an actor current network and an actor target network, and the critic network is divided into a critic current network and a critic target network; wherein the actor current-network loss function is adjusted to:

L_a(θ_a) = -Σ log π(a|s, θ_a) · L_c(θ_c)

and the critic current-network loss function is adjusted to compute its target value with the target network:

L_c(θ_c) = (r + γ max_a' Q(s', a', θ_c') - Q(s, a, θ_c))^2

the parameters of the actor target network and the critic target network are held fixed for a period of time and, after a set number of n steps, are updated from the current-network parameters according to:

θ_c' ← τ·θ_c + (1 - τ)·θ_c'
θ_a' ← τ·θ_a + (1 - τ)·θ_a'

wherein τ is an update coefficient whose value is smaller than a certain threshold; this way of updating the target networks is called a soft update.
The invention has the following beneficial effects: the automatic wharf cooperative scheduling method based on deep reinforcement learning models the cooperative operation process of the automated container terminal shore bridge, ART and field bridge as a three-stage hybrid flow shop scheduling problem with unrelated parallel machines, and treats each device as an agent, so that the situation in which the scheduling plan must be re-planned when equipment fails can be avoided and scheduling becomes more flexible; the method uses a neural network for function approximation, solves the problem of insufficient storage space in a multi-agent environment through the experience replay technique, effectively eliminates the correlation between adjacent states s_t and s_{t+1}, improves the agents' update efficiency, and reduces the number of iterations.
Drawings
The invention will be further described with reference to the accompanying drawings and examples, in which:
fig. 1 is a flowchart of the automated wharf cooperative scheduling method based on deep reinforcement learning according to an embodiment of the present invention.
FIG. 2 is a schematic diagram of a neural network according to an embodiment of the present invention.
FIG. 3 is a schematic diagram of the critic current network, actor current network, critic target network and actor target network according to an embodiment of the invention.
FIG. 4 is a schematic diagram of the parameter updating method of the critic current network, actor current network, critic target network and actor target network according to an embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Fig. 1 shows a flowchart of an automated wharf co-scheduling method based on deep reinforcement learning, which specifically includes the following steps:
step 1, as shown in fig. 2, modeling the cooperative operation process of the automatic container wharf shore bridge, the ART and the field bridge into a three-stage hybrid flow shop scheduling model with unrelated parallel machines:
the container is regarded as a workpiece, and the processing operation sequentially comprises 3 stages: the first stage is a wharf frontier stage, and the parallel machine is a shore bridge; the second stage is a horizontal transportation stage, and the parallel machine is ART (Intelligent Robot of transportation); the third stage is a storage yard unloading stage, and the parallel machines are field bridges.
Step 2, modeling the scheduling problem of the mixed flow shop into a multi-agent Markov decision process:
the intelligent agent: each device (a shore bridge, an ART and a field bridge) is used as an agent, and m devices correspond to m agents in total;
state space: representing the state characteristics as three matrixes, namely a processing time matrix T consisting of the processing time of each procedure, an equipment operation matrix M consisting of the current handling equipment operation state and an operation completion matrix consisting of each operation completion state;
an action space: the action space is divided into 2 categories according to stage characteristics; the first category is the heuristic Johnson rule, corresponding to the parallel machines in the shore-bridge unloading stage and the yard unloading stage; the second category corresponds to the ART transportation stage and consists of 5 heuristic priority rules, namely FIFO (first come, first processed), SPT (preferentially selecting the workpiece with the shortest processing time), LWKR (preferentially selecting the workpiece with the least remaining processing time), SPT/TWK (preferentially selecting the workpiece with the smallest ratio of operation time to total processing time) and SRM (preferentially selecting the workpiece with the least remaining processing time excluding the current operation);
the reward function: the scheduling objective is to minimize the maximum completion time, and since the production cycle is the sum of the processing times, the instant reward r is defined as r_k = λt_p, wherein λ is a constant in [0,1] and t_p is the processing time of each machine; the long-distance reward is set as a function of γ, C_opt and C_max (the formula is rendered as an image in the original document), wherein γ is a number in [0,1], C_opt is the optimal scheduling result, and C_max is the predicted maximum completion time. An illustrative sketch of these rules and the instant reward follows.
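By way of illustration only (this sketch is not part of the original disclosure), the five priority rules and the instant reward could be coded as follows in Python; the job field names ("arrival", "proc_time", "remaining_time", "total_time") and the value λ = 0.5 are assumptions:

    # Illustrative sketch of the ART action space: each action is a priority rule
    # that picks the next container job from a queue of dicts (field names assumed).
    def fifo(queue):      # FIFO: first come, first processed
        return min(queue, key=lambda j: j["arrival"])

    def spt(queue):       # SPT: shortest processing time of the current operation
        return min(queue, key=lambda j: j["proc_time"])

    def lwkr(queue):      # LWKR: least remaining processing time
        return min(queue, key=lambda j: j["remaining_time"])

    def spt_twk(queue):   # SPT/TWK: smallest ratio of operation time to total time
        return min(queue, key=lambda j: j["proc_time"] / j["total_time"])

    def srm(queue):       # SRM: least remaining time excluding the current operation
        return min(queue, key=lambda j: j["remaining_time"] - j["proc_time"])

    ACTIONS = [fifo, spt, lwkr, spt_twk, srm]   # the 5-rule action space of an ART agent

    def instant_reward(t_p, lam=0.5):
        # r_k = lam * t_p, with lam a constant in [0, 1] (0.5 assumed here)
        return lam * t_p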
Step 3, initializing the experience pool D and its capacity N, initializing the critic network parameters θ_c and the actor network parameters θ_a, initializing the environment to obtain an initial state s_0, and initializing the action space;
step 4, according to the ε-greedy strategy, selecting an action a_t (randomly with probability ε), executing the corresponding scheduling operation to obtain the current state s_t, updating the corresponding state matrix, observing the real-time reward r_t, and perceiving the next state s_{t+1};
step 5, storing the tuple (s_t, a_t, r_t, s_{t+1}) into the experience pool D as the data set for training the current networks;
step 6, training the critic current network and the actor current network, updating the current-network parameters θ_c and θ_a, and updating the parameters θ_c' and θ_a' of the critic target network and the actor target network by the soft method, until the set number of iterations is reached;
step 7, after all the operations are finished, the environment calculates the maximum completion time C_max.
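For illustration (not part of the original disclosure), steps 3-7 can be summarized in the following Python sketch; env, select_action, update_networks and soft_update_targets are placeholder interfaces standing in for the terminal simulation and the actor-critic code sketched later in this description:

    import random
    from collections import deque

    # Minimal sketch of the training loop in steps 3-7 (interfaces assumed).
    def train(env, select_action, update_networks, soft_update_targets,
              episodes=100, capacity=10000, batch_size=32, update_every=10):
        memory = deque(maxlen=capacity)          # experience pool D with capacity N
        for ep in range(episodes):
            state = env.reset()                  # initial state s0
            done = False
            while not done:
                action = select_action(state)    # epsilon-greedy choice (step 4)
                next_state, reward, done = env.step(action)
                memory.append((state, action, reward, next_state))   # step 5
                if len(memory) >= batch_size:
                    batch = random.sample(memory, batch_size)  # decorrelates s_t, s_{t+1}
                    update_networks(batch)       # step 6: train the current networks
                state = next_state
            if ep % update_every == 0:
                soft_update_targets()            # soft update of the target networks
        return env.makespan()                    # step 7: maximum completion time C_max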
The three-stage hybrid flow shop scheduling mathematical model with unrelated parallel machines is specifically as follows:
simplifying the actual scene, the following assumptions are made for the model: (1) operations of the shore bridge and the field bridge are performed in units of bays, and a shore bridge/field bridge cannot move to the next bay before unloading of the current bay is completed; (2) the preparation time of the equipment at each stage, such as shore bridge/field bridge moving time and ART turning and avoidance time, is significant, so the equipment preparation time is considered when calculating the total completion time; (3) equipment failures are not considered, and container rehandling time is not considered.
The mathematical model is as follows:
min C_max  (1)

s.t. constraints (2)-(13) (the constraint formulas are rendered as images in the original document);
wherein formula (1) is the objective function, minimizing the maximum completion time; formula (2) states that each bay is served by exactly one device at each stage; formula (3) guarantees that each bay being serviced has exactly one immediate predecessor and one immediate successor; formula (4) states that jobs in stage 1 are constrained by the priority set Φ, job i completing after job j; formula (5) gives the time at which bay j starts to be serviced in stages 2 and 3; formula (6) gives the start time of each task in stage 1; formula (7) gives the end time of each task in stage 1; formula (8) gives the earliest time each shore bridge can start working; formula (9) states that any bay must be processed by the previous-stage equipment before the next-stage processing can begin; formula (10) states that the time a bay starts to be serviced is less than or equal to the time it finishes being serviced; formulas (11)-(13) give the value ranges of the three decision variables: the binary assignment variable, s_ik and e_ik;
The model parameters are as defined in the summary above (the parameter table is rendered as an image in the original document).
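For illustration (not part of the original disclosure), the skeleton of such a model, restricted to the objective (1) and the assignment constraint (2) described above, could be written with the PuLP library; the instance sizes are assumptions:

    from pulp import LpProblem, LpMinimize, LpVariable, lpSum, LpBinary

    # Illustrative fragment only: objective (1) and constraint (2); the full
    # precedence and timing constraints (3)-(13) are omitted here.
    bays, stages, devices = range(4), range(3), range(2)   # assumed sizes
    prob = LpProblem("terminal_scheduling", LpMinimize)
    c_max = LpVariable("C_max", lowBound=0)
    x = {(i, k, m): LpVariable(f"x_{i}_{k}_{m}", cat=LpBinary)
         for i in bays for k in stages for m in devices}
    e = {(i, k): LpVariable(f"e_{i}_{k}", lowBound=0)
         for i in bays for k in stages}
    prob += c_max                                          # (1) minimize makespan
    for i in bays:
        for k in stages:
            prob += lpSum(x[i, k, m] for m in devices) == 1  # (2) one device per bay per stage
            prob += e[i, k] <= c_max                       # completion times bound C_max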
the Johnson rule aims to preferentially select tasks with long operation time of a bridge and short operation time of a shore bridge, so that ART can transport containers to the bridge as soon as possible, and meanwhile, the bridge preferentially processes the containers with longer operation time, so that the idle time of the bridge is shortest.
The critic network adopts a CNN architecture, shown in fig. 3, consisting of an input layer, a convolutional layer and a fully connected layer, wherein:
the input layer: the number of input channels is 3, corresponding to the processing time matrix T, the machine operation matrix M and the job completion matrix J, with kernel_size 1×1;
the convolutional layer: the number of nodes is set according to the size of the input state matrix, and a ReLU activation function is used between the convolutional layer and the output layer;
the output layer: the output layer has only one node, which outputs the value.
The structure of the actor network is also a typical CNN architecture, the same as that of the critic network, except that the output is a specific action.
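A minimal PyTorch sketch of this critic/actor pair is given below for illustration (not part of the original disclosure); the state-matrix size n_rows × n_cols and the hidden width are free parameters assumed here, and the actor outputs a probability over the rule set from which a concrete action is drawn:

    import torch
    import torch.nn as nn

    class CriticNet(nn.Module):
        # Input: (batch, 3, n_rows, n_cols) stacking the T, M and J matrices.
        def __init__(self, n_rows, n_cols, hidden=32):
            super().__init__()
            self.conv = nn.Conv2d(3, hidden, kernel_size=1)   # 1x1 convolution
            self.fc = nn.Linear(hidden * n_rows * n_cols, 1)  # single value node
        def forward(self, x):
            return self.fc(torch.relu(self.conv(x)).flatten(1))

    class ActorNet(nn.Module):
        # Same backbone as the critic; the head outputs action probabilities.
        def __init__(self, n_rows, n_cols, n_actions, hidden=32):
            super().__init__()
            self.conv = nn.Conv2d(3, hidden, kernel_size=1)
            self.fc = nn.Linear(hidden * n_rows * n_cols, n_actions)
        def forward(self, x):
            return torch.softmax(self.fc(torch.relu(self.conv(x)).flatten(1)), dim=-1)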
The agent's action selection adopts the ε-greedy strategy, expressed as:

a_t = argmax_a Q(s_t, a) with probability 1 - ε; a uniformly random action with probability ε

This strategy means that the action with the largest current Q value is selected with probability 1 - ε, and an action is selected uniformly at random among all actions with the fixed probability ε.
The parameter update rule of the neural network CNN is based on a loss function. The loss function of the critic network is:

L_c(θ_c) = (r + γ max_a' Q(s', a', θ_c) - Q(s, a, θ_c))^2

where θ_c are the parameters of the critic network, Q(s, a, θ_c) is the estimated value, and r + γ max_a' Q(s', a', θ_c) is the target value.
In addition, the loss function of the actor network is:

L_a(θ_a) = -Σ log π(a|s, θ_a) · L_c(θ_c)

where θ_a are the parameters of the actor network; the smaller the probability of taking an action a that obtains a large return, the larger the value of L_a(θ_a) becomes.
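These two losses admit the following illustrative sketch (not part of the original disclosure); the batched tensors q_sa, reward, q_next_max and log_pi_a are assumed inputs, and the detach() calls reflect the usual practice of not back-propagating through the target term:

    import torch

    def critic_loss(q_sa, reward, q_next_max, gamma=0.99):
        # Regress Q(s, a) toward the target r + gamma * max_a' Q(s', a').
        target = reward + gamma * q_next_max
        return ((target.detach() - q_sa) ** 2).mean()

    def actor_loss(log_pi_a, l_c):
        # -sum log pi(a|s) * L_c: low-probability actions with a large critic
        # loss increase L_a, pushing the policy toward high-return actions.
        return -(log_pi_a * l_c.detach()).sum()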
In order to improve the stability of the algorithm, a method of reducing the correlation between the current Q value and the target Q value is adopted: as shown in fig. 4, the actor network is divided into an actor current network and an actor target network, and the critic network is divided into a critic current network and a critic target network. The actor current-network loss function is adjusted to:

L_a(θ_a) = -Σ log π(a|s, θ_a) · L_c(θ_c)

and the critic current-network loss function is adjusted to compute its target value with the target network:

L_c(θ_c) = (r + γ max_a' Q(s', a', θ_c') - Q(s, a, θ_c))^2

The parameters of the actor target network and the critic target network are held fixed for a period of time and, after a set number of n steps, are updated from the current-network parameters according to:

θ_c' ← τ·θ_c + (1 - τ)·θ_c'
θ_a' ← τ·θ_a + (1 - τ)·θ_a'

where τ is the update coefficient, generally taking a small value. This way of updating the target-network parameters is called a soft update.
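The soft update itself is a one-line parameter blend, sketched below for illustration (not part of the original disclosure; the networks are assumed to be torch.nn modules):

    # theta' <- tau * theta + (1 - tau) * theta', applied parameter by parameter.
    def soft_update(target_net, current_net, tau=0.01):
        for tgt, cur in zip(target_net.parameters(), current_net.parameters()):
            tgt.data.copy_(tau * cur.data + (1.0 - tau) * tgt.data)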
Finally, it should be noted that the above-mentioned embodiments are only intended to illustrate and explain the present invention, and are not intended to limit the present invention within the scope of the described embodiments.
Furthermore, it will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, and that many variations and modifications may be made in accordance with the teachings of the present invention, all of which fall within the scope of the invention as claimed.

Claims (10)

1. An automatic wharf cooperative scheduling method based on deep reinforcement learning is characterized by comprising the following steps:
step 1, modeling the cooperative operation process of the automated container terminal shore bridge, ART and field bridge as a three-stage hybrid flow shop scheduling model with unrelated parallel machines;
step 2, modeling the mixed flow shop scheduling model into a multi-agent Markov decision process;
step 3, initializing the experience pool D and its capacity N, initializing the critic network parameters θ_c and the actor network parameters θ_a, initializing the environment to obtain an initial state s_0, and initializing the action space;
step 4, according to the ε-greedy strategy, selecting an action a_t (randomly with probability ε), executing the corresponding scheduling operation to obtain the current state s_t, updating the corresponding state matrix, observing the real-time reward r_t, and perceiving the next state s_{t+1};
step 5, storing the tuple (s_t, a_t, r_t, s_{t+1}) into the experience pool D as the data set for training the current networks;
step 6, training the critic current network and the actor current network, updating the current-network parameters θ_c and θ_a, and updating the parameters θ_c' and θ_a' of the critic target network and the actor target network by the soft method, until the set number of iterations is reached;
step 7, after all the operations are finished, calculating the maximum completion time C_max and generating the optimal scheduling plan.
2. The automatic wharf collaborative scheduling method based on deep reinforcement learning of claim 1, wherein the hybrid flow shop scheduling model in the step 1 is specifically:
the container is regarded as a workpiece, and the processing operation sequentially comprises 3 stages: the first stage is a wharf frontier stage, and the parallel machine is a shore bridge; the second stage is a horizontal transportation stage, and the parallel machine is ART; the third stage is a storage yard unloading stage, and the parallel machines are field bridges.
3. The automated wharf collaborative scheduling method based on deep reinforcement learning of claim 1, wherein the Markov decision process in step 2 is specifically:
the intelligent agent: each device is taken as an intelligent agent, the devices comprise a shore bridge, an ART and a field bridge, and m devices correspond to m intelligent agents;
state space: representing the state characteristics as three matrixes, namely a processing time matrix T consisting of the processing time of each procedure, an equipment operation matrix M consisting of the current handling equipment operation state and an operation completion matrix consisting of each operation completion state;
an action space: the action space is divided into 2 categories according to stage characteristics; the first category is the heuristic Johnson rule, corresponding to the parallel machines in the shore-bridge unloading stage and the yard unloading stage; the second category corresponds to the ART transportation stage and comprises 5 heuristic priority rules: FIFO (first come, first processed), SPT (preferentially selecting the workpiece with the shortest processing time), LWKR (preferentially selecting the workpiece with the least remaining processing time), SPT/TWK (preferentially selecting the workpiece with the smallest ratio of operation time to total processing time) and SRM (preferentially selecting the workpiece with the least remaining processing time excluding the current operation);
the reward function: the scheduling objective is to minimize the maximum completion time, and since the production cycle is the sum of the processing times, the instant reward r is defined as r_k = λt_p, wherein λ is a constant in [0,1] and t_p is the processing time of each machine; the long-distance reward is set as a function of γ, C_opt and C_max (the formula is rendered as an image in the original document), wherein γ is a number in [0,1], C_opt is the optimal scheduling result, and C_max is the predicted maximum completion time.
4. The automated wharf co-scheduling method based on deep reinforcement learning according to claim 1, wherein the three-stage hybrid flow shop scheduling model with unrelated parallel machines in step 1 is specifically:
step 11, simplifying the actual scene and making the following assumptions for the model: (1) operations of the shore bridge and the field bridge are performed in units of bays, and a shore bridge/field bridge cannot move to the next bay before unloading of the current bay is completed; (2) the preparation time of the equipment at each stage, including shore bridge/field bridge moving time and ART turning and avoidance time, is significant, so the equipment preparation time is considered when calculating the total completion time; (3) equipment failures and container rehandling time are not considered;
step 12, the mathematical model is as follows:
min C_max  (1)

s.t. constraints (2)-(13) (the constraint formulas are rendered as images in the original document);
wherein formula (1) is the objective function, minimizing the maximum completion time; formula (2) states that each bay is served by exactly one device at each stage; formula (3) guarantees that each bay being serviced has exactly one immediate predecessor and one immediate successor; formula (4) states that jobs in stage 1 are constrained by the priority set Φ, job i completing after job j; formula (5) gives the time at which bay j starts to be serviced in stages 2 and 3; formula (6) gives the start time of each task in stage 1; formula (7) gives the end time of each task in stage 1; formula (8) gives the earliest time each shore bridge can start working; formula (9) states that any bay must be processed by the previous-stage equipment before the next-stage processing can begin; formula (10) states that the time a bay starts to be serviced is less than or equal to the time it finishes being serviced; formulas (11)-(13) give the value ranges of the three decision variables: the binary assignment variable, s_ik and e_ik;
the model parameters are as follows:
i/j denotes the bay/container serial number; k denotes the stage serial number; M_k denotes the device serial number at stage k; m_k denotes the total number of devices at stage k; n denotes the total number of containers; n_i denotes the number of containers in each bay; f_mk denotes the earliest time at which the device can start operating at stage k; p_ik denotes the operation time of bay i at stage k; the setup-time parameter (rendered as an image in the original) denotes the preparation time of device m in two adjacent stages; a_i denotes the first container number in bay i; b_i denotes the last container number in bay i; s_ik denotes the start time of job i at stage k; e_ik denotes the time job i is completed at stage k; the binary assignment variable (rendered as an image in the original) equals 1 when container i/j is operated by device m at stage k and 0 otherwise; Ω_k denotes the set of bay numbers contained in stage k; Φ denotes the set of precedence relationships under which bays are served; and N denotes a sufficiently large positive number.
5. The automated wharf collaborative scheduling method based on deep reinforcement learning of claim 3, wherein the Johnson rule is specifically as follows:
the task with long operation time of the bridge and short operation time of the shore bridge is preferably selected, so that ART can transport containers to the bridge as soon as possible, and meanwhile, the bridge preferentially processes the containers with longer operation time, so that the idle time of the bridge is shortest.
6. The automated wharf co-scheduling method based on deep reinforcement learning of claim 1, wherein the critic network adopts a CNN architecture comprising an input layer, a convolutional layer and a fully connected layer, wherein:
the input layer: the number of input channels is 3, corresponding to the processing time matrix T, the machine operation matrix M and the job completion matrix J, and the convolution kernel size is 1×1;
the convolutional layer: the number of nodes is set according to the size of the input state matrix, and a ReLU activation function is used between the convolutional layer and the output layer;
the output layer: the output layer has only one node, which outputs the value.
7. The automated wharf co-scheduling method based on deep reinforcement learning of claim 6, wherein the actor network adopts the same CNN architecture as the critic network, except that its output is a specific action.
8. The automatic wharf co-scheduling method based on deep reinforcement learning of claim 1, wherein the ε-greedy strategy in step 4 has the following specific formula:

a_t = argmax_a Q(s_t, a) with probability 1 - ε; a uniformly random action with probability ε

wherein the action with the largest current Q value is selected with probability 1 - ε, and an action is selected uniformly at random among all actions with the fixed probability ε.
9. The automated wharf co-scheduling method based on deep reinforcement learning of claim 6, wherein in the critic network adopting the CNN neural network architecture:
the parameter update rule of the neural network CNN is based on a loss function, which for the critic network is:

L_c(θ_c) = (r + γ max_a' Q(s', a', θ_c) - Q(s, a, θ_c))^2

wherein θ_c are the parameters of the critic network, Q(s, a, θ_c) is the estimated value, and r + γ max_a' Q(s', a', θ_c) is the target value;
in addition, the loss function of the actor network is:

L_a(θ_a) = -Σ log π(a|s, θ_a) · L_c(θ_c)

wherein θ_a are the parameters of the actor network; the smaller the probability of taking an action a that obtains a large return, the larger the value of L_a(θ_a) becomes.
10. The automatic wharf co-scheduling method based on deep reinforcement learning of claim 8, wherein the specific method in step 4 is as follows:
in order to improve the stability of the algorithm, a method of reducing the correlation between the current Q value and the target Q value is adopted: the actor network is divided into an actor current network and an actor target network, and the critic network is divided into a critic current network and a critic target network; wherein the actor current-network loss function is adjusted to:

L_a(θ_a) = -Σ log π(a|s, θ_a) · L_c(θ_c)

and the critic current-network loss function is adjusted to compute its target value with the target network:

L_c(θ_c) = (r + γ max_a' Q(s', a', θ_c') - Q(s, a, θ_c))^2

the parameters of the actor target network and the critic target network are held fixed for a period of time and, after a set number of n steps, are updated from the current-network parameters according to:

θ_c' ← τ·θ_c + (1 - τ)·θ_c'
θ_a' ← τ·θ_a + (1 - τ)·θ_a'

wherein τ is an update coefficient whose value is smaller than a certain threshold; this way of updating the target networks is called a soft update.
CN202111299059.1A 2021-11-04 2021-11-04 Automatic wharf collaborative scheduling method based on deep reinforcement learning Active CN113988443B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111299059.1A CN113988443B (en) 2021-11-04 2021-11-04 Automatic wharf collaborative scheduling method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111299059.1A CN113988443B (en) 2021-11-04 2021-11-04 Automatic wharf collaborative scheduling method based on deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN113988443A true CN113988443A (en) 2022-01-28
CN113988443B CN113988443B (en) 2024-06-28

Family

ID=79746379

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111299059.1A Active CN113988443B (en) 2021-11-04 2021-11-04 Automatic wharf collaborative scheduling method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN113988443B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115542849A (en) * 2022-08-22 2022-12-30 苏州诀智科技有限公司 Container wharf intelligent ship control and distribution method, system, storage medium and computer

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105740979A (en) * 2016-01-29 2016-07-06 上海海事大学 Intelligent dispatching system and method for multi-AGV (Automatic Guided Vehicle) of automatic container terminal
CN112434870A (en) * 2020-12-01 2021-03-02 大连理工大学 Dual-automation field bridge dynamic scheduling method for vertical arrangement of container areas

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105740979A (en) * 2016-01-29 2016-07-06 上海海事大学 Intelligent dispatching system and method for multi-AGV (Automatic Guided Vehicle) of automatic container terminal
CN112434870A (en) * 2020-12-01 2021-03-02 大连理工大学 Dual-automation field bridge dynamic scheduling method for vertical arrangement of container areas

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115542849A (en) * 2022-08-22 2022-12-30 苏州诀智科技有限公司 Container wharf intelligent ship control and distribution method, system, storage medium and computer
CN115542849B (en) * 2022-08-22 2023-12-05 苏州诀智科技有限公司 Container terminal intelligent ship control and dispatch method, system, storage medium and computer

Also Published As

Publication number Publication date
CN113988443B (en) 2024-06-28

Similar Documents

Publication Publication Date Title
CN112884239B (en) Space detonator production scheduling method based on deep reinforcement learning
Kim et al. A look-ahead dispatching method for automated guided vehicles in automated port container terminals
TWI663568B (en) Material scheduling method and system based on real-time status of semiconductor device
CN111882215B (en) Personalized customization flexible job shop scheduling method containing AGV
CN111160755B (en) Real-time scheduling method for aircraft overhaul workshop based on DQN
CN114611897B (en) Intelligent production line self-adaptive dynamic scheduling strategy selection method
CN112434870B (en) Dual-automation field bridge dynamic scheduling method for vertical arrangement of container areas
CN112836974B (en) Dynamic scheduling method for multiple field bridges between boxes based on DQN and MCTS
CN115454005A (en) Manufacturing workshop dynamic intelligent scheduling method and device oriented to limited transportation resource scene
CN113988443A (en) Automatic wharf cooperative scheduling method based on deep reinforcement learning
CN110554673B (en) Intelligent RGV processing system scheduling method and device
CN116500986A (en) Method and system for generating priority scheduling rule of distributed job shop
CN115793657B (en) Distribution robot path planning method based on temporal logic control strategy
CN111353646A (en) Steel-making flexible scheduling optimization method with switching time, system, medium and equipment
CN115689049A (en) Multi-target workshop scheduling method for improving gray wolf optimization algorithm
CN117314055A (en) Intelligent manufacturing workshop production-transportation joint scheduling method based on reinforcement learning
CN117331700B (en) Computing power network resource scheduling system and method
CN117891220A (en) Distributed mixed flow shop scheduling method based on multi-agent deep reinforcement learning
CN117196261B (en) Task instruction distribution method based on field bridge operation range
Kouvakas et al. A modular supervisory control scheme for the safety of an automated manufacturing system
CN113050644A (en) AGV (automatic guided vehicle) scheduling method based on iterative greedy evolution
CN112395690A (en) Reinforced learning-based shipboard aircraft surface guarantee flow optimization method
CN112053046B (en) Automatic container terminal AGV reentry and reentry path planning method with time window
Panov et al. Automatic formation of the structure of abstract machines in hierarchical reinforcement learning with state clustering
Liao et al. Learning to schedule job-shop problems via hierarchical reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant