CN113988443A - Automatic wharf cooperative scheduling method based on deep reinforcement learning - Google Patents
Automatic wharf cooperative scheduling method based on deep reinforcement learning
- Publication number
- CN113988443A CN113988443A CN202111299059.1A CN202111299059A CN113988443A CN 113988443 A CN113988443 A CN 113988443A CN 202111299059 A CN202111299059 A CN 202111299059A CN 113988443 A CN113988443 A CN 113988443A
- Authority
- CN
- China
- Prior art keywords
- time
- network
- stage
- bridge
- scheduling
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 238000000034 method Methods 0.000 title claims abstract description 59
- 230000002787 reinforcement Effects 0.000 title claims abstract description 20
- 230000008569 process Effects 0.000 claims abstract description 21
- 238000004519 manufacturing process Methods 0.000 claims abstract description 6
- 230000009471 action Effects 0.000 claims description 28
- 238000012545 processing Methods 0.000 claims description 25
- 230000006870 function Effects 0.000 claims description 24
- 239000011159 matrix material Substances 0.000 claims description 24
- 238000013527 convolutional neural network Methods 0.000 claims description 11
- 238000003754 machining Methods 0.000 claims description 10
- 238000013528 artificial neural network Methods 0.000 claims description 8
- 238000002360 preparation method Methods 0.000 claims description 8
- 235000015170 shellfish Nutrition 0.000 claims description 8
- 238000012549 training Methods 0.000 claims description 6
- 238000013178 mathematical model Methods 0.000 claims description 4
- 238000003860 storage Methods 0.000 claims description 4
- 230000004913 activation Effects 0.000 claims description 3
- 238000005096 rolling process Methods 0.000 claims description 3
- 238000012163 sequencing technique Methods 0.000 claims description 3
- 238000007514 turning Methods 0.000 claims description 3
- 230000007306 turnover Effects 0.000 claims description 3
- 238000010586 diagram Methods 0.000 description 3
- 230000009286 beneficial effect Effects 0.000 description 1
- 230000007547 defect Effects 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 238000005516 engineering process Methods 0.000 description 1
- 238000004880 explosion Methods 0.000 description 1
- 238000010801 machine learning Methods 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 238000005457 optimization Methods 0.000 description 1
- 238000011160 research Methods 0.000 description 1
- 230000003068 static effect Effects 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/04—Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N7/00—Computing arrangements based on specific mathematical models
- G06N7/01—Probabilistic graphical models, e.g. probabilistic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/06—Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
- G06Q10/063—Operations research, analysis or management
- G06Q10/0631—Resource planning, allocation, distributing or scheduling for enterprises or organisations
- G06Q10/06313—Resource planning in a project environment
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/06—Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
- G06Q10/063—Operations research, analysis or management
- G06Q10/0631—Resource planning, allocation, distributing or scheduling for enterprises or organisations
- G06Q10/06316—Sequencing of tasks or work
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/08—Logistics, e.g. warehousing, loading or distribution; Inventory or stock management
Landscapes
- Engineering & Computer Science (AREA)
- Business, Economics & Management (AREA)
- Human Resources & Organizations (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Economics (AREA)
- Strategic Management (AREA)
- Entrepreneurship & Innovation (AREA)
- Mathematical Physics (AREA)
- Tourism & Hospitality (AREA)
- Quality & Reliability (AREA)
- General Business, Economics & Management (AREA)
- Operations Research (AREA)
- Marketing (AREA)
- Software Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Development Economics (AREA)
- General Engineering & Computer Science (AREA)
- Computing Systems (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Molecular Biology (AREA)
- Computational Linguistics (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Health & Medical Sciences (AREA)
- Game Theory and Decision Science (AREA)
- General Health & Medical Sciences (AREA)
- Educational Administration (AREA)
- Algebra (AREA)
- Mathematical Analysis (AREA)
- Mathematical Optimization (AREA)
- Pure & Applied Mathematics (AREA)
- Computational Mathematics (AREA)
- Probability & Statistics with Applications (AREA)
- Biodiversity & Conservation Biology (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention discloses an automatic wharf cooperative scheduling method based on deep reinforcement learning for the field of automated port scheduling, where traditional exact and approximate algorithms cannot solve large-scale scheduling problems quickly and have difficulty adapting to dynamic production environments. The invention treats each device in the container loading and unloading process as an intelligent agent, thereby avoiding the need to re-plan the schedule when a machine fails and making the scheduling more flexible.
Description
Technical Field
The invention relates to the field of port scheduling, in particular to an automatic wharf collaborative scheduling method based on deep reinforcement learning.
Background
With the rapid development of port trade in recent years, the automated container terminal has gradually become an important transportation hub. The shore bridge, ART and field bridge cooperate to complete container loading and unloading tasks, and the loading and unloading efficiency of the automated container terminal is mainly determined by the joint scheduling of these three types of equipment, so research on optimizing their coordinated scheduling is of great significance.
The processing of a container through the shore bridge, the ART and the field bridge can be regarded as a hybrid flow shop scheduling problem, which is a typical NP-hard problem. The traditional solution methods for the flow shop scheduling problem are exact algorithms and approximate algorithms, where the approximate algorithms are further divided into heuristic and meta-heuristic algorithms. These traditional methods have many limitations: exact algorithms are only suitable for small-scale problems and have poor practicability; and although heuristic and meta-heuristic methods can find a near-optimal solution in a short time, the resulting schedule targets a static production environment and cannot adapt well to a real dynamic production environment with emergencies such as machine faults.
In recent years, the rise of machine learning and neural networks has brought new ideas for solving the hybrid flow shop scheduling problem. At present, the most widely used reinforcement learning algorithm for shop scheduling is Q-Learning, but Q-Learning is a tabular method that requires a large amount of space to store the value function, and for large-scale shop scheduling problems it carries the hidden risk of dimension explosion.
Disclosure of Invention
The technical problem to be solved by the invention is to provide an automatic wharf cooperative scheduling method based on deep reinforcement learning aiming at the defects in the prior art.
The technical scheme adopted by the invention for solving the technical problems is as follows:
the invention provides an automatic wharf cooperative scheduling method based on deep reinforcement learning, which comprises the following steps:
step 1, modeling the cooperative operation process of the shore bridge, ART and field bridge of the automated container terminal as a three-stage hybrid flow shop scheduling model with unrelated parallel machines;
step 2, modeling the hybrid flow shop scheduling model as a multi-agent Markov decision process;
step 3, initializing an experience pool D and its capacity N, initializing the critic network parameters θ_c and the actor network parameters θ_a, initializing the environment to obtain the initial state s_0, and initializing the action space;
step 4, according to the ε-greedy strategy, selecting an action a_t (randomly with probability ε), executing the corresponding scheduling operation in the current state s_t, updating the corresponding state matrix, observing the real-time reward r_t, and perceiving the next state s_{t+1};
step 5, storing the tuple (s_t, a_t, r_t, s_{t+1}) in the experience pool D as a data set for training the current networks;
step 6, training the critic current network and the actor current network, updating the current network parameters θ_c and θ_a, and updating the parameters of the critic target network and the actor target network by the soft update method, until the set number of iterations is reached;
step 7, after all the jobs are finished, calculating the maximum completion time C_max and generating the optimal scheduling plan.
Further, the hybrid flow shop scheduling model in step 1 of the present invention specifically is:
the container is regarded as a workpiece, and the processing operation sequentially comprises 3 stages: the first stage is a wharf frontier stage, and the parallel machine is a shore bridge; the second stage is a horizontal transportation stage, and the parallel machine is ART; the third stage is a storage yard unloading stage, and the parallel machines are field bridges.
Further, the markov decision process in step 2 of the present invention specifically includes:
the intelligent agent: each device is taken as an intelligent agent, the devices comprise a shore bridge, an ART and a field bridge, and m devices correspond to m intelligent agents;
state space: the state features are represented as three matrices, namely a processing time matrix T composed of the processing time of each operation, an equipment operation matrix M composed of the current operating states of the handling equipment, and a job completion matrix J composed of the completion state of each job;
an action space: the action space is divided into 2 categories according to stage characteristics, wherein the first category is the heuristic Johnson rule, corresponding to the parallel machines of the quayside crane unloading stage and the yard unloading stage; the second category corresponds to the ART transportation stage and comprises 5 heuristic priority rules: the first-in-first-out principle (FIFO), preferentially selecting the workpiece with the shortest processing time (SPT), preferentially selecting the workpiece with the least remaining processing time (LWKR), preferentially selecting the workpiece with the smallest ratio of operation time to total processing time (SPT/TWK), and preferentially selecting the workpiece with the least remaining processing time excluding the current operation (SRM);
the reward function: the scheduling objective is to minimize the maximum completion time, and since the production cycle is the sum of the processing times, the immediate reward is defined as r_k = λ·t_p, where λ is a constant in [0,1] and t_p is the processing time on each machine; the long-term reward is set as a function of γ, C_opt and C_max, where γ is a number in [0,1], C_opt is the optimal scheduling result, and C_max is the predicted maximum completion time.
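As a minimal illustrative sketch of the state and reward design described above (the array shapes, the λ value and the helper name immediate_reward are assumptions for illustration, not values taken from the patent):

```python
import numpy as np

n_jobs, n_stages, n_machines = 20, 3, 6      # assumed problem size, for illustration only

# The three state matrices described above
T = np.random.rand(n_jobs, n_stages)         # processing time of each operation
M = np.zeros((n_machines, n_jobs))           # 1 where a handling device is currently working on a job
J = np.zeros((n_jobs, n_stages))             # 1 where an operation has been completed

def immediate_reward(t_p: float, lam: float = 0.5) -> float:
    """Immediate reward r_k = lambda * t_p, with lambda a constant in [0, 1]."""
    return lam * t_p

print(immediate_reward(T[0, 0]))
```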
Further, the three-stage hybrid flow shop scheduling model with unrelated parallel machines in step 1 of the present invention is specifically:
step 11, simplifying the actual scene and making the following assumptions for the model: (1) operations of the shore bridge and the field bridge are performed in units of bays, and a shore bridge/field bridge cannot move to the next bay before unloading of the current bay is completed; (2) the preparation time of the equipment at each stage, including the movement time of the shore bridge/field bridge and the turning-around and avoidance time of the ART, is not negligible, so the equipment preparation time is taken into account when calculating the total completion time; (3) equipment failures and the box turnover time are not considered;
step 12, the mathematical model is as follows:
the constraint s.t. is:
wherein formula (1) is the objective function, indicating that the maximum completion time is minimized; formula (2) indicates that each bay is served by exactly one device at each stage; formula (3) ensures that each bay being served has exactly one predecessor and one successor; formula (4) indicates that the jobs in stage 1 are constrained by the priority relation Φ, with job i completed after job j; formula (5) represents the time at which bay j starts to be served in stages 2 and 3; formula (6) represents the start time of each job in stage 1; formula (7) represents the end time of each job in stage 1; formula (8) represents the earliest time at which each shore bridge can start working; formula (9) indicates that any job must be processed by the equipment of the previous stage before it can be processed in the next stage; formula (10) indicates that the time at which a bay starts to be served is less than or equal to the time at which its service ends; formulas (11)-(13) give the value ranges of the three decision variables, namely the assignment variable, s_ik and e_ik;
the model parameters are as follows:
i/j denotes the index of a bay/container, k denotes the index of a stage, M_k denotes the serial number of a device in stage k, m_k denotes the total number of devices in stage k, n denotes the total number of containers, n_i denotes the number of containers in each bay, f_mk denotes the earliest time at which device m can start operating in stage k, p_ik denotes the operation time of bay i in stage k, a further parameter denotes the preparation time of device m between two adjacent operations, a_i denotes the first container number in the bay, b_i denotes the last container number in the bay, s_ik denotes the start time of job i in stage k, e_ik denotes the completion time of job i in stage k, a binary assignment variable equals 1 when container i/j is operated by device m in stage k and 0 otherwise, Ω_k denotes the set of bay numbers contained in stage k, Φ denotes the set of precedence relations under which the bays are served, and N denotes a sufficiently large positive number.
Further, the Johnson rule of the present invention specifically is:
Tasks with a long field bridge operation time and a short shore bridge operation time are selected preferentially, so that the ART can transport containers to the field bridge as early as possible; at the same time, the field bridge preferentially processes the containers with longer operation times, so that the idle time of the field bridge is minimized.
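Purely as an illustration of the selection behaviour described above (not the patent's reference implementation), the following sketch prefers waiting containers with a short shore bridge time and a long field bridge time; the tuple layout is an assumption.

```python
def johnson_pick(waiting):
    """waiting: list of (job_id, shore_bridge_time, field_bridge_time).
    Prefer jobs with a short shore bridge time and a long field bridge time,
    so the ART can deliver to the field bridge early and keep it busy."""
    return min(waiting, key=lambda job: (job[1], -job[2]))[0]

# Example: job 1 has the shortest shore bridge time, so it is picked first.
print(johnson_pick([(0, 4.0, 2.0), (1, 1.5, 5.0), (2, 2.0, 6.0)]))   # -> 1
```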
Further, the critic network of the present invention adopts a CNN architecture comprising an input layer, a convolutional layer and a fully connected layer, wherein:
an input layer: the number of input channels is 3, corresponding to the three matrices, namely the processing time matrix T, the machine operation matrix M and the job completion matrix J, and the convolution kernel size is 1×1;
a convolutional layer: the number of nodes of the convolutional layer is set according to the size of the input state matrix, and a ReLU activation function is used between the convolutional layer and the output layer;
an output layer: the output layer has a single node that outputs the value.
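A minimal PyTorch sketch of such a critic structure (3-channel input over the T, M and J matrices, a 1×1 convolution, ReLU, and a single-value output); the channel width, hidden size and state-matrix dimensions are illustrative assumptions rather than values from the patent.

```python
import torch
import torch.nn as nn

class CriticCNN(nn.Module):
    """Critic sketch: 3-channel state (T, M, J matrices) -> single value."""
    def __init__(self, n_jobs: int, n_stages: int, channels: int = 16, hidden: int = 64):
        super().__init__()
        self.conv = nn.Conv2d(3, channels, kernel_size=1)          # 1x1 convolution over the state matrices
        self.fc = nn.Linear(channels * n_jobs * n_stages, hidden)  # fully connected layer
        self.out = nn.Linear(hidden, 1)                            # single-node output: the value

    def forward(self, state):                      # state: (batch, 3, n_jobs, n_stages)
        x = torch.relu(self.conv(state))
        x = torch.relu(self.fc(x.flatten(1)))
        return self.out(x)

critic = CriticCNN(n_jobs=20, n_stages=3)          # assumed problem size
value = critic(torch.zeros(1, 3, 20, 3))           # -> tensor of shape (1, 1)
```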
Further, the structure of the actor network of the present invention adopts a CNN architecture that is the same as that of the critic network, except that the output is a specific action.
Further, the specific formula of the epsilon-greedy strategy in the step 4 is as follows:
wherein the action with the largest current Q value is selected with probability 1−ε, and an action is selected uniformly at random from all actions with a fixed probability ε.
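An illustrative sketch of this ε-greedy selection, assuming q_values holds the current Q estimate of each candidate action (the function name and values are assumptions for illustration):

```python
import random

def epsilon_greedy(q_values, epsilon: float) -> int:
    """With probability epsilon pick a uniformly random action index,
    otherwise pick the action with the largest current Q value."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

action = epsilon_greedy([0.2, 0.7, 0.1], epsilon=0.1)   # usually returns index 1
```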
Further, in the critic network adopting the CNN neural network architecture of the present invention:
the parameter update rule of the neural network CNN is based on a loss function, namely the squared error between the target value and the estimate:
L_c(θ_c) = (r + γ·max_a′ Q(s′, a′, θ_c) − Q(s, a, θ_c))²
where θ_c are the parameters of the critic network, Q(s, a, θ_c) is the estimate, and r + γ·max_a′ Q(s′, a′, θ_c) is the target value;
in addition, the loss function for an actor network is:
L_a(θ_a) = −∑ log π(a | s, θ_a) · L_c(θ_c)
where θ_a are the parameters of the actor network; when an action a has a small probability of being selected but yields a large return, the value of L_a(θ_a) increases.
Further, the specific method in step 4 of the present invention is:
In order to improve the stability of the algorithm, a method of reducing the correlation between the current Q value and the target Q value is adopted: the actor network is divided into an actor current network and an actor target network, and the critic network is divided into a critic current network and a critic target network; the loss function of the actor current network and the loss function of the critic current network are adjusted accordingly; the parameters of the actor target network and the critic target network are kept fixed for a period of time and, after every n steps, are updated from the current-network parameters according to the following formula:
θ_target ← τ·θ_current + (1 − τ)·θ_target
wherein τ is an update coefficient whose value is smaller than a certain threshold; this parameter update mode of the target networks is called soft update.
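A minimal sketch of the soft update step, assuming PyTorch-style parameter iterators; τ is the small update coefficient mentioned above, and the value 0.01 is only an illustrative choice.

```python
def soft_update(current_net, target_net, tau: float = 0.01):
    """theta_target <- tau * theta_current + (1 - tau) * theta_target."""
    for p, p_targ in zip(current_net.parameters(), target_net.parameters()):
        p_targ.data.copy_(tau * p.data + (1.0 - tau) * p_targ.data)

# usage: soft_update(critic_current, critic_target); soft_update(actor_current, actor_target)
```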
The invention has the following beneficial effects: the automatic wharf cooperative scheduling method based on deep reinforcement learning models the cooperative operation process of the shore bridge, ART and field bridge of an automated container terminal as a three-stage hybrid flow shop scheduling problem with unrelated parallel machines, and regards each device as an intelligent agent, so that the situation in which the scheduling plan must be re-planned when equipment fails can be avoided and the scheduling becomes more flexible; the method uses a neural network for function approximation, uses the experience replay technique to overcome the problem of insufficient storage space in a multi-agent environment, can effectively eliminate the correlation between adjacent states s_t and s_{t+1}, improves the update efficiency of the agents, and reduces the number of iterations.
Drawings
The invention will be further described with reference to the accompanying drawings and examples, in which:
fig. 1 is a flowchart of a job shop scheduling method based on deep reinforcement learning according to an embodiment of the present invention.
FIG. 2 is a schematic diagram of a neural network according to an embodiment of the present invention.
FIG. 3 is a schematic diagram of the critic current network, the actor current network, the critic target network and the actor target network according to an embodiment of the invention.
FIG. 4 is a schematic diagram of the parameter updating method of the critic current network, the actor current network, the critic target network and the actor target network according to the embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Fig. 1 shows a flowchart of an automated wharf co-scheduling method based on deep reinforcement learning, which specifically includes the following steps:
the container is regarded as a workpiece, and the processing operation sequentially comprises 3 stages: the first stage is a wharf frontier stage, and the parallel machine is a shore bridge; the second stage is a horizontal transportation stage, and the parallel machine is ART (Intelligent Robot of transportation); the third stage is a storage yard unloading stage, and the parallel machines are field bridges.
Step 2, modeling the scheduling problem of the mixed flow shop into a multi-agent Markov decision process:
the intelligent agent: each device (a shore bridge, an ART and a field bridge) is used as an agent, and m devices correspond to m agents in total;
state space: the state features are represented as three matrices, namely a processing time matrix T composed of the processing time of each operation, an equipment operation matrix M composed of the current operating states of the handling equipment, and a job completion matrix J composed of the completion state of each job;
an action space: the action space is divided into 2 categories according to stage characteristics, wherein the first category is the heuristic Johnson rule, corresponding to the parallel machines of the quayside crane unloading stage and the yard unloading stage; the second category corresponds to the ART transportation stage and is composed of 5 heuristic priority rules, namely FIFO (first come, first processed), SPT (preferentially selecting the workpiece with the shortest processing time), LWKR (preferentially selecting the workpiece with the least remaining processing time), SPT/TWK (preferentially selecting the workpiece with the smallest ratio of operation time to total processing time) and SRM (preferentially selecting the workpiece with the least remaining processing time excluding the current operation);
the reward function: the scheduling objective is to minimize the maximum completion time, and since the production cycle is the sum of the processing times, the immediate reward is defined as r_k = λ·t_p, where λ is a constant in [0,1] and t_p is the processing time on each machine; the long-term reward is set as a function of γ, C_opt and C_max, where γ is a number in [0,1], C_opt is the optimal scheduling result, and C_max is the predicted maximum completion time.
Step 3, initializing an experience pool D and its capacity N, initializing the critic network parameters θ_c and the actor network parameters θ_a, initializing the environment to obtain the initial state s_0, and initializing the action space;
Step 4, according to the ε-greedy strategy, selecting an action a_t (randomly with probability ε), executing the corresponding scheduling operation in the current state s_t, updating the corresponding state matrix, observing the real-time reward r_t, and perceiving the next state s_{t+1};
Step 5, storing the tuple (s_t, a_t, r_t, s_{t+1}) in the experience pool D as a data set for training the current networks;
Step 6, training the critic current network and the actor current network, updating the current network parameters θ_c and θ_a, and updating the parameters of the critic target network and the actor target network by the soft update method, until the set number of iterations is reached;
Step 7, after all the jobs are finished, the environment calculates the maximum completion time C_max.
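Steps 3-7 above could be organised as in the following illustrative sketch of the interaction loop with an experience pool; the env interface (reset, step, makespan) and all helper names are assumptions modelled on the description, not the patent's reference implementation.

```python
import random
from collections import deque

def train(env, select_action, update_networks,
          episodes=100, capacity=10000, batch_size=32):
    """Interaction loop sketched from steps 3-7: fill the experience pool D,
    sample mini-batches to train the current networks, and report C_max."""
    replay = deque(maxlen=capacity)                    # experience pool D with capacity N
    for _ in range(episodes):
        state, done = env.reset(), False               # initial state s_0
        while not done:
            action = select_action(state)                        # epsilon-greedy choice
            next_state, reward, done = env.step(action)          # scheduling operation
            replay.append((state, action, reward, next_state))   # store (s_t, a_t, r_t, s_{t+1})
            if len(replay) >= batch_size:
                update_networks(random.sample(list(replay), batch_size))
            state = next_state
    return env.makespan()                              # maximum completion time C_max
```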
The three-stage hybrid flow shop scheduling mathematical model with unrelated parallel machines is specifically as follows:
through the simplification of the actual scene, the following assumptions are made for the model: (1) operations of the shore bridge and the yard bridge are performed in shell units, and the shore bridge/yard bridge cannot move to the next shell before unloading of one shell is completed; (2) the preparation time of the equipment at each stage is longer, such as the moving time of a shore bridge/a field bridge, ART turning around and avoiding time and the like, so the preparation time of the equipment is considered when calculating the total completion time; (3) the equipment failure condition is not considered, and the box turnover time is not considered.
The mathematical model is as follows:
the constraint s.t. is:
wherein formula (1) is the objective function, indicating that the maximum completion time is minimized; formula (2) indicates that each bay is served by exactly one device at each stage; formula (3) ensures that each bay being served has exactly one predecessor and one successor; formula (4) indicates that the jobs in stage 1 are constrained by the priority relation Φ, with job i completed after job j; formula (5) represents the time at which bay j starts to be served in stages 2 and 3; formula (6) represents the start time of each job in stage 1; formula (7) represents the end time of each job in stage 1; formula (8) represents the earliest time at which each shore bridge can start working; formula (9) indicates that any job must be processed by the equipment of the previous stage before it can be processed in the next stage; formula (10) indicates that the time at which a bay starts to be served is less than or equal to the time at which its service ends; formulas (11)-(13) give the value ranges of the three decision variables, namely the assignment variable, s_ik and e_ik;
the model parameters are as follows:
the Johnson rule aims to preferentially select tasks with long operation time of a bridge and short operation time of a shore bridge, so that ART can transport containers to the bridge as soon as possible, and meanwhile, the bridge preferentially processes the containers with longer operation time, so that the idle time of the bridge is shortest.
The critic network adopts a CNN architecture, shown in fig. 3, composed of an input layer, a convolutional layer and a fully connected layer, where:
An input layer: the number of input channels is 3, corresponding to the three matrices, namely the processing time matrix T, the machine operation matrix M and the job completion matrix J; kernel_size is 1×1;
A convolutional layer: the number of nodes of the convolutional layer is set according to the size of the input state matrix, and a ReLU activation function is used between the convolutional layer and the output layer;
An output layer: the output layer has a single node that outputs the value.
The structure of the actor network is also a typical CNN architecture, the same as that of the critic network, except that the output is a specific action.
The selection of the agent action adopts an ε-greedy strategy, expressed by the following formula:
The strategy means that the action with the largest current Q value is selected with probability 1−ε, and an action is selected uniformly at random from all actions with a fixed probability ε.
The parameter update rule of the neural network CNN is based on a loss function. The loss function of the critic network is the squared error between the target value and the estimate:
L_c(θ_c) = (r + γ·max_a′ Q(s′, a′, θ_c) − Q(s, a, θ_c))²
where θ_c are the parameters of the critic network, Q(s, a, θ_c) is the estimate, and r + γ·max_a′ Q(s′, a′, θ_c) is the target value.
In addition, the loss function for an actor network is:
L_a(θ_a) = −∑ log π(a | s, θ_a) · L_c(θ_c)
where θ_a are the parameters of the actor network; when an action a has a small probability of being selected but yields a large return, the value of L_a(θ_a) increases.
In order to improve the stability of the algorithm, a method of reducing the correlation between the current Q value and the target Q value is adopted, as shown in fig. 4: the actor network is divided into an actor current network and an actor target network, and the critic network is divided into a critic current network and a critic target network. The loss function of the actor current network and the loss function of the critic current network are adjusted accordingly. The parameters of the actor target network and the critic target network are kept fixed for a period of time and, after every n steps, are updated from the current-network parameters according to the following formula:
θ_target ← τ·θ_current + (1 − τ)·θ_target
wherein τ is an update coefficient and generally takes a small value. This parameter update method of the target networks is called soft update.
Finally, it should be noted that the above-mentioned embodiments are only intended to illustrate and explain the present invention, and are not intended to limit the present invention within the scope of the described embodiments.
Furthermore, it will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, and that many variations and modifications may be made in accordance with the teachings of the present invention, all of which fall within the scope of the invention as claimed.
Claims (10)
1. An automatic wharf cooperative scheduling method based on deep reinforcement learning is characterized by comprising the following steps:
step 1, modeling the cooperative operation process of an automatic container wharf shore bridge, ART and a field bridge into a three-stage hybrid flow shop scheduling model with irrelevant parallel machines;
step 2, modeling the mixed flow shop scheduling model into a multi-agent Markov decision process;
step 3, initializing an experience pool D and its capacity N, initializing the critic network parameters θ_c and the actor network parameters θ_a, initializing the environment to obtain the initial state s_0, and initializing the action space;
step 4, according to the ε-greedy strategy, selecting an action a_t (randomly with probability ε), executing the corresponding scheduling operation in the current state s_t, updating the corresponding state matrix, observing the real-time reward r_t, and perceiving the next state s_{t+1};
step 5, storing the tuple (s_t, a_t, r_t, s_{t+1}) in the experience pool D as a data set for training the current networks;
step 6, training the critic current network and the actor current network, updating the current network parameters θ_c and θ_a, and updating the parameters of the critic target network and the actor target network by the soft update method, until the set number of iterations is reached;
step 7, after all the jobs are finished, calculating the maximum completion time C_max and generating the optimal scheduling plan.
2. The automatic wharf collaborative scheduling method based on deep reinforcement learning of claim 1, wherein the hybrid flow shop scheduling model in the step 1 is specifically:
the container is regarded as a workpiece, and the processing operation sequentially comprises 3 stages: the first stage is a wharf frontier stage, and the parallel machine is a shore bridge; the second stage is a horizontal transportation stage, and the parallel machine is ART; the third stage is a storage yard unloading stage, and the parallel machines are field bridges.
3. The automated wharf collaborative scheduling method based on deep reinforcement learning of claim 1, wherein the markov decision process in the step 2 is specifically:
the intelligent agent: each device is taken as an intelligent agent, the devices comprise a shore bridge, an ART and a field bridge, and m devices correspond to m intelligent agents;
state space: the state features are represented as three matrices, namely a processing time matrix T composed of the processing time of each operation, an equipment operation matrix M composed of the current operating states of the handling equipment, and a job completion matrix J composed of the completion state of each job;
an action space: the action space is divided into 2 categories according to stage characteristics, wherein the first category is the heuristic Johnson rule, corresponding to the parallel machines of the quayside crane unloading stage and the yard unloading stage; the second category corresponds to the ART transportation stage and comprises 5 heuristic priority rules: the first-in-first-out principle (FIFO), preferentially selecting the workpiece with the shortest processing time (SPT), preferentially selecting the workpiece with the least remaining processing time (LWKR), preferentially selecting the workpiece with the smallest ratio of operation time to total processing time (SPT/TWK), and preferentially selecting the workpiece with the least remaining processing time excluding the current operation (SRM);
the reward function: the scheduling objective is to minimize the maximum completion time, and since the production cycle is the sum of the processing times, the immediate reward is defined as r_k = λ·t_p, where λ is a constant in [0,1] and t_p is the processing time on each machine; the long-term reward is set as a function of γ, C_opt and C_max, where γ is a number in [0,1], C_opt is the optimal scheduling result, and C_max is the predicted maximum completion time.
4. The automated wharf co-scheduling method based on deep reinforcement learning according to claim 1, wherein the three-stage hybrid flow shop scheduling model with uncorrelated parallel machines in step 1 is specifically:
step 11, simplifying the actual scene and making the following assumptions for the model: (1) operations of the shore bridge and the field bridge are performed in units of bays, and a shore bridge/field bridge cannot move to the next bay before unloading of the current bay is completed; (2) the preparation time of the equipment at each stage, including the movement time of the shore bridge/field bridge and the turning-around and avoidance time of the ART, is not negligible, so the equipment preparation time is taken into account when calculating the total completion time; (3) equipment failures and the box turnover time are not considered;
step 12, the mathematical model is as follows:
the constraint s.t. is:
wherein formula (1) is the objective function, indicating that the maximum completion time is minimized; formula (2) indicates that each bay is served by exactly one device at each stage; formula (3) ensures that each bay being served has exactly one predecessor and one successor; formula (4) indicates that the jobs in stage 1 are constrained by the priority relation Φ, with job i completed after job j; formula (5) represents the time at which bay j starts to be served in stages 2 and 3; formula (6) represents the start time of each job in stage 1; formula (7) represents the end time of each job in stage 1; formula (8) represents the earliest time at which each shore bridge can start working; formula (9) indicates that any job must be processed by the equipment of the previous stage before it can be processed in the next stage; formula (10) indicates that the time at which a bay starts to be served is less than or equal to the time at which its service ends; formulas (11)-(13) give the value ranges of the three decision variables, namely the assignment variable, s_ik and e_ik;
the model parameters are as follows:
i/j denotes the index of a bay/container, k denotes the index of a stage, M_k denotes the serial number of a device in stage k, m_k denotes the total number of devices in stage k, n denotes the total number of containers, n_i denotes the number of containers in each bay, f_mk denotes the earliest time at which device m can start operating in stage k, p_ik denotes the operation time of bay i in stage k, a further parameter denotes the preparation time of device m between two adjacent operations, a_i denotes the first container number in the bay, b_i denotes the last container number in the bay, s_ik denotes the start time of job i in stage k, e_ik denotes the completion time of job i in stage k, a binary assignment variable equals 1 when container i/j is operated by device m in stage k and 0 otherwise, Ω_k denotes the set of bay numbers contained in stage k, Φ denotes the set of precedence relations under which the bays are served, and N denotes a sufficiently large positive number.
5. The automated wharf collaborative scheduling method based on deep reinforcement learning of claim 3, wherein the Johnson rule is specifically as follows:
Tasks with a long field bridge operation time and a short shore bridge operation time are selected preferentially, so that the ART can transport containers to the field bridge as early as possible; at the same time, the field bridge preferentially processes the containers with longer operation times, so that the idle time of the field bridge is minimized.
6. The automated wharf co-scheduling method based on deep reinforcement learning of claim 1, wherein the critic network adopts a CNN network architecture and comprises an input layer, a convolutional layer and a full connection layer, wherein:
an input layer: the number of input channels is 3, corresponding to the three matrices, namely the processing time matrix T, the machine operation matrix M and the job completion matrix J, and the convolution kernel size is 1×1;
a convolutional layer: the number of nodes of the convolutional layer is set according to the size of the input state matrix, and a ReLU activation function is used between the convolutional layer and the output layer;
an output layer: the output layer has a single node that outputs the value.
7. The automated wharf co-scheduling method based on deep reinforcement learning of claim 6, wherein the structure of the actor network adopts a CNN architecture that is the same as that of the critic network, except that the output is a specific action.
8. The automatic wharf co-scheduling method based on deep reinforcement learning of claim 1, wherein the epsilon-greedy strategy in the step 4 has a specific formula as follows:
wherein the action with the largest current Q value is selected with probability 1−ε, and an action is selected uniformly at random from all actions with a fixed probability ε.
9. The automated wharf co-scheduling method based on deep reinforcement learning of claim 6, wherein in the critic network adopting the CNN neural network architecture:
the parameter update rule of the neural network CNN is based on a loss function, namely the squared error between the target value and the estimate:
L_c(θ_c) = (r + γ·max_a′ Q(s′, a′, θ_c) − Q(s, a, θ_c))²
where θ_c are the parameters of the critic network, Q(s, a, θ_c) is the estimate, and r + γ·max_a′ Q(s′, a′, θ_c) is the target value;
in addition, the loss function for an actor network is:
L_a(θ_a) = −∑ log π(a | s, θ_a) · L_c(θ_c)
where θ_a are the parameters of the actor network; when an action a has a small probability of being selected but yields a large return, the value of L_a(θ_a) increases.
10. The automatic wharf co-scheduling method based on deep reinforcement learning of claim 8, wherein the specific method in the step 4 is as follows:
in order to improve the stability of the algorithm, a method of reducing the correlation between the current Q value and the target Q value is adopted: the actor network is divided into an actor current network and an actor target network, and the critic network is divided into a critic current network and a critic target network; the loss function of the actor current network and the loss function of the critic current network are adjusted accordingly; the parameters of the actor target network and the critic target network are kept fixed for a period of time and, after every n steps, are updated from the current-network parameters according to the following formula:
θ_target ← τ·θ_current + (1 − τ)·θ_target
wherein τ is an update coefficient whose value is smaller than a certain threshold; this parameter update mode of the target networks is called soft update.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111299059.1A CN113988443B (en) | 2021-11-04 | 2021-11-04 | Automatic wharf collaborative scheduling method based on deep reinforcement learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111299059.1A CN113988443B (en) | 2021-11-04 | 2021-11-04 | Automatic wharf collaborative scheduling method based on deep reinforcement learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113988443A true CN113988443A (en) | 2022-01-28 |
CN113988443B CN113988443B (en) | 2024-06-28 |
Family
ID=79746379
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111299059.1A Active CN113988443B (en) | 2021-11-04 | 2021-11-04 | Automatic wharf collaborative scheduling method based on deep reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113988443B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115542849A (en) * | 2022-08-22 | 2022-12-30 | 苏州诀智科技有限公司 | Container wharf intelligent ship control and distribution method, system, storage medium and computer |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105740979A (en) * | 2016-01-29 | 2016-07-06 | 上海海事大学 | Intelligent dispatching system and method for multi-AGV (Automatic Guided Vehicle) of automatic container terminal |
CN112434870A (en) * | 2020-12-01 | 2021-03-02 | 大连理工大学 | Dual-automation field bridge dynamic scheduling method for vertical arrangement of container areas |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105740979A (en) * | 2016-01-29 | 2016-07-06 | 上海海事大学 | Intelligent dispatching system and method for multi-AGV (Automatic Guided Vehicle) of automatic container terminal |
CN112434870A (en) * | 2020-12-01 | 2021-03-02 | 大连理工大学 | Dual-automation field bridge dynamic scheduling method for vertical arrangement of container areas |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115542849A (en) * | 2022-08-22 | 2022-12-30 | 苏州诀智科技有限公司 | Container wharf intelligent ship control and distribution method, system, storage medium and computer |
CN115542849B (en) * | 2022-08-22 | 2023-12-05 | 苏州诀智科技有限公司 | Container terminal intelligent ship control and dispatch method, system, storage medium and computer |
Also Published As
Publication number | Publication date |
---|---|
CN113988443B (en) | 2024-06-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112884239B (en) | Space detonator production scheduling method based on deep reinforcement learning | |
Kim et al. | A look-ahead dispatching method for automated guided vehicles in automated port container terminals | |
TWI663568B (en) | Material scheduling method and system based on real-time status of semiconductor device | |
CN111882215B (en) | Personalized customization flexible job shop scheduling method containing AGV | |
CN111160755B (en) | Real-time scheduling method for aircraft overhaul workshop based on DQN | |
CN114611897B (en) | Intelligent production line self-adaptive dynamic scheduling strategy selection method | |
CN112434870B (en) | Dual-automation field bridge dynamic scheduling method for vertical arrangement of container areas | |
CN112836974B (en) | Dynamic scheduling method for multiple field bridges between boxes based on DQN and MCTS | |
CN115454005A (en) | Manufacturing workshop dynamic intelligent scheduling method and device oriented to limited transportation resource scene | |
CN113988443A (en) | Automatic wharf cooperative scheduling method based on deep reinforcement learning | |
CN110554673B (en) | Intelligent RGV processing system scheduling method and device | |
CN116500986A (en) | Method and system for generating priority scheduling rule of distributed job shop | |
CN115793657B (en) | Distribution robot path planning method based on temporal logic control strategy | |
CN111353646A (en) | Steel-making flexible scheduling optimization method with switching time, system, medium and equipment | |
CN115689049A (en) | Multi-target workshop scheduling method for improving gray wolf optimization algorithm | |
CN117314055A (en) | Intelligent manufacturing workshop production-transportation joint scheduling method based on reinforcement learning | |
CN117331700B (en) | Computing power network resource scheduling system and method | |
CN117891220A (en) | Distributed mixed flow shop scheduling method based on multi-agent deep reinforcement learning | |
CN117196261B (en) | Task instruction distribution method based on field bridge operation range | |
Kouvakas et al. | A modular supervisory control scheme for the safety of an automated manufacturing system | |
CN113050644A (en) | AGV (automatic guided vehicle) scheduling method based on iterative greedy evolution | |
CN112395690A (en) | Reinforced learning-based shipboard aircraft surface guarantee flow optimization method | |
CN112053046B (en) | Automatic container terminal AGV reentry and reentry path planning method with time window | |
Panov et al. | Automatic formation of the structure of abstract machines in hierarchical reinforcement learning with state clustering | |
Liao et al. | Learning to schedule job-shop problems via hierarchical reinforcement learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |