CN113988443A - Automatic wharf cooperative scheduling method based on deep reinforcement learning - Google Patents
Automatic wharf cooperative scheduling method based on deep reinforcement learning
- Publication number
- CN113988443A CN113988443A CN202111299059.1A CN202111299059A CN113988443A CN 113988443 A CN113988443 A CN 113988443A CN 202111299059 A CN202111299059 A CN 202111299059A CN 113988443 A CN113988443 A CN 113988443A
- Authority
- CN
- China
- Prior art keywords
- time
- network
- stage
- bridge
- scheduling
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 238000000034 method Methods 0.000 title claims abstract description 59
- 230000002787 reinforcement Effects 0.000 title claims abstract description 20
- 230000008569 process Effects 0.000 claims abstract description 21
- 238000004519 manufacturing process Methods 0.000 claims abstract description 6
- 230000009471 action Effects 0.000 claims description 28
- 238000012545 processing Methods 0.000 claims description 25
- 230000006870 function Effects 0.000 claims description 24
- 239000011159 matrix material Substances 0.000 claims description 24
- 238000013527 convolutional neural network Methods 0.000 claims description 11
- 238000003754 machining Methods 0.000 claims description 10
- 238000013528 artificial neural network Methods 0.000 claims description 8
- 238000002360 preparation method Methods 0.000 claims description 8
- 235000015170 shellfish Nutrition 0.000 claims description 8
- 238000012549 training Methods 0.000 claims description 6
- 238000013178 mathematical model Methods 0.000 claims description 4
- 238000003860 storage Methods 0.000 claims description 4
- 230000004913 activation Effects 0.000 claims description 3
- 238000005096 rolling process Methods 0.000 claims description 3
- 238000012163 sequencing technique Methods 0.000 claims description 3
- 238000007514 turning Methods 0.000 claims description 3
- 230000007306 turnover Effects 0.000 claims description 3
- 238000010586 diagram Methods 0.000 description 3
- 230000009286 beneficial effect Effects 0.000 description 1
- 230000007547 defect Effects 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 238000005516 engineering process Methods 0.000 description 1
- 238000004880 explosion Methods 0.000 description 1
- 238000010801 machine learning Methods 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 238000005457 optimization Methods 0.000 description 1
- 238000011160 research Methods 0.000 description 1
- 230000003068 static effect Effects 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/04—Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N7/00—Computing arrangements based on specific mathematical models
- G06N7/01—Probabilistic graphical models, e.g. probabilistic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/06—Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
- G06Q10/063—Operations research, analysis or management
- G06Q10/0631—Resource planning, allocation, distributing or scheduling for enterprises or organisations
- G06Q10/06313—Resource planning in a project environment
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/06—Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
- G06Q10/063—Operations research, analysis or management
- G06Q10/0631—Resource planning, allocation, distributing or scheduling for enterprises or organisations
- G06Q10/06316—Sequencing of tasks or work
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/08—Logistics, e.g. warehousing, loading or distribution; Inventory or stock management
Landscapes
- Engineering & Computer Science (AREA)
- Business, Economics & Management (AREA)
- Human Resources & Organizations (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Economics (AREA)
- Strategic Management (AREA)
- Entrepreneurship & Innovation (AREA)
- Mathematical Physics (AREA)
- Tourism & Hospitality (AREA)
- Quality & Reliability (AREA)
- General Business, Economics & Management (AREA)
- Operations Research (AREA)
- Marketing (AREA)
- Software Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Development Economics (AREA)
- General Engineering & Computer Science (AREA)
- Computing Systems (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Molecular Biology (AREA)
- Computational Linguistics (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Health & Medical Sciences (AREA)
- Game Theory and Decision Science (AREA)
- General Health & Medical Sciences (AREA)
- Educational Administration (AREA)
- Algebra (AREA)
- Mathematical Analysis (AREA)
- Mathematical Optimization (AREA)
- Pure & Applied Mathematics (AREA)
- Computational Mathematics (AREA)
- Probability & Statistics with Applications (AREA)
- Biodiversity & Conservation Biology (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention discloses an automatic wharf cooperative scheduling method based on deep reinforcement learning for the field of automated port scheduling, where traditional exact and approximate algorithms cannot solve large-scale scheduling problems quickly and have difficulty adapting to dynamic production environments. The invention treats each device in the container loading and unloading process as an intelligent agent, thereby avoiding the need to re-plan the schedule when a machine fails and making the scheduling more flexible.
Description
Technical Field
The invention relates to the field of port scheduling, in particular to an automatic wharf collaborative scheduling method based on deep reinforcement learning.
Background
With the rapid development of port trade in recent years, the automated container terminal has gradually become an important transportation hub. The shore bridge, ART and field bridge cooperate to complete container loading and unloading tasks, and the loading and unloading efficiency of the automated container terminal is mainly determined by the joint scheduling of these three types of equipment, so research on optimizing their coordinated scheduling is of great significance.
The processing of a container through the shore bridge, the ART and the field bridge can be regarded as a hybrid flow shop scheduling problem, which is a typical NP-hard problem. The traditional solution methods for the flow shop scheduling problem are exact algorithms and approximate algorithms, where the approximate algorithms are further divided into heuristic and meta-heuristic algorithms. These traditional methods have many limitations: exact algorithms are only suitable for small-scale problems and have poor practicability; and although heuristic and meta-heuristic methods can find a near-optimal solution in a short time, the resulting schedule targets a static production environment and cannot adapt well to a real dynamic production environment with emergencies such as machine faults.
In recent years, the rise of machine learning and neural networks has brought new ideas for solving the hybrid flow shop scheduling problem. At present, the most widely used reinforcement learning algorithm for shop scheduling is Q-Learning, but Q-Learning is a tabular method that requires a large amount of space to store the value function, and for large-scale shop scheduling problems it carries the hidden risk of dimension explosion.
Disclosure of Invention
The technical problem to be solved by the invention is to provide an automatic wharf cooperative scheduling method based on deep reinforcement learning aiming at the defects in the prior art.
The technical scheme adopted by the invention for solving the technical problems is as follows:
the invention provides an automatic wharf cooperative scheduling method based on deep reinforcement learning, which comprises the following steps:
step 1, modeling the cooperative operation process of the shore bridge, ART and field bridge of the automated container terminal as a three-stage hybrid flow shop scheduling model with unrelated parallel machines;
step 2, modeling the hybrid flow shop scheduling model as a multi-agent Markov decision process;
step 3, initializing an experience pool D and its capacity N, initializing the critic network parameters θ_c and the actor network parameters θ_a, initializing the environment to obtain the initial state s_0, and initializing the action space;
step 4, according to the ε-greedy strategy, selecting an action a_t (randomly with probability ε), executing the corresponding scheduling operation in the current state s_t, updating the corresponding state matrix, observing the real-time reward r_t, and perceiving the next state s_{t+1};
step 5, storing the tuple (s_t, a_t, r_t, s_{t+1}) in the experience pool D as a data set for training the current networks;
step 6, training the critic current network and the actor current network, updating the current network parameters θ_c and θ_a, and updating the parameters of the critic target network and the actor target network by the soft update method, until the set number of iterations is reached;
step 7, after all the jobs are finished, calculating the maximum completion time C_max and generating the optimal scheduling plan.
Further, the hybrid flow shop scheduling model in step 1 of the present invention specifically is:
the container is regarded as a workpiece, and the processing operation sequentially comprises 3 stages: the first stage is a wharf frontier stage, and the parallel machine is a shore bridge; the second stage is a horizontal transportation stage, and the parallel machine is ART; the third stage is a storage yard unloading stage, and the parallel machines are field bridges.
Further, the markov decision process in step 2 of the present invention specifically includes:
the intelligent agent: each device is taken as an intelligent agent, the devices comprise a shore bridge, an ART and a field bridge, and m devices correspond to m intelligent agents;
state space: the state features are represented as three matrices, namely a processing time matrix T composed of the processing time of each operation, an equipment operation matrix M composed of the current operating states of the handling equipment, and a job completion matrix J composed of the completion state of each job;
an action space: the action space is divided into 2 categories according to stage characteristics, wherein the first category is the heuristic Johnson rule, corresponding to the parallel machines of the quayside crane unloading stage and the yard unloading stage; the second category corresponds to the ART transportation stage and comprises 5 heuristic priority rules: the first-in-first-out principle (FIFO), preferentially selecting the workpiece with the shortest processing time (SPT), preferentially selecting the workpiece with the least remaining processing time (LWKR), preferentially selecting the workpiece with the smallest ratio of operation time to total processing time (SPT/TWK), and preferentially selecting the workpiece with the least remaining processing time excluding the current operation (SRM);
the reward function: the scheduling objective is to minimize the maximum completion time, and since the production cycle is the sum of the processing times, the immediate reward is defined as r_k = λ·t_p, where λ is a constant in [0,1] and t_p is the processing time on each machine; the long-term reward is set as a function of γ, C_opt and C_max, where γ is a number in [0,1], C_opt is the optimal scheduling result, and C_max is the predicted maximum completion time.
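As a minimal illustrative sketch of the state and reward design described above (the array shapes, the λ value and the helper name immediate_reward are assumptions for illustration, not values taken from the patent):

```python
import numpy as np

n_jobs, n_stages, n_machines = 20, 3, 6      # assumed problem size, for illustration only

# The three state matrices described above
T = np.random.rand(n_jobs, n_stages)         # processing time of each operation
M = np.zeros((n_machines, n_jobs))           # 1 where a handling device is currently working on a job
J = np.zeros((n_jobs, n_stages))             # 1 where an operation has been completed

def immediate_reward(t_p: float, lam: float = 0.5) -> float:
    """Immediate reward r_k = lambda * t_p, with lambda a constant in [0, 1]."""
    return lam * t_p

print(immediate_reward(T[0, 0]))
```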
Further, the three-stage hybrid flow shop scheduling model with unrelated parallel machines in step 1 of the present invention is specifically:
step 11, simplifying the actual scene and making the following assumptions for the model: (1) operations of the shore bridge and the field bridge are performed in units of bays, and a shore bridge/field bridge cannot move to the next bay before unloading of the current bay is completed; (2) the preparation time of the equipment at each stage, including the movement time of the shore bridge/field bridge and the turning-around and avoidance time of the ART, is not negligible, so the equipment preparation time is taken into account when calculating the total completion time; (3) equipment failures and the box turnover time are not considered;
step 12, the mathematical model is as follows:
the constraint s.t. is:
wherein formula (1) is the objective function, indicating that the maximum completion time is minimized; formula (2) indicates that each bay is served by exactly one device at each stage; formula (3) ensures that each bay being served has exactly one predecessor and one successor; formula (4) indicates that the jobs in stage 1 are constrained by the priority relation Φ, with job i completed after job j; formula (5) represents the time at which bay j starts to be served in stages 2 and 3; formula (6) represents the start time of each job in stage 1; formula (7) represents the end time of each job in stage 1; formula (8) represents the earliest time at which each shore bridge can start working; formula (9) indicates that any job must be processed by the equipment of the previous stage before it can be processed in the next stage; formula (10) indicates that the time at which a bay starts to be served is less than or equal to the time at which its service ends; formulas (11)-(13) give the value ranges of the three decision variables, namely the assignment variable, s_ik and e_ik;
the model parameters are as follows:
i/j denotes the index of a bay/container, k denotes the index of a stage, M_k denotes the serial number of a device in stage k, m_k denotes the total number of devices in stage k, n denotes the total number of containers, n_i denotes the number of containers in each bay, f_mk denotes the earliest time at which device m can start operating in stage k, p_ik denotes the operation time of bay i in stage k, a further parameter denotes the preparation time of device m between two adjacent operations, a_i denotes the first container number in the bay, b_i denotes the last container number in the bay, s_ik denotes the start time of job i in stage k, e_ik denotes the completion time of job i in stage k, a binary assignment variable equals 1 when container i/j is operated by device m in stage k and 0 otherwise, Ω_k denotes the set of bay numbers contained in stage k, Φ denotes the set of precedence relations under which the bays are served, and N denotes a sufficiently large positive number.
Further, the Johnson rule of the present invention specifically is:
Tasks with a long field bridge operation time and a short shore bridge operation time are selected preferentially, so that the ART can transport containers to the field bridge as early as possible; at the same time, the field bridge preferentially processes the containers with longer operation times, so that the idle time of the field bridge is minimized.
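Purely as an illustration of the selection behaviour described above (not the patent's reference implementation), the following sketch prefers waiting containers with a short shore bridge time and a long field bridge time; the tuple layout is an assumption.

```python
def johnson_pick(waiting):
    """waiting: list of (job_id, shore_bridge_time, field_bridge_time).
    Prefer jobs with a short shore bridge time and a long field bridge time,
    so the ART can deliver to the field bridge early and keep it busy."""
    return min(waiting, key=lambda job: (job[1], -job[2]))[0]

# Example: job 1 has the shortest shore bridge time, so it is picked first.
print(johnson_pick([(0, 4.0, 2.0), (1, 1.5, 5.0), (2, 2.0, 6.0)]))   # -> 1
```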
Further, the critic network of the present invention adopts a CNN architecture comprising an input layer, a convolutional layer and a fully connected layer, wherein:
an input layer: the number of input channels is 3, corresponding to the three matrices, namely the processing time matrix T, the machine operation matrix M and the job completion matrix J, and the convolution kernel size is 1×1;
a convolutional layer: the number of nodes of the convolutional layer is set according to the size of the input state matrix, and a ReLU activation function is used between the convolutional layer and the output layer;
an output layer: the output layer has a single node that outputs the value.
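A minimal PyTorch sketch of such a critic structure (3-channel input over the T, M and J matrices, a 1×1 convolution, ReLU, and a single-value output); the channel width, hidden size and state-matrix dimensions are illustrative assumptions rather than values from the patent.

```python
import torch
import torch.nn as nn

class CriticCNN(nn.Module):
    """Critic sketch: 3-channel state (T, M, J matrices) -> single value."""
    def __init__(self, n_jobs: int, n_stages: int, channels: int = 16, hidden: int = 64):
        super().__init__()
        self.conv = nn.Conv2d(3, channels, kernel_size=1)          # 1x1 convolution over the state matrices
        self.fc = nn.Linear(channels * n_jobs * n_stages, hidden)  # fully connected layer
        self.out = nn.Linear(hidden, 1)                            # single-node output: the value

    def forward(self, state):                      # state: (batch, 3, n_jobs, n_stages)
        x = torch.relu(self.conv(state))
        x = torch.relu(self.fc(x.flatten(1)))
        return self.out(x)

critic = CriticCNN(n_jobs=20, n_stages=3)          # assumed problem size
value = critic(torch.zeros(1, 3, 20, 3))           # -> tensor of shape (1, 1)
```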
Further, the structure of the actor network of the present invention adopts a CNN architecture that is the same as that of the critic network, except that the output is a specific action.
Further, the specific formula of the epsilon-greedy strategy in the step 4 is as follows:
wherein the action with the largest current Q value is selected with probability 1−ε, and an action is selected uniformly at random from all actions with a fixed probability ε.
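An illustrative sketch of this ε-greedy selection, assuming q_values holds the current Q estimate of each candidate action (the function name and values are assumptions for illustration):

```python
import random

def epsilon_greedy(q_values, epsilon: float) -> int:
    """With probability epsilon pick a uniformly random action index,
    otherwise pick the action with the largest current Q value."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

action = epsilon_greedy([0.2, 0.7, 0.1], epsilon=0.1)   # usually returns index 1
```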
Further, in the critic network adopting the CNN neural network architecture of the present invention:
the parameter update rule of the neural network CNN is based on a loss function, namely the squared error between the target value and the estimate:
L_c(θ_c) = (r + γ·max_a′ Q(s′, a′, θ_c) − Q(s, a, θ_c))²
where θ_c are the parameters of the critic network, Q(s, a, θ_c) is the estimate, and r + γ·max_a′ Q(s′, a′, θ_c) is the target value;
in addition, the loss function for an actor network is:
L_a(θ_a) = −∑ log π(a | s, θ_a) · L_c(θ_c)
where θ_a are the parameters of the actor network; when an action a has a small probability of being selected but yields a large return, the value of L_a(θ_a) increases.
Further, the specific method in step 4 of the present invention is:
In order to improve the stability of the algorithm, a method of reducing the correlation between the current Q value and the target Q value is adopted: the actor network is divided into an actor current network and an actor target network, and the critic network is divided into a critic current network and a critic target network; the loss function of the actor current network and the loss function of the critic current network are adjusted accordingly; the parameters of the actor target network and the critic target network are kept fixed for a period of time and, after every n steps, are updated from the current-network parameters according to the following formula:
θ_target ← τ·θ_current + (1 − τ)·θ_target
wherein τ is an update coefficient whose value is smaller than a certain threshold; this parameter update mode of the target networks is called soft update.
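A minimal sketch of the soft update step, assuming PyTorch-style parameter iterators; τ is the small update coefficient mentioned above, and the value 0.01 is only an illustrative choice.

```python
def soft_update(current_net, target_net, tau: float = 0.01):
    """theta_target <- tau * theta_current + (1 - tau) * theta_target."""
    for p, p_targ in zip(current_net.parameters(), target_net.parameters()):
        p_targ.data.copy_(tau * p.data + (1.0 - tau) * p_targ.data)

# usage: soft_update(critic_current, critic_target); soft_update(actor_current, actor_target)
```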
The invention has the following beneficial effects: the automatic wharf cooperative scheduling method based on deep reinforcement learning models the cooperative operation process of the shore bridge, ART and field bridge of an automated container terminal as a three-stage hybrid flow shop scheduling problem with unrelated parallel machines, and regards each device as an intelligent agent, so that the situation in which the scheduling plan must be re-planned when equipment fails can be avoided and the scheduling becomes more flexible; the method uses a neural network for function approximation, uses the experience replay technique to overcome the problem of insufficient storage space in a multi-agent environment, can effectively eliminate the correlation between adjacent states s_t and s_{t+1}, improves the update efficiency of the agents, and reduces the number of iterations.
Drawings
The invention will be further described with reference to the accompanying drawings and examples, in which:
fig. 1 is a flowchart of a job shop scheduling method based on deep reinforcement learning according to an embodiment of the present invention.
FIG. 2 is a schematic diagram of a neural network according to an embodiment of the present invention.
FIG. 3 is a schematic diagram of the critic current network, the actor current network, the critic target network and the actor target network according to an embodiment of the invention.
FIG. 4 is a schematic diagram of the parameter updating method of the critic current network, the actor current network, the critic target network and the actor target network according to the embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Fig. 1 shows a flowchart of an automated wharf co-scheduling method based on deep reinforcement learning, which specifically includes the following steps:
the container is regarded as a workpiece, and the processing operation sequentially comprises 3 stages: the first stage is a wharf frontier stage, and the parallel machine is a shore bridge; the second stage is a horizontal transportation stage, and the parallel machine is ART (Intelligent Robot of transportation); the third stage is a storage yard unloading stage, and the parallel machines are field bridges.
Step 2, modeling the scheduling problem of the mixed flow shop into a multi-agent Markov decision process:
the intelligent agent: each device (a shore bridge, an ART and a field bridge) is used as an agent, and m devices correspond to m agents in total;
state space: the state features are represented as three matrices, namely a processing time matrix T composed of the processing time of each operation, an equipment operation matrix M composed of the current operating states of the handling equipment, and a job completion matrix J composed of the completion state of each job;
an action space: the action space is divided into 2 categories according to stage characteristics, wherein the first category is the heuristic Johnson rule, corresponding to the parallel machines of the quayside crane unloading stage and the yard unloading stage; the second category corresponds to the ART transportation stage and is composed of 5 heuristic priority rules, namely FIFO (first come, first processed), SPT (preferentially selecting the workpiece with the shortest processing time), LWKR (preferentially selecting the workpiece with the least remaining processing time), SPT/TWK (preferentially selecting the workpiece with the smallest ratio of operation time to total processing time) and SRM (preferentially selecting the workpiece with the least remaining processing time excluding the current operation);
the reward function: the scheduling objective is to minimize the maximum completion time, and since the production cycle is the sum of the processing times, the immediate reward is defined as r_k = λ·t_p, where λ is a constant in [0,1] and t_p is the processing time on each machine; the long-term reward is set as a function of γ, C_opt and C_max, where γ is a number in [0,1], C_opt is the optimal scheduling result, and C_max is the predicted maximum completion time.
Step 3, initializing an experience pool D and its capacity N, initializing the critic network parameters θ_c and the actor network parameters θ_a, initializing the environment to obtain the initial state s_0, and initializing the action space;
Step 4, according to the ε-greedy strategy, selecting an action a_t (randomly with probability ε), executing the corresponding scheduling operation in the current state s_t, updating the corresponding state matrix, observing the real-time reward r_t, and perceiving the next state s_{t+1};
Step 5, storing the tuple (s_t, a_t, r_t, s_{t+1}) in the experience pool D as a data set for training the current networks;
Step 6, training the critic current network and the actor current network, updating the current network parameters θ_c and θ_a, and updating the parameters of the critic target network and the actor target network by the soft update method, until the set number of iterations is reached;
Step 7, after all the jobs are finished, the environment calculates the maximum completion time C_max.
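Steps 3-7 above could be organised as in the following illustrative sketch of the interaction loop with an experience pool; the env interface (reset, step, makespan) and all helper names are assumptions modelled on the description, not the patent's reference implementation.

```python
import random
from collections import deque

def train(env, select_action, update_networks,
          episodes=100, capacity=10000, batch_size=32):
    """Interaction loop sketched from steps 3-7: fill the experience pool D,
    sample mini-batches to train the current networks, and report C_max."""
    replay = deque(maxlen=capacity)                    # experience pool D with capacity N
    for _ in range(episodes):
        state, done = env.reset(), False               # initial state s_0
        while not done:
            action = select_action(state)                        # epsilon-greedy choice
            next_state, reward, done = env.step(action)          # scheduling operation
            replay.append((state, action, reward, next_state))   # store (s_t, a_t, r_t, s_{t+1})
            if len(replay) >= batch_size:
                update_networks(random.sample(list(replay), batch_size))
            state = next_state
    return env.makespan()                              # maximum completion time C_max
```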
The three-stage hybrid flow shop scheduling mathematical model with unrelated parallel machines is specifically as follows:
through the simplification of the actual scene, the following assumptions are made for the model: (1) operations of the shore bridge and the yard bridge are performed in shell units, and the shore bridge/yard bridge cannot move to the next shell before unloading of one shell is completed; (2) the preparation time of the equipment at each stage is longer, such as the moving time of a shore bridge/a field bridge, ART turning around and avoiding time and the like, so the preparation time of the equipment is considered when calculating the total completion time; (3) the equipment failure condition is not considered, and the box turnover time is not considered.
The mathematical model is as follows:
the constraint s.t. is:
wherein formula (1) is the objective function, indicating that the maximum completion time is minimized; formula (2) indicates that each bay is served by exactly one device at each stage; formula (3) ensures that each bay being served has exactly one predecessor and one successor; formula (4) indicates that the jobs in stage 1 are constrained by the priority relation Φ, with job i completed after job j; formula (5) represents the time at which bay j starts to be served in stages 2 and 3; formula (6) represents the start time of each job in stage 1; formula (7) represents the end time of each job in stage 1; formula (8) represents the earliest time at which each shore bridge can start working; formula (9) indicates that any job must be processed by the equipment of the previous stage before it can be processed in the next stage; formula (10) indicates that the time at which a bay starts to be served is less than or equal to the time at which its service ends; formulas (11)-(13) give the value ranges of the three decision variables, namely the assignment variable, s_ik and e_ik;
the model parameters are as follows:
the Johnson rule aims to preferentially select tasks with long operation time of a bridge and short operation time of a shore bridge, so that ART can transport containers to the bridge as soon as possible, and meanwhile, the bridge preferentially processes the containers with longer operation time, so that the idle time of the bridge is shortest.
The critic network adopts a CNN architecture, shown in fig. 3, composed of an input layer, a convolutional layer and a fully connected layer, where:
An input layer: the number of input channels is 3, corresponding to the three matrices, namely the processing time matrix T, the machine operation matrix M and the job completion matrix J; kernel_size is 1×1;
A convolutional layer: the number of nodes of the convolutional layer is set according to the size of the input state matrix, and a ReLU activation function is used between the convolutional layer and the output layer;
An output layer: the output layer has a single node that outputs the value.
The structure of the actor network is also a typical CNN architecture, the same as that of the critic network, except that the output is a specific action.
The selection of the agent action adopts an ε-greedy strategy, expressed by the following formula:
The strategy means that the action with the largest current Q value is selected with probability 1−ε, and an action is selected uniformly at random from all actions with a fixed probability ε.
The parameter update rule of the neural network CNN is based on a loss function. The loss function of the critic network is the squared error between the target value and the estimate:
L_c(θ_c) = (r + γ·max_a′ Q(s′, a′, θ_c) − Q(s, a, θ_c))²
where θ_c are the parameters of the critic network, Q(s, a, θ_c) is the estimate, and r + γ·max_a′ Q(s′, a′, θ_c) is the target value.
In addition, the loss function for an actor network is:
L_a(θ_a) = −∑ log π(a | s, θ_a) · L_c(θ_c)
where θ_a are the parameters of the actor network; when an action a has a small probability of being selected but yields a large return, the value of L_a(θ_a) increases.
In order to improve the stability of the algorithm, a method of reducing the correlation between the current Q value and the target Q value is adopted, as shown in fig. 4: the actor network is divided into an actor current network and an actor target network, and the critic network is divided into a critic current network and a critic target network. The loss function of the actor current network and the loss function of the critic current network are adjusted accordingly. The parameters of the actor target network and the critic target network are kept fixed for a period of time and, after every n steps, are updated from the current-network parameters according to the following formula:
θ_target ← τ·θ_current + (1 − τ)·θ_target
wherein τ is an update coefficient and generally takes a small value. This parameter update method of the target networks is called soft update.
Finally, it should be noted that the above-mentioned embodiments are only intended to illustrate and explain the present invention, and are not intended to limit the present invention within the scope of the described embodiments.
Furthermore, it will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, and that many variations and modifications may be made in accordance with the teachings of the present invention, all of which fall within the scope of the invention as claimed.
Claims (10)
1. An automatic wharf cooperative scheduling method based on deep reinforcement learning is characterized by comprising the following steps:
step 1, modeling the cooperative operation process of an automatic container wharf shore bridge, ART and a field bridge into a three-stage hybrid flow shop scheduling model with irrelevant parallel machines;
step 2, modeling the mixed flow shop scheduling model into a multi-agent Markov decision process;
step 3, initializing an experience pool D and its capacity N, initializing the critic network parameters θ_c and the actor network parameters θ_a, initializing the environment to obtain the initial state s_0, and initializing the action space;
step 4, according to the ε-greedy strategy, selecting an action a_t (randomly with probability ε), executing the corresponding scheduling operation in the current state s_t, updating the corresponding state matrix, observing the real-time reward r_t, and perceiving the next state s_{t+1};
step 5, storing the tuple (s_t, a_t, r_t, s_{t+1}) in the experience pool D as a data set for training the current networks;
step 6, training the critic current network and the actor current network, updating the current network parameters θ_c and θ_a, and updating the parameters of the critic target network and the actor target network by the soft update method, until the set number of iterations is reached;
step 7, after all the jobs are finished, calculating the maximum completion time C_max and generating the optimal scheduling plan.
2. The automatic wharf collaborative scheduling method based on deep reinforcement learning of claim 1, wherein the hybrid flow shop scheduling model in the step 1 is specifically:
the container is regarded as a workpiece, and the processing operation sequentially comprises 3 stages: the first stage is a wharf frontier stage, and the parallel machine is a shore bridge; the second stage is a horizontal transportation stage, and the parallel machine is ART; the third stage is a storage yard unloading stage, and the parallel machines are field bridges.
3. The automated wharf collaborative scheduling method based on deep reinforcement learning of claim 1, wherein the markov decision process in the step 2 is specifically:
the intelligent agent: each device is taken as an intelligent agent, the devices comprise a shore bridge, an ART and a field bridge, and m devices correspond to m intelligent agents;
state space: the state features are represented as three matrices, namely a processing time matrix T composed of the processing time of each operation, an equipment operation matrix M composed of the current operating states of the handling equipment, and a job completion matrix J composed of the completion state of each job;
an action space: the action space is divided into 2 categories according to stage characteristics, wherein the first category is the heuristic Johnson rule, corresponding to the parallel machines of the quayside crane unloading stage and the yard unloading stage; the second category corresponds to the ART transportation stage and comprises 5 heuristic priority rules: the first-in-first-out principle (FIFO), preferentially selecting the workpiece with the shortest processing time (SPT), preferentially selecting the workpiece with the least remaining processing time (LWKR), preferentially selecting the workpiece with the smallest ratio of operation time to total processing time (SPT/TWK), and preferentially selecting the workpiece with the least remaining processing time excluding the current operation (SRM);
the reward function: the scheduling objective is to minimize the maximum completion time, and since the production cycle is the sum of the processing times, the immediate reward is defined as r_k = λ·t_p, where λ is a constant in [0,1] and t_p is the processing time on each machine; the long-term reward is set as a function of γ, C_opt and C_max, where γ is a number in [0,1], C_opt is the optimal scheduling result, and C_max is the predicted maximum completion time.
4. The automated wharf co-scheduling method based on deep reinforcement learning according to claim 1, wherein the three-stage hybrid flow shop scheduling model with uncorrelated parallel machines in step 1 is specifically:
step 11, simplifying the actual scene and making the following assumptions for the model: (1) operations of the shore bridge and the field bridge are performed in units of bays, and a shore bridge/field bridge cannot move to the next bay before unloading of the current bay is completed; (2) the preparation time of the equipment at each stage, including the movement time of the shore bridge/field bridge and the turning-around and avoidance time of the ART, is not negligible, so the equipment preparation time is taken into account when calculating the total completion time; (3) equipment failures and the box turnover time are not considered;
step 12, the mathematical model is as follows:
the constraint s.t. is:
wherein formula (1) is the objective function, indicating that the maximum completion time is minimized; formula (2) indicates that each bay is served by exactly one device at each stage; formula (3) ensures that each bay being served has exactly one predecessor and one successor; formula (4) indicates that the jobs in stage 1 are constrained by the priority relation Φ, with job i completed after job j; formula (5) represents the time at which bay j starts to be served in stages 2 and 3; formula (6) represents the start time of each job in stage 1; formula (7) represents the end time of each job in stage 1; formula (8) represents the earliest time at which each shore bridge can start working; formula (9) indicates that any job must be processed by the equipment of the previous stage before it can be processed in the next stage; formula (10) indicates that the time at which a bay starts to be served is less than or equal to the time at which its service ends; formulas (11)-(13) give the value ranges of the three decision variables, namely the assignment variable, s_ik and e_ik;
the model parameters are as follows:
i/j denotes the index of a bay/container, k denotes the index of a stage, M_k denotes the serial number of a device in stage k, m_k denotes the total number of devices in stage k, n denotes the total number of containers, n_i denotes the number of containers in each bay, f_mk denotes the earliest time at which device m can start operating in stage k, p_ik denotes the operation time of bay i in stage k, a further parameter denotes the preparation time of device m between two adjacent operations, a_i denotes the first container number in the bay, b_i denotes the last container number in the bay, s_ik denotes the start time of job i in stage k, e_ik denotes the completion time of job i in stage k, a binary assignment variable equals 1 when container i/j is operated by device m in stage k and 0 otherwise, Ω_k denotes the set of bay numbers contained in stage k, Φ denotes the set of precedence relations under which the bays are served, and N denotes a sufficiently large positive number.
5. The automated wharf collaborative scheduling method based on deep reinforcement learning of claim 3, wherein the Johnson rule is specifically as follows:
Tasks with a long field bridge operation time and a short shore bridge operation time are selected preferentially, so that the ART can transport containers to the field bridge as early as possible; at the same time, the field bridge preferentially processes the containers with longer operation times, so that the idle time of the field bridge is minimized.
6. The automated wharf co-scheduling method based on deep reinforcement learning of claim 1, wherein the critic network adopts a CNN network architecture and comprises an input layer, a convolutional layer and a full connection layer, wherein:
an input layer: the number of input channels is 3, corresponding to the three matrices, namely the processing time matrix T, the machine operation matrix M and the job completion matrix J, and the convolution kernel size is 1×1;
a convolutional layer: the number of nodes of the convolutional layer is set according to the size of the input state matrix, and a ReLU activation function is used between the convolutional layer and the output layer;
an output layer: the output layer has a single node that outputs the value.
7. The automated wharf co-scheduling method based on deep reinforcement learning of claim 6, wherein the structure of the actor network adopts a CNN architecture that is the same as that of the critic network, except that the output is a specific action.
8. The automatic wharf co-scheduling method based on deep reinforcement learning of claim 1, wherein the epsilon-greedy strategy in the step 4 has a specific formula as follows:
wherein the action with the largest current Q value is selected with probability 1−ε, and an action is selected uniformly at random from all actions with a fixed probability ε.
9. The automated wharf co-scheduling method based on deep reinforcement learning of claim 6, wherein in the critic network adopting the CNN neural network architecture:
the parameter update rule of the neural network CNN is based on a loss function, namely the squared error between the target value and the estimate:
L_c(θ_c) = (r + γ·max_a′ Q(s′, a′, θ_c) − Q(s, a, θ_c))²
where θ_c are the parameters of the critic network, Q(s, a, θ_c) is the estimate, and r + γ·max_a′ Q(s′, a′, θ_c) is the target value;
in addition, the loss function for an actor network is:
L_a(θ_a) = −∑ log π(a | s, θ_a) · L_c(θ_c)
where θ_a are the parameters of the actor network; when an action a has a small probability of being selected but yields a large return, the value of L_a(θ_a) increases.
10. The automatic wharf co-scheduling method based on deep reinforcement learning of claim 8, wherein the specific method in the step 4 is as follows:
in order to improve the stability of the algorithm, a method of reducing the correlation between the current Q value and the target Q value is adopted: the actor network is divided into an actor current network and an actor target network, and the critic network is divided into a critic current network and a critic target network; the loss function of the actor current network and the loss function of the critic current network are adjusted accordingly; the parameters of the actor target network and the critic target network are kept fixed for a period of time and, after every n steps, are updated from the current-network parameters according to the following formula:
θ_target ← τ·θ_current + (1 − τ)·θ_target
wherein τ is an update coefficient whose value is smaller than a certain threshold; this parameter update mode of the target networks is called soft update.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111299059.1A CN113988443B (en) | 2021-11-04 | 2021-11-04 | Automatic wharf collaborative scheduling method based on deep reinforcement learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111299059.1A CN113988443B (en) | 2021-11-04 | 2021-11-04 | Automatic wharf collaborative scheduling method based on deep reinforcement learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113988443A true CN113988443A (en) | 2022-01-28 |
CN113988443B CN113988443B (en) | 2024-06-28 |
Family
ID=79746379
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111299059.1A Active CN113988443B (en) | 2021-11-04 | 2021-11-04 | Automatic wharf collaborative scheduling method based on deep reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113988443B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115542849A (en) * | 2022-08-22 | 2022-12-30 | 苏州诀智科技有限公司 | Container wharf intelligent ship control and distribution method, system, storage medium and computer |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105740979A (en) * | 2016-01-29 | 2016-07-06 | 上海海事大学 | Intelligent dispatching system and method for multi-AGV (Automatic Guided Vehicle) of automatic container terminal |
CN112434870A (en) * | 2020-12-01 | 2021-03-02 | 大连理工大学 | Dual-automation field bridge dynamic scheduling method for vertical arrangement of container areas |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105740979A (en) * | 2016-01-29 | 2016-07-06 | 上海海事大学 | Intelligent dispatching system and method for multi-AGV (Automatic Guided Vehicle) of automatic container terminal |
CN112434870A (en) * | 2020-12-01 | 2021-03-02 | 大连理工大学 | Dual-automation field bridge dynamic scheduling method for vertical arrangement of container areas |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115542849A (en) * | 2022-08-22 | 2022-12-30 | 苏州诀智科技有限公司 | Container wharf intelligent ship control and distribution method, system, storage medium and computer |
CN115542849B (en) * | 2022-08-22 | 2023-12-05 | 苏州诀智科技有限公司 | Container terminal intelligent ship control and dispatch method, system, storage medium and computer |
Also Published As
Publication number | Publication date |
---|---|
CN113988443B (en) | 2024-06-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112884239B (en) | Space detonator production scheduling method based on deep reinforcement learning | |
Kim et al. | A look-ahead dispatching method for automated guided vehicles in automated port container terminals | |
TWI663568B (en) | Material scheduling method and system based on real-time status of semiconductor device | |
CN111882215B (en) | Personalized customization flexible job shop scheduling method containing AGV | |
CN111160755B (en) | Real-time scheduling method for aircraft overhaul workshop based on DQN | |
CN114611897B (en) | Intelligent production line self-adaptive dynamic scheduling strategy selection method | |
CN112434870B (en) | Dual-automation field bridge dynamic scheduling method for vertical arrangement of container areas | |
CN112836974B (en) | Dynamic scheduling method for multiple field bridges between boxes based on DQN and MCTS | |
CN115454005A (en) | Manufacturing workshop dynamic intelligent scheduling method and device oriented to limited transportation resource scene | |
CN113988443A (en) | Automatic wharf cooperative scheduling method based on deep reinforcement learning | |
CN110554673B (en) | Intelligent RGV processing system scheduling method and device | |
CN116500986A (en) | Method and system for generating priority scheduling rule of distributed job shop | |
CN115793657B (en) | Distribution robot path planning method based on temporal logic control strategy | |
CN111353646A (en) | Steel-making flexible scheduling optimization method with switching time, system, medium and equipment | |
CN115689049A (en) | Multi-target workshop scheduling method for improving gray wolf optimization algorithm | |
CN117314055A (en) | Intelligent manufacturing workshop production-transportation joint scheduling method based on reinforcement learning | |
CN117331700B (en) | Computing power network resource scheduling system and method | |
CN117891220A (en) | Distributed mixed flow shop scheduling method based on multi-agent deep reinforcement learning | |
CN117196261B (en) | Task instruction distribution method based on field bridge operation range | |
Kouvakas et al. | A modular supervisory control scheme for the safety of an automated manufacturing system | |
CN113050644A (en) | AGV (automatic guided vehicle) scheduling method based on iterative greedy evolution | |
CN112395690A (en) | Reinforced learning-based shipboard aircraft surface guarantee flow optimization method | |
CN112053046B (en) | Automatic container terminal AGV reentry and reentry path planning method with time window | |
Panov et al. | Automatic formation of the structure of abstract machines in hierarchical reinforcement learning with state clustering | |
Liao et al. | Learning to schedule job-shop problems via hierarchical reinforcement learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |