CN116500994B - Dynamic multi-target scheduling method for low-carbon distributed flexible job shop - Google Patents

Dynamic multi-target scheduling method for low-carbon distributed flexible job shop

Info

Publication number
CN116500994B
CN116500994B (application CN202310494027.XA)
Authority
CN
China
Prior art keywords
workpiece
point
scheduling
rescheduling
energy consumption
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310494027.XA
Other languages
Chinese (zh)
Other versions
CN116500994A (en)
Inventor
陈光柱
陈懿
廖晓鹃
侯英杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu University of Technology
Original Assignee
Chengdu University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu University of Technology filed Critical Chengdu University of Technology
Priority to CN202310494027.XA priority Critical patent/CN116500994B/en
Publication of CN116500994A publication Critical patent/CN116500994A/en
Application granted granted Critical
Publication of CN116500994B publication Critical patent/CN116500994B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G05 - CONTROLLING; REGULATING
    • G05B - CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B19/00 - Programme-control systems
    • G05B19/02 - Programme-control systems electric
    • G05B19/418 - Total factory control, i.e. centrally controlling a plurality of machines, e.g. direct or distributed numerical control [DNC], flexible manufacturing systems [FMS], integrated manufacturing systems [IMS] or computer integrated manufacturing [CIM]
    • G05B19/41865 - Total factory control characterised by job scheduling, process planning, material flow
    • G - PHYSICS
    • G05 - CONTROLLING; REGULATING
    • G05B - CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B2219/00 - Program-control systems
    • G05B2219/30 - Nc systems
    • G05B2219/32 - Operator till task planning
    • G05B2219/32252 - Scheduling production, machining, job shop

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Manufacturing & Machinery (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • General Factory Administration (AREA)

Abstract

The invention discloses a dynamic multi-objective scheduling method for a low-carbon distributed flexible job shop, belonging to the field of shop scheduling. A dynamic multi-objective scheduling planning model of the low-carbon distributed flexible job shop is established according to its low-carbon job scheduling requirements; a state space of the low-carbon distributed flexible job shop is constructed, an action space of composite scheduling rules is designed, and an instant reward function and a round reward function are proposed; a Rainbow DQN deep reinforcement learning algorithm is proposed to solve the dynamic multi-objective scheduling planning model; the Rainbow agent constantly interacts with the scheduling environment to obtain a better scheduling rule at each scheduling point. The method improves the decision-making efficiency of manufacturing enterprises, adaptively and rapidly generates better solutions, effectively reduces the loss caused by delay time, reduces energy consumption, and exhibits robustness and generalization.

Description

Dynamic multi-target scheduling method for low-carbon distributed flexible job shop
Technical Field
The invention relates to the field of workshop scheduling, and in particular to a dynamic multi-objective scheduling method for a low-carbon distributed flexible job shop based on deep reinforcement learning.
Background
In recent years, with globalization and the development of science and technology, many manufacturing enterprises have gradually shifted from the traditional single job shop mode to a distributed job shop mode, thereby reducing labor and raw material costs and improving production efficiency. Compared with the traditional flexible job shop scheduling problem, the distributed flexible job shop breaks the limitation of a single job shop. Each workpiece may be transported to multiple job shops at different locations, and each process may be assigned to different equipment for processing. The distributed production mode is therefore closer to the actual production environment. Because distributed flexible job shops face more complex and diverse unexpected events, rescheduling from scratch when dynamic scheduling is required in an emergency is time-consuming, demands substantial expert experience, and can hardly satisfy a real-time production environment that requires high scheduling quality. In addition, low-carbon manufacturing is a new scheduling mode that has drawn increasing attention from academia and engineering owing to rising energy costs and worsening environmental pollution.
Flexible job shop scheduling is an NP-hard problem, and its extended variants are even more complex; owing to the uncertainty of the scheduling process, the scheduling objective shifts from solution optimality to fast, reasonable solutions. The low-carbon distributed flexible job shop is characterized by numerous constraints, a complex and changeable environment, and strong dynamics. Traditional job shop scheduling strategies emphasize economic factors such as completion time and equipment utilization, while ignoring energy and environmental factors such as the energy consumed during processing and transportation. In the field of distributed job shop scheduling, most conventional scheduling models do not allow workpieces to move between workshops. It should also be noted that the scheduling algorithms widely used by manufacturing enterprises are heuristics; although heuristics are fast, their scheduling quality deteriorates as the scheduling scale increases.
Disclosure of Invention
Therefore, in view of the defects or improvement demands of the prior art, the invention provides a dynamic multi-objective scheduling method for a low-carbon distributed flexible job shop based on deep reinforcement learning, in which a Rainbow agent continuously interacts with the scheduling environment under the Rainbow DQN framework to obtain a better scheduling rule at each rescheduling point or decision point. An offline-trained scheduling strategy is adopted, so that the time-consuming part is confined to offline training, new scheduling problems are solved quickly online, and better solutions are generated adaptively during application.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
A dynamic multi-target scheduling method for a low-carbon distributed flexible job shop comprises the following steps:
S1, establishing a dynamic multi-objective scheduling planning model of the low-carbon distributed flexible job shop according to the low-carbon job scheduling requirements of the low-carbon distributed flexible job shop;
S2, constructing a state space of the low-carbon distributed flexible job shop, designing an action space of composite scheduling rules, and proposing an instant reward function and a round reward function;
S3, proposing a Rainbow DQN deep reinforcement learning algorithm, and solving the dynamic multi-objective scheduling planning model of the low-carbon distributed flexible job shop.
Specifically, in the step S1, the dynamic multi-objective scheduling planning model of the low-carbon distributed flexible job shop is composed of a multi-objective function and a series of constraint conditions:
The multi-objective function includes a total delay time function and a total energy consumption function. The total delay time function is calculated from the workpiece cut-off time, the time required to complete the workpiece processes, and the decision variables. The total energy consumption function is calculated from the processing energy consumption, the equipment idle energy consumption and the transportation energy consumption.
The series constraint is set as follows:
(1) A process can only be performed on one piece of equipment in one workshop;
(2) Each process can be processed only after its workpiece has arrived;
(3) The completion time of a process equals its start time plus its processing time;
(4) The processes of each workpiece must follow the given precedence order;
(5) Processes of different workpieces on the same device must be performed sequentially;
(6) Inter-workshop and inter-device transportation times are not considered simultaneously: when the inter-workshop transportation time is considered, the inter-device transportation time is ignored;
Specifically, in step S2, the state space of the low-carbon distributed flexible job shop is constructed, the action space of the composite scheduling rules is designed, and an instant reward function and a round reward function are proposed. The state space comprehensively reflects the production state at a rescheduling point or decision point and contains 19 items of state information of the low-carbon distributed flexible job shop, including the predicted delay rate, the actual delay rate, the predicted weighted delay rate, the average utilization of all equipment in all job shops, the standard deviation of equipment utilization, the average completion rate of all workpieces, the average completion rate of all processes, the standard deviation of the workpiece completion rates, the energy consumption index of all completed processes, and the simplified completion time of the last process processed on each device at the rescheduling point. Based on the state space, 7 workpiece selection rules and 6 equipment allocation rules are set, and a total of 42 composite scheduling rules is obtained through their Cartesian product. The 10 rules with the best average results are selected as the action space. The instant reward function comprises an economic index, an energy consumption index and an equipment index. The economic index is calculated from the actual delay rate, the predicted weighted delay rate, the predicted delay rate, the average equipment utilization, the minimum delay time and the current delay time. The energy consumption index is calculated from the lowest total energy consumption and the current total energy consumption. The equipment index gives a negative reward based on the simplified completion time of the last process on all equipment and feeds it back to the agent, so that the agent converges faster and achieves a better convergence effect. The round reward function produces a negative value: the low-carbon distributed flexible job shop computes the total delay time and the total energy consumption of each training round, and the larger these two values are, the larger the penalty fed back from the scheduling environment to the agent.
Specifically, step S3 proposes a Rainbow DQN deep reinforcement learning algorithm to solve the dynamic multi-objective scheduling planning model of the low-carbon distributed flexible job shop; the Rainbow DQN deep reinforcement learning algorithm comprises a Rainbow agent and the scheduling environment of the low-carbon distributed flexible job shop; the interaction of the Rainbow agent with the scheduling environment of the low-carbon distributed flexible job shop is a discrete-time Markov decision process model.
Compared with the prior art, the invention has the following beneficial effects:
(1) The invention provides a mathematical model of a low-carbon distributed flexible job shop aimed at real shop scheduling problems; according to the characteristics of such shops it remedies some gaps in the scheduling field and extends flexible job shop scheduling research in a practical sense;
(2) The Rainbow DQN algorithm used in the invention selects a composite scheduling rule according to the current and future state information of the low-carbon distributed flexible job shop, and balances the two optimization objectives according to their constraint conditions;
(3) Compared with a single composite scheduling rule and the standard DQN, the proposed Rainbow DQN can perceive the state of the low-carbon distributed flexible job shop at a rescheduling point or decision point and select a better scheduling rule to achieve high production efficiency and low energy consumption;
(4) The disclosed deep-reinforcement-learning-based scheduling method for the low-carbon distributed flexible job shop comprises three parts: the simulation environment, offline training and practical application. In offline training the deep reinforcement learning agent interacts with the scheduling environment and learns scheduling knowledge from the exchanged composite scheduling rules and state information; in practical application the agent directly uses the scheduling knowledge saved during offline training and provides fast, reasonable scheduling schemes for new scheduling instances from the scheduling environment.
Drawings
FIG. 1 is a diagram of the Rainbow DQN architecture in the low-carbon distributed flexible job shop;
FIG. 2 lists the parameters of the mathematical model;
FIG. 3 lists the relevant parameters of the formulas in the state space;
FIG. 4 shows the training curve of the total delay time;
FIG. 5 shows the training curve of the total energy consumption.
Detailed Description
The above-described aspects are further described below with reference to the drawings and examples. Embodiments of the present invention include, but are not limited to, the following examples.
The implementation of the invention mainly comprises the steps of establishing a dynamic multi-objective scheduling planning model of a low-carbon distributed flexible job shop, establishing a state space of the low-carbon distributed flexible job shop, designing an action space of a composite scheduling rule, providing an instant rewarding function and a round rewarding function, and providing a Rainbow DQN deep reinforcement learning algorithm to solve the model. The method comprises the following specific steps:
S1, according to the low-carbon job scheduling requirements of the low-carbon distributed flexible job shop, establishing the multi-objective function and a series of constraint conditions of the dynamic multi-objective scheduling planning model, whose main purposes are to reduce the total delay time of the processing process and to reduce the total energy consumption. This step includes establishing the multi-objective function and establishing the series of constraints; FIG. 2 contains the parameters of the mathematical model:
(1) The multi-objective function includes a total delay time function and a total energy consumption function:
The total delay time function TT is calculated from the cut-off time D_i and the completion time CT_i of each workpiece J_i, where N is the total number of workpieces:
TT = Σ_{i=1}^{N} max(CT_i - D_i, 0)
The total energy consumption function TE is calculated from the equipment processing energy consumption, the equipment idle energy consumption and the workpiece transportation energy consumption;
TE=procE+idleE+transE
where procE denotes the equipment processing energy consumption, idleE denotes the equipment idle energy consumption, and transE denotes the workpiece transportation energy consumption;
procE is represented as the sum, over all workpieces J_i, processes O_ij, workshops f and devices k, of x_fk,ij · pe_fk · pt_fk,ij, where pt_fk,ij represents the processing time of process O_ij on device k in workshop f, pe_fk represents the unit processing energy consumption of device k in workshop f, x_fk,ij is a 0-1 decision variable indicating whether process O_ij is performed on device k in workshop f, F is the total number of workshops, M_f is the total number of devices in workshop f, and n_i is the total number of processes of workpiece J_i;
idleE is represented as the sum, over all devices and over all pairs of consecutive processes on a device, of ie_fk · y_ij,gh · x_fk,gh · (S_g,h - C_i,j), where ie_fk denotes the unit idle energy consumption of device k in workshop f, S_g,h denotes the start time of process O_gh, C_i,j denotes the end time of process O_ij, y_ij,gh is a 0-1 decision variable determining whether the successor process of O_ij is O_gh, x_fk,gh is a 0-1 decision variable indicating whether process O_gh is performed on device k in workshop f, n_i is the total number of processes of workpiece J_i, and n_g is the total number of processes of workpiece J_g;
transE is represented as the sum, over all workpiece movements, of te multiplied by the corresponding transportation time and the corresponding 0-1 decision variable, where te denotes the unit transportation energy consumption between workshops/devices, transF_fu denotes the transportation time of a workpiece from workshop f to workshop u, transM_lk denotes the transportation time of a workpiece from device l to device k, one 0-1 decision variable determines whether a workpiece is transported from workshop f to workshop u, and another 0-1 decision variable determines whether a workpiece is transported from device l to device k within the same workshop;
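As an illustration only, the total energy consumption model above can be evaluated as in the following minimal Python sketch; the data layout and the names (total_energy, pe, ie, te, ops, gaps, trips) are assumptions made for the example and are not part of the patent.

```python
# Minimal sketch (assumed data layout, not the patent's notation): each scheduled
# process is recorded as (shop f, device k, processing time); idle gaps and
# transport moves carry their durations.
def total_energy(operations, idle_gaps, transports, pe, ie, te):
    """TE = procE + idleE + transE."""
    procE = sum(pe[f][k] * t for f, k, t in operations)      # processing energy
    idleE = sum(ie[f][k] * gap for f, k, gap in idle_gaps)   # idle energy
    transE = sum(te * d for d in transports)                 # transportation energy
    return procE + idleE + transE

# Example: one shop with two devices
pe = [[2.0, 1.5]]                      # unit processing energy pe_fk
ie = [[0.4, 0.3]]                      # unit idle energy ie_fk
ops = [(0, 0, 5.0), (0, 1, 3.0)]       # two scheduled processes
gaps = [(0, 1, 2.0)]                   # one idle gap on device 1
trips = [1.0]                          # one transport move
print(total_energy(ops, gaps, trips, pe, ie, te=0.8))   # 14.5 + 0.6 + 0.8 = 15.9
```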
(2) The series of constraints include:
① A process can only be performed on one piece of equipment in one workshop;
② Each process can be processed only after its workpiece has arrived, where A_i represents the arrival time of workpiece J_i;
③ The completion time C_i,j of process O_ij must equal its start time S_i,j plus its processing time;
④ The processes of each workpiece must follow the given precedence order;
⑤ Processes of different workpieces on the same device must be performed sequentially;
⑥ Inter-workshop and inter-device transportation times are not considered simultaneously: when the inter-workshop transportation time is considered, the inter-device transportation time is ignored.
S2, constructing a state space of the low-carbon distributed flexible job shop, designing an action space of a compound scheduling rule, and providing an instant rewarding function and a round rewarding function.
(1) State space of low-carbon distributed flexible job shop (reference is made to FIGS. 2 and 3 for details of parameters):
The state space of the low-carbon distributed flexible job shop comprehensively reflects the production state of the rescheduling point or the decision point, and contains 19 pieces of state information of the low-carbon distributed flexible job shop. The state space of the low-carbon distributed flexible job shop includes the predicted delay rate, the actual delay rate, the predicted weighted delay rate, the average utilization rate of all devices in all job shops, the standard deviation of the device utilization rate, the average completion rate of all workpieces, the average completion rate of all processes, the standard deviation of the completion rate of all workpieces, the energy consumption index of all completed processes, and the simplified completion time of the last process processed on the device at the rescheduling point or decision point.
① Predicted delay rate Tard_e(t):
Tard_e(t) = |TardJ_e(t)| / |UcompJ(t)|
where TardJ_e(t) represents the set of workpieces predicted to be delayed at rescheduling point or decision point t, UcompJ(t) represents the set of workpieces whose processing is not completed at rescheduling point or decision point t, the conditions NPO_i(t) < n_i and EDT_i(t) > 0 determine whether a workpiece is predicted to be delayed at rescheduling point or decision point t, NPO_i(t) represents the number of processes of workpiece J_i completed at rescheduling point or decision point t, and EDT_i(t) represents the predicted delay time of workpiece J_i at rescheduling point or decision point t;
② Actual delay rate Tard_a(t):
Tard_a(t) = |TardJ_a(t)| / |UcompJ(t)|
where TardJ_a(t) represents the set of workpieces actually delayed at rescheduling point or decision point t; a workpiece is judged to be actually delayed at the rescheduling point or decision point when NPO_i(t) < n_i and the completion time of the machined processes of workpiece J_i already exceeds its cut-off time D_i;
③ Predicted weighted delay rate WTard_e(t): the predicted delay of each workpiece is weighted by W_i, the weight of workpiece J_i, and the rate is computed over the unfinished workpieces. The predicted time required to machine the remaining processes of workpiece J_i at rescheduling point or decision point t is calculated from the average processing time of each remaining process O_ij over all available equipment in all workshops; the remaining estimated transit time of workpiece J_i at the rescheduling point or decision point is calculated from the average transportation time from the equipment where workpiece J_i completed its last process to the equipment where the next process is processed, with F_i,j denoting the set of workshops in which process O_ij can be processed;
④ Average utilization UR_ave(t) of all devices in all workshops at rescheduling point or decision point t: UR_ave(t) is the mean of the device utilizations UR_fk(t), where UR_fk(t) represents the utilization of device k in workshop f at rescheduling point or decision point t, calculated from the completion time of the last process on device k in workshop f;
⑤ Standard deviation UR_std(t) of the device utilizations at rescheduling point or decision point t;
⑥ Average completion rate CRO_ave(t) of all processes at rescheduling point or decision point t;
⑦ Average completion rate CRJ_ave(t) of all workpieces at rescheduling point or decision point t, where CRJ_i(t) represents the completion rate of workpiece J_i at rescheduling point or decision point t;
⑧ Standard deviation CRJ_std(t) of all workpiece completion rates at rescheduling point or decision point t;
⑨ Energy consumption index ECI(t) of the processes completed at rescheduling point or decision point t: ECI(t) is calculated from the minimum energy consumption required to complete a process at rescheduling point or decision point t, where the minimum is taken over the set of equipment within workshop f that can process the process O_ij, the maximum energy consumption required to complete the process at rescheduling point or decision point t, and an intermediate value of the energy consumption required to complete the process at rescheduling point or decision point t;
⑩ Simplified completion time RCTM_fk(t) of the last process processed on device k in workshop f at rescheduling point or decision point t: RCTM_fk(t) contains the state information of all devices and is calculated from the completion time of the last process processed on device k in workshop f at the rescheduling point or decision point and from T_cur, the average completion time of the last processes over all devices in all workshops;
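The following minimal Python sketch illustrates how a few of the 19 state features above could be computed at a rescheduling point; the data structures and names (jobs, util, n_ops, n_done, est_delay) are assumptions for illustration, not the patent's notation.

```python
import statistics

def state_features(jobs, util):
    """Compute a handful of the state features at a rescheduling/decision point.

    jobs: list of dicts with 'n_ops' (n_i), 'n_done' (NPO_i(t)), 'est_delay' (EDT_i(t))
    util: list of device utilizations UR_fk(t) over all shops
    """
    unfinished = [j for j in jobs if j['n_done'] < j['n_ops']]
    tard_e = (sum(1 for j in unfinished if j['est_delay'] > 0) / len(unfinished)
              if unfinished else 0.0)                        # predicted delay rate Tard_e(t)
    crj = [j['n_done'] / j['n_ops'] for j in jobs]           # per-workpiece completion rates
    return {
        'Tard_e': tard_e,
        'UR_ave': statistics.mean(util),                     # average device utilization
        'UR_std': statistics.pstdev(util),                   # its standard deviation
        'CRJ_ave': statistics.mean(crj),
        'CRJ_std': statistics.pstdev(crj),
    }

print(state_features(
    jobs=[{'n_ops': 4, 'n_done': 2, 'est_delay': 3.0},
          {'n_ops': 3, 'n_done': 3, 'est_delay': 0.0}],
    util=[0.8, 0.6, 0.7]))
```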
(2) Action space of the composite scheduling rules (for details of parameters reference is made to fig. 2 and 3):
Based on the state space of the low-carbon distributed flexible job shop, 7 workpiece selection rules and 6 device allocation rules are set, and a total of 42 composite scheduling rules is then obtained through their Cartesian product. The 10 rules with the best average results are selected as the action space of the composite scheduling rules (a code sketch of this construction is given after the rule lists below).
The work piece selection rules are as follows:
① Workpiece selection rule 1: at the rescheduling point or decision point t, if TardJ a (t) is not an empty set, selecting the largest EDT i(t)·Wi in the actual delayed workpiece set as the next scheduling procedure, wherein W i represents the weight of the workpiece J i, namely the machining emergency degree; if TardJ a (t) is empty, selecting the next scheduling procedure with the smallest average relaxation time ST i (t) in the unfinished workpiece set;
② Workpiece selection rule 2: at a rescheduling point or decision point t, if TardJ a (t) is not an empty set, selecting the largest EDT i(t)·Wi in the actual delay workpiece set as the next scheduling procedure; if TardJ a (t) is empty, selecting the minimum critical ratio CR (i) in the unfinished workpiece set as the next scheduling procedure;
③ Workpiece selection rule 3: based on T cur, sequencing the workpieces according to the expected weighted delay EDT i(t)·Wi, and selecting the process with the largest EDT i(t)·Wi value as the next scheduling process; if there are multiple identical values, randomly selecting one;
④ Workpiece selection rule 4: randomly selecting one workpiece from the unfinished workpiece set;
⑤ Workpiece selection rule 5: at rescheduling point or decision point t, if TardJ_a(t) is not an empty set, the workpiece with the largest weighted-delay critical ratio in the actually delayed workpiece set is selected as the next scheduling procedure; if TardJ_a(t) is empty, the workpiece with the smallest critical ratio in the unfinished workpiece set is selected as the next scheduling procedure;
⑥ Workpiece selection rule 6: selecting a workpiece with the lowest completion rate CRJ i (t) in the unfinished workpiece set at a rescheduling point or a decision point t;
⑦ Workpiece selection rule 7: at rescheduling point or decision point t, priorities are assigned according to the due dates of the workpieces; the earlier the due date, the higher the processing priority, and the workpiece with the earliest due date in the unfinished workpiece set is selected;
The device allocation rules are as follows:
① Device allocation rule 1: the earliest available device m_k is selected, taking into account the transportation time from the preceding process to device m_k in workshop f;
② Device allocation rule 2: the available device with the lowest energy consumption (transportation, processing and idle energy consumption) is selected;
③ Device allocation rule 3: the available device with the lowest device utilization is selected;
④ Device allocation rule 4: the available device with the shortest processing time is selected;
⑤ Device allocation rule 5: the available device on which the previous process finishes earliest is selected;
⑥ Device allocation rule 6: the available device with the fewest uses among all the processing procedures of the next round is selected;
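As referenced above, a minimal Python sketch of how the 42 composite rules could be enumerated and the action space selected follows; the rule labels and the avg_result mapping are hypothetical placeholders, not the patent's implementation.

```python
from itertools import product

workpiece_rules = [f"W{i}" for i in range(1, 8)]            # 7 workpiece selection rules
device_rules = [f"D{k}" for k in range(1, 7)]               # 6 device allocation rules
composite_rules = list(product(workpiece_rules, device_rules))
assert len(composite_rules) == 42                           # Cartesian product: 7 x 6

def select_action_space(avg_result, top_k=10):
    """Keep the top_k composite rules with the best (lowest) average objective value."""
    return sorted(composite_rules, key=lambda rule: avg_result[rule])[:top_k]
```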
(3) Reward function
The reward function evaluates whether the action selected by the network's policy is good by feeding the reward value back to the Rainbow agent. As described above, the dynamic multi-objective function of the low-carbon distributed flexible job shop is to minimize the total delay time and the total energy consumption. The reward function of the invention therefore comprises an instant reward function and a round reward function:
Reward=Rt+ER
① Instant reward function R_t: the instant reward function comprises an economic index, an energy consumption index and an equipment index;
The reward eco_t of the economic index is calculated by the basic formula:
eco_t = eco_tarda + eco_wtard + eco_tarde + eco_ur + eco_tardc
where eco_tarda gives a corresponding reward value according to the actual delay rate Tard_a(t) of the current state and the next state; eco_wtard gives a corresponding reward value according to the predicted weighted delay rate WTard_e(t) of the current state and the next state; eco_tarde gives a corresponding reward value according to the predicted delay rate Tard_e(t) of the current state and the next state; eco_ur gives a corresponding reward value according to the average device utilization UR_ave(t) of the current state and the next state; eco_tardc calculates a reward value from the minimum total delay time minTard and the current total delay time currentTard during training; t denotes the current rescheduling point or decision point, and t+1 denotes the next rescheduling point or decision point;
The reward ene_t of the energy consumption index is calculated by the basic formula:
ene_t = ene_ECI + ene_CE
where ene_ECI calculates its reward value according to the energy consumption index ECI(t) of the current state and the next state, and ene_CE is calculated from the minimum total energy consumption minEnergy and the current total energy consumption currentEnergy during training;
The instant reward of a rescheduling point or decision point is formed as the weighted sum of the economic index and the energy consumption index, with the parameter β ∈ [0,1] balancing the two:
R_t = β·eco_t + (1-β)·ene_t
The equipment index RCTM_fk(t) is also calculated: it gives a negative reward based on the simplified completion time of the last process on all devices and feeds this strongly correlated negative reward back to the agent, so that the agent converges faster and achieves a better convergence effect;
② Round prize function ER:
Round rewards are a negative value; after the agent selects actions and the scheduling process is completed in the shop environment, the total delay time CT_episode and the total energy consumption TE_episode are produced;
The larger the total delay time and total energy consumption, the greater the penalty the environment feeds back to the agent.
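A minimal Python sketch of the reward composition described above follows; the penalty weights w_delay and w_energy of the round reward are assumed values for illustration only and are not specified by the patent.

```python
def instant_reward(eco_t, ene_t, beta=0.5):
    """R_t = beta * eco_t + (1 - beta) * ene_t, with beta in [0, 1]."""
    return beta * eco_t + (1.0 - beta) * ene_t

def round_reward(total_delay, total_energy, w_delay=0.01, w_energy=0.001):
    """ER: always non-positive; larger totals mean a larger penalty."""
    return -(w_delay * total_delay + w_energy * total_energy)

# Reward fed back to the agent = instant reward at the decision point + end-of-round penalty
print(instant_reward(eco_t=1.2, ene_t=-0.4, beta=0.6) + round_reward(250.0, 9000.0))
```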
S3, a Rainbow DQN deep reinforcement learning algorithm is proposed to solve the dynamic multi-objective scheduling planning model of the low-carbon distributed flexible job shop; the Rainbow DQN deep reinforcement learning algorithm comprises a Rainbow agent and the scheduling environment of the low-carbon distributed flexible job shop. As shown in FIG. 1, the scheduling environment in the Rainbow DQN produces workpieces through cooperative production among different low-carbon flexible job shops. It contains multiple low-carbon flexible workshops at different geographic locations, which may contain different numbers and types of machines. All workpieces are distributed to the processing machines of the different low-carbon flexible job shops according to a predetermined or inherent sequence of operations. All the processes can be completed in the same workshop or transferred to different workshops. The Rainbow DQN takes the information of the workpieces and the equipment as input; the prediction network provides a predicted value for the state observation and transmits it to the Rainbow agent, which outputs the learned action. The scheduling environment then performs this action, and the experience is stored in the prioritized experience replay buffer and sampled for learning. The interaction between the Rainbow agent and the scheduling environment of the low-carbon distributed flexible job shop is a discrete-time Markov decision process model; in the interaction of the discrete-time Rainbow agent with the scheduling environment, at time t, the solving process is as follows:
(1) The Rainbow agent observes the state of the environment and obtains an observation s_t ∈ s, where s represents the state space set of the low-carbon distributed flexible job shop;
(2) The Rainbow agent selects an action a_t ∈ a according to the observation, where a is the action space set of the composite scheduling rules; as the number of iterations increases, the randomness of the Rainbow agent's action selection gradually decreases and the probability of selecting actions from the prioritized experience replay buffer gradually increases; the noisy network is resampled before each action selection (i.e. at the beginning of a round), scheduling is carried out with the fixed noisy network, and the noise of the neural network is updated when the scheduling finishes, so as to improve the action exploration capability of the deep reinforcement learning model. A standard linear layer of the neural network is expressed as:
y=ωx+b
where x is the input, ω is the weight matrix and b is the bias. The improved noisy linear layer is defined as:
y = (μ_ω + σ_ω ⊙ ε_ω)x + (μ_b + σ_b ⊙ ε_b)
where μ_ω and μ_b are the means that the parameters ω and b obey, σ_ω and σ_b represent the variances introduced by the noise, and ε is random noise of the same dimension; the noisy weights and noisy biases are denoted ω = μ_ω + σ_ω ⊙ ε_ω and b = μ_b + σ_b ⊙ ε_b, respectively.
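A PyTorch sketch of such a noisy linear layer is shown below for illustration; it uses independent Gaussian noise and assumed initialisation values, and the class name NoisyLinear is not taken from the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyLinear(nn.Module):
    """y = (mu_w + sigma_w * eps_w) x + (mu_b + sigma_b * eps_b)."""
    def __init__(self, in_features, out_features, sigma_init=0.017):
        super().__init__()
        self.mu_w = nn.Parameter(torch.empty(out_features, in_features).uniform_(-0.1, 0.1))
        self.sigma_w = nn.Parameter(torch.full((out_features, in_features), sigma_init))
        self.mu_b = nn.Parameter(torch.zeros(out_features))
        self.sigma_b = nn.Parameter(torch.full((out_features,), sigma_init))
        self.register_buffer("eps_w", torch.zeros(out_features, in_features))
        self.register_buffer("eps_b", torch.zeros(out_features))
        self.reset_noise()

    def reset_noise(self):
        # Resampled before each action selection, as described above
        self.eps_w.normal_()
        self.eps_b.normal_()

    def forward(self, x):
        return F.linear(x, self.mu_w + self.sigma_w * self.eps_w,
                        self.mu_b + self.sigma_b * self.eps_b)
```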
(3) The environment gives the value Reward of the reward function to the Rainbow agent according to the action it selected, and enters the next state s_t+1; after the Rainbow agent obtains the reward value, the experience tuple (s_t, a_t, Reward, s_t+1) of this round of scheduling is sampled and stored in the prioritized experience replay buffer;
The absolute value of the temporal-difference error is calculated from the Q values output by the evaluation network and the target network of the Rainbow DQN and is used to measure the learning priority. The larger the temporal-difference error, the more the sample needs to be learned, i.e. the higher its priority. The priority of an experience is proportional to its temporal-difference error; the experiences in the experience pool are ordered by the absolute value of the temporal-difference error, and the prioritized experience replay buffer replays high-error experiences more frequently.
p_t = |r_t+1 + γ_t+1 · Q(s_t+1, a_t+1; θ^-) - Q(s_t, a_t; θ)|^ω
where p_t denotes the priority of the experience, r_t+1 denotes the acquired single-step reward, γ_t+1 denotes the discount coefficient, s_t+1 denotes the state at the next moment, a_t+1 denotes the selected action, θ denotes the evaluation network parameters, θ^- denotes the target network parameters, and ω denotes the hyper-parameter determining the shape of the priority distribution.
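For illustration, the priority and sampling probability of prioritized experience replay can be sketched as follows; the function names and the example numbers are assumptions.

```python
def td_priority(r_next, gamma_next, q_target_next, q_eval_sa, omega=0.6):
    """p_t = |r_{t+1} + gamma_{t+1} * Q_target(s_{t+1}, a_{t+1}) - Q_eval(s_t, a_t)| ** omega."""
    return abs(r_next + gamma_next * q_target_next - q_eval_sa) ** omega

def sampling_probs(priorities):
    """Experiences with larger TD error are replayed more frequently."""
    total = sum(priorities)
    return [p / total for p in priorities]

print(sampling_probs([td_priority(1.0, 0.99, 2.0, 1.5),
                      td_priority(0.0, 0.99, 1.0, 1.0)]))
```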
(4) The Rainbow DQN is provided with an evaluation network and a target network, the two network structures are completely consistent, the parameter updating frequency of the evaluation network is 1 step, and the target network parameters are updated into the parameters of the evaluation network after 200 steps, namely the two network parameters are the same, so that the convergence of the scheduling result is realized.
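The discrete-time interaction of steps (1) to (4) can be sketched as the following Python loop; the env, agent and replay interfaces are assumed for illustration and do not reproduce the patent's implementation.

```python
def run_episode(env, agent, replay, target_sync=200):
    """One training round of the agent-environment interaction described above."""
    state = env.reset()                                   # (1) observe the shop state s_t
    done, step = False, 0
    while not done:
        action = agent.select_action(state)               # (2) choose a composite rule a_t
        next_state, reward, done = env.step(action)       # (3) reward and next state
        priority = agent.td_error(state, action, reward, next_state)
        replay.add((state, action, reward, next_state), priority)
        agent.learn(replay.sample())                      # evaluation network updated every step
        if step % target_sync == 0:
            agent.sync_target()                           # (4) target network copied every 200 steps
        state, step = next_state, step + 1
```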
In addition, the network structure of the Rainbow DQN fuses the network structures of Double DQN and Dueling DQN. Double DQN alleviates the overestimation problem of Q-learning by decoupling action selection from target Q-value calculation: the algorithm constructs two action-value functions, the agent determines the action through the evaluation network, and the target network is used to estimate the value of that action when computing the target.
Y_t = r_t+1 + γ · Q(s_t+1, argmax_a Q(s_t+1, a; θ); θ^-)
where Y_t denotes the target Q value, r_t+1 denotes the acquired single-step reward, γ denotes the discount coefficient, s_t+1 denotes the next-moment state, a denotes an action in the action space, θ denotes the evaluation network parameters, θ^- denotes the target network parameters, and Q(s_t+1, a; θ) denotes the Q value of action a in the next-moment state calculated by the evaluation network.
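A PyTorch sketch of the Double DQN target computation is given below for illustration; eval_net and target_net are assumed callables returning batched Q values.

```python
import torch

@torch.no_grad()
def double_dqn_target(reward, gamma, next_state, eval_net, target_net):
    """Y_t = r_{t+1} + gamma * Q_target(s_{t+1}, argmax_a Q_eval(s_{t+1}, a))."""
    best_action = eval_net(next_state).argmax(dim=1, keepdim=True)     # chosen by evaluation net
    q_next = target_net(next_state).gather(1, best_action).squeeze(1)  # valued by target net
    return reward + gamma * q_next
```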
Dueling DQN introduces two value-calculation branches, one predicting the state value and the other predicting the state-dependent action advantage values. The state value function only predicts whether a state is good or bad, while the action advantage function only predicts the importance of each action in that state.
Q(s_t, a_t; θ, α, β) = V(s_t; θ, β) + A(s_t, a_t; θ, α) - (1/|a|)·Σ_a' A(s_t, a'; θ, α)
where θ denotes the shared neural network parameters, α and β denote the network parameters of the action advantage function and the state value function respectively, V(s_t; θ, β) is the state value function and outputs a scalar, A(s_t, a_t; θ, α) is the action advantage function and outputs a vector whose length equals the size of the action space, and the subtracted term is the mean of all action advantage values in the current state, used to keep the two branches identifiable and thereby increase the output stability of the action advantage function and the state value function.
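A PyTorch sketch of the dueling aggregation is shown below; the layer sizes are placeholders and the class name DuelingHead is an assumption, not the patent's network.

```python
import torch.nn as nn

class DuelingHead(nn.Module):
    """Q(s,a) = V(s) + A(s,a) - mean_a' A(s,a')."""
    def __init__(self, hidden, n_actions):
        super().__init__()
        self.value = nn.Linear(hidden, 1)              # state value branch V
        self.advantage = nn.Linear(hidden, n_actions)  # action advantage branch A

    def forward(self, h):
        v = self.value(h)
        a = self.advantage(h)
        return v + a - a.mean(dim=1, keepdim=True)     # subtract the mean advantage
```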
The Rainbow DQN adopts the idea of multi-step reinforcement learning: N-step returns replace single-step returns, so that the target value in the early training stage is estimated more accurately and the training speed is increased. The loss function L_N-step is as follows:
L_N-step = ( Σ_{k=1}^{N} γ^{k-1}·r_t+k + γ^N·max_a' Q(s_t+N, a'; θ^-) - Q(s_t, a_t; θ) )²
where γ is the discount coefficient, r_t+k is the reward obtained by the (t+k)-th action, θ^- denotes the parameters of the target network, and the gradient of the loss is back-propagated only to the parameters θ of the online network.
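For illustration, the N-step target that enters the loss can be computed as in the following sketch; the argument names are assumptions.

```python
def n_step_target(rewards, gamma, bootstrap_q):
    """sum_{k=1..N} gamma^(k-1) * r_{t+k}  +  gamma^N * max_a' Q_target(s_{t+N}, a')."""
    g = sum((gamma ** k) * r for k, r in enumerate(rewards))   # rewards = [r_{t+1}, ..., r_{t+N}]
    return g + (gamma ** len(rewards)) * bootstrap_q

print(n_step_target([1.0, 0.5, 0.2], gamma=0.99, bootstrap_q=3.0))
```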
The training curves of the total delay time and the total energy consumption are shown in FIG. 4 and FIG. 5 respectively, where the light area represents the upper and lower bounds of the optimal objective value over multiple training runs and the dark curve represents the average over the runs. It can be seen that in the initial stage of training the scheduling strategy, action selection tends toward exploration, the two optimization objective values stay at a high level, and most of the selected action strategies cannot complete normal scheduling tasks. However, as the number of training episodes increases, erroneous action selections are gradually replaced by excellent actions; the final total delay time falls to about 250 and the total energy consumption to about 9000. The experimental results show that, under changing machine states, the Rainbow agent can select excellent actions at a rescheduling point or decision point through self-learning so as to optimize the total delay time and total energy consumption objective values. This verifies the feasibility and effectiveness of the method and the model in solving the dynamic multi-objective problem of the low-carbon distributed flexible job shop.
Other matters not described in detail are known in the prior art.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention are included in the protection scope of the present invention.

Claims (2)

1. The dynamic multi-target scheduling method for the low-carbon distributed flexible job shop is characterized by comprising the following steps of:
S1: establishing the multi-objective function and a series of constraint conditions of the dynamic multi-objective scheduling planning model of the low-carbon distributed flexible job shop according to its low-carbon job scheduling requirements;
(1) The multi-objective function includes a total delay time function and a total energy consumption function:
The total delay time function TT is calculated from the cut-off time D_i and the completion time CT_i of each workpiece J_i, where N is the total number of workpieces:
TT = Σ_{i=1}^{N} max(CT_i - D_i, 0)
The total energy consumption function TE is calculated from the equipment processing energy consumption, the equipment idle energy consumption and the workpiece transportation energy consumption;
TE=procE+idleE+transE
where procE denotes the equipment processing energy consumption, idleE denotes the equipment idle energy consumption, and transE denotes the workpiece transportation energy consumption;
procE is represented as the sum, over all workpieces J_i, processes O_ij, workshops f and devices k, of x_fk,ij · pe_fk · pt_fk,ij, where pt_fk,ij represents the processing time of process O_ij on device k in workshop f, pe_fk represents the unit processing energy consumption of device k in workshop f, x_fk,ij is a 0-1 decision variable indicating whether process O_ij is performed on device k in workshop f, F is the total number of workshops, M_f is the total number of devices in workshop f, and n_i is the total number of processes of workpiece J_i;
idleE is represented as the sum, over all devices and over all pairs of consecutive processes on a device, of ie_fk · y_ij,gh · x_fk,gh · (S_g,h - C_i,j), where ie_fk denotes the unit idle energy consumption of device k in workshop f, S_g,h denotes the start time of process O_gh, C_i,j denotes the end time of process O_ij, y_ij,gh is a 0-1 decision variable determining whether the successor process of O_ij is O_gh, x_fk,gh is a 0-1 decision variable indicating whether process O_gh is performed on device k in workshop f, n_i is the total number of processes of workpiece J_i, and n_g is the total number of processes of workpiece J_g;
transE is represented as the sum, over all workpiece movements, of te multiplied by the corresponding transportation time and the corresponding 0-1 decision variable, where te denotes the unit transportation energy consumption between workshops/devices, transF_fu denotes the transportation time of a workpiece from workshop f to workshop u, transM_lk denotes the transportation time of a workpiece from device l to device k, one 0-1 decision variable determines whether a workpiece is transported from workshop f to workshop u, and another 0-1 decision variable determines whether a workpiece is transported from device l to device k within the same workshop;
(2) The series of constraints include:
① A process can only be performed on one piece of equipment in one workshop;
② Each process can be processed only after its workpiece has arrived, where A_i represents the arrival time of workpiece J_i;
③ The completion time C_i,j of process O_ij must equal its start time S_i,j plus its processing time;
④ The processes of each workpiece must follow the given precedence order;
⑤ Processes of different workpieces on the same device must be performed sequentially;
⑥ Inter-workshop and inter-device transportation times are not considered simultaneously: when the inter-workshop transportation time is considered, the inter-device transportation time is ignored;
S2: constructing a state space of a low-carbon distributed flexible job shop, designing an action space of a composite scheduling rule, and providing an instant rewarding function and a round rewarding function;
(1) State space of low-carbon distributed flexible job shop
① Predicted delay rate Tard_e(t):
Tard_e(t) = |TardJ_e(t)| / |UcompJ(t)|
where TardJ_e(t) represents the set of workpieces predicted to be delayed at rescheduling point or decision point t, UcompJ(t) represents the set of workpieces whose processing is not completed at rescheduling point or decision point t, the conditions NPO_i(t) < n_i and EDT_i(t) > 0 determine whether a workpiece is predicted to be delayed at rescheduling point or decision point t, NPO_i(t) represents the number of processes of workpiece J_i completed at rescheduling point or decision point t, and EDT_i(t) represents the predicted delay time of workpiece J_i at rescheduling point or decision point t;
② Actual delay rate Tard_a(t):
Tard_a(t) = |TardJ_a(t)| / |UcompJ(t)|
where TardJ_a(t) represents the set of workpieces actually delayed at rescheduling point or decision point t; a workpiece is judged to be actually delayed at the rescheduling point or decision point when NPO_i(t) < n_i and the completion time of the machined processes of workpiece J_i already exceeds its cut-off time D_i;
③ Predicted weighted delay rate WTard_e(t): the predicted delay of each workpiece is weighted by W_i, the weight of workpiece J_i, i.e. its machining urgency, and the rate is computed over the unfinished workpieces; the predicted time required to machine the remaining processes of workpiece J_i at rescheduling point or decision point t is calculated from the average processing time of each remaining process O_ij over all available equipment in all workshops; the remaining estimated transit time of workpiece J_i at the rescheduling point or decision point is calculated from the average transportation time from the equipment where workpiece J_i completed its last process to the equipment where the next process is processed, with F_i,j denoting the set of workshops in which process O_ij can be processed;
④ Average utilization UR_ave(t) of all devices in all workshops at rescheduling point or decision point t: UR_ave(t) is the mean of the device utilizations UR_fk(t), where UR_fk(t) represents the utilization of device k in workshop f at rescheduling point or decision point t, calculated from the completion time of the last process on device k in workshop f;
⑤ Standard deviation UR_std(t) of the device utilizations at rescheduling point or decision point t;
⑥ Average completion rate CRO_ave(t) of all processes at rescheduling point or decision point t;
⑦ Average completion rate CRJ_ave(t) of all workpieces at rescheduling point or decision point t, where CRJ_i(t) represents the completion rate of workpiece J_i at rescheduling point or decision point t;
⑧ Standard deviation CRJ_std(t) of all workpiece completion rates at rescheduling point or decision point t;
⑨ Energy consumption index ECI(t) of the processes completed at rescheduling point or decision point t: ECI(t) is calculated from the minimum energy consumption required to complete a process at rescheduling point or decision point t, where the minimum is taken over the set of equipment within the workshop that can process the process O_ij, the maximum energy consumption required to complete the process at rescheduling point or decision point t, and an intermediate value of the energy consumption required to complete the process at rescheduling point or decision point t;
⑩ Simplified completion time RCTM_fk(t) of the last process processed on device k in workshop f at rescheduling point or decision point t: RCTM_fk(t) contains the state information of all devices and is calculated from the completion time of the last process processed on device k in workshop f at the rescheduling point or decision point and from T_cur, the average completion time of the last processes over all devices in all workshops;
(2) Action space of composite scheduling rule
The workpiece selection rules are as follows:
① Workpiece selection rule 1: at rescheduling point or decision point t, if TardJ_a(t) is not an empty set, the workpiece with the largest EDT_i(t)·W_i in the actually delayed workpiece set is selected as the next scheduling procedure, where W_i represents the weight of workpiece J_i, i.e. its machining urgency; if TardJ_a(t) is empty, the workpiece with the smallest average slack time ST_i(t) in the unfinished workpiece set is selected as the next scheduling procedure;
② Workpiece selection rule 2: at a rescheduling point or decision point t, if TardJ a (t) is not an empty set, selecting the largest EDT i(t)·Wi in the actual delay workpiece set as the next scheduling procedure; if TardJ a (t) is empty, selecting the minimum critical ratio CR (i) in the unfinished workpiece set as the next scheduling procedure;
③ Workpiece selection rule 3: based on T cur, sequencing the workpieces according to the expected weighted delay EDT i(t)·Wi, and selecting the process with the largest EDT i(t)·Wi value as the next scheduling process; if there are multiple identical values, randomly selecting one;
④ Workpiece selection rule 4: randomly selecting one workpiece from the unfinished workpiece set;
⑤ Workpiece selection rule 5: at rescheduling point or decision point t, if TardJ_a(t) is not an empty set, the workpiece with the largest weighted-delay critical ratio in the actually delayed workpiece set is selected as the next scheduling procedure; if TardJ_a(t) is empty, the workpiece with the smallest critical ratio in the unfinished workpiece set is selected as the next scheduling procedure;
⑥ Workpiece selection rule 6: selecting a workpiece with the lowest completion rate CRJ i (t) in the unfinished workpiece set at a rescheduling point or a decision point t;
⑦ Workpiece selection rule 7: at rescheduling point or decision point t, priorities are assigned according to the due dates of the workpieces; the earlier the due date, the higher the processing priority, and the workpiece with the earliest due date in the unfinished workpiece set is selected;
The device allocation rules are as follows:
① Device allocation rule 1: the earliest available device m_k is selected, taking into account the transportation time from the preceding process to device m_k in workshop f;
② Device allocation rule 2: the available device with the lowest energy consumption is selected;
③ Device allocation rule 3: the available device with the lowest device utilization is selected;
④ Device allocation rule 4: the available device with the shortest processing time is selected;
⑤ Device allocation rule 5: the available device on which the previous process finishes earliest is selected;
⑥ Device allocation rule 6: the available device with the fewest uses among all the processing procedures of the next round is selected;
(3) Reward function
The reward functions include an instant reward function and a round reward function;
Reward = R_t + ER
① Instant reward function R_t: the instant reward function comprises an economic index, an energy consumption index and an equipment index;
The reward eco_t of the economic index is calculated by the basic formula:
eco_t = eco_tarda + eco_wtard + eco_tarde + eco_ur + eco_tardc
where eco_tarda gives a corresponding reward value according to the actual delay rate Tard_a(t) of the current state and the next state; eco_wtard gives a corresponding reward value according to the predicted weighted delay rate WTard_e(t) of the current state and the next state; eco_tarde gives a corresponding reward value according to the predicted delay rate Tard_e(t) of the current state and the next state; eco_ur gives a corresponding reward value according to the average device utilization UR_ave(t) of the current state and the next state; eco_tardc calculates a reward value from the minimum total delay time minTard and the current total delay time currentTard during training; t denotes the current rescheduling point or decision point, and t+1 denotes the next rescheduling point or decision point;
The reward ene_t of the energy consumption index is calculated by the basic formula:
ene_t = ene_ECI + ene_CE
where ene_ECI calculates its reward value according to the energy consumption index ECI(t) of the current state and the next state, and ene_CE is calculated from the minimum total energy consumption minEnergy and the current total energy consumption currentEnergy during training;
The instant reward of a rescheduling point or decision point is formed as the weighted sum of the economic index and the energy consumption index, with the parameter β ∈ [0,1] balancing the two:
R_t = β·eco_t + (1-β)·ene_t
The equipment index RCTM_fk(t) is also calculated: it gives a negative reward based on the simplified completion time of the last process on all devices and feeds this strongly correlated negative reward back to the agent, so that the agent converges faster and achieves a better convergence effect;
② Round prize function ER:
Round rewards are a negative value; after the agent selects actions and the scheduling process is completed in the shop environment, the total delay time CT_episode and the total energy consumption TE_episode are produced;
The larger the total delay time and total energy consumption, the greater the penalty the environment feeds back to the agent;
S3: and a Rainbow DQN deep reinforcement learning algorithm is provided, and a dynamic multi-target scheduling planning model of the low-carbon distributed flexible job shop is solved.
2. The dynamic multi-objective scheduling method for a low-carbon distributed flexible job shop according to claim 1, wherein in step S3 a Rainbow DQN deep reinforcement learning algorithm is proposed to solve the dynamic multi-objective scheduling planning model of the low-carbon distributed flexible job shop; the Rainbow DQN deep reinforcement learning algorithm comprises a Rainbow agent and the scheduling environment of the low-carbon distributed flexible job shop; the interaction between the Rainbow agent and the scheduling environment of the low-carbon distributed flexible job shop is a discrete-time Markov decision process model; in the interaction of the discrete-time Rainbow agent with the scheduling environment, at time t, the solving process is as follows:
(1) The Rainbow agent observes the state of the environment and obtains an observation s_t ∈ s, where s represents the state space set of the low-carbon distributed flexible job shop;
(2) The Rainbow agent selects an action a_t ∈ a according to the observation, where a is the action space set of the composite scheduling rules; as the number of iterations increases, the randomness of the Rainbow agent's action selection gradually decreases and the probability of selecting actions from the prioritized experience replay buffer gradually increases;
(3) The environment gives the value Reward of the reward function to the Rainbow agent according to the action it selected, and enters the next state s_t+1; after the Rainbow agent obtains the reward value, the experience tuple (s_t, a_t, Reward, s_t+1) of this round of scheduling is sampled and stored in the prioritized experience replay buffer;
(4) The Rainbow DQN is provided with an evaluation network and a target network, the two network structures are completely consistent, the parameter updating frequency of the evaluation network is 1 step, and the target network parameters are updated into the parameters of the evaluation network after 200 steps, namely the two network parameters are the same, so that the convergence of the scheduling result is realized.
CN202310494027.XA 2023-05-05 2023-05-05 Dynamic multi-target scheduling method for low-carbon distributed flexible job shop Active CN116500994B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310494027.XA CN116500994B (en) 2023-05-05 2023-05-05 Dynamic multi-target scheduling method for low-carbon distributed flexible job shop

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310494027.XA CN116500994B (en) 2023-05-05 2023-05-05 Dynamic multi-target scheduling method for low-carbon distributed flexible job shop

Publications (2)

Publication Number Publication Date
CN116500994A CN116500994A (en) 2023-07-28
CN116500994B true CN116500994B (en) 2024-05-03

Family

ID=87322721

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310494027.XA Active CN116500994B (en) 2023-05-05 2023-05-05 Dynamic multi-target scheduling method for low-carbon distributed flexible job shop

Country Status (1)

Country Link
CN (1) CN116500994B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112149987A (en) * 2020-09-17 2020-12-29 清华大学 Multi-target flexible job shop scheduling method and device based on deep reinforcement learning
US11144847B1 (en) * 2021-04-15 2021-10-12 Latent Strategies LLC Reinforcement learning using obfuscated environment models
US11403538B1 (en) * 2020-11-05 2022-08-02 Arthur AI, Inc. Methods and apparatus for generating fast counterfactual explanations for black-box models using reinforcement learning
WO2022206265A1 (en) * 2021-04-02 2022-10-06 河海大学 Method for parameter calibration of hydrological forecasting model based on deep reinforcement learning
CN115454005A (en) * 2022-09-29 2022-12-09 河海大学常州校区 Manufacturing workshop dynamic intelligent scheduling method and device oriented to limited transportation resource scene
CN115640898A (en) * 2022-10-27 2023-01-24 西南交通大学 Large-scale flexible job shop scheduling method based on DDQN algorithm

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10606898B2 (en) * 2017-04-19 2020-03-31 Brown University Interpreting human-robot instructions
KR102251316B1 (en) * 2019-06-17 2021-05-12 (주)브이엠에스 솔루션스 Reinforcement learning and simulation based dispatching method within a factory, and an apparatus thereof

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112149987A (en) * 2020-09-17 2020-12-29 清华大学 Multi-target flexible job shop scheduling method and device based on deep reinforcement learning
US11403538B1 (en) * 2020-11-05 2022-08-02 Arthur AI, Inc. Methods and apparatus for generating fast counterfactual explanations for black-box models using reinforcement learning
WO2022206265A1 (en) * 2021-04-02 2022-10-06 河海大学 Method for parameter calibration of hydrological forecasting model based on deep reinforcement learning
US11144847B1 (en) * 2021-04-15 2021-10-12 Latent Strategies LLC Reinforcement learning using obfuscated environment models
CN115454005A (en) * 2022-09-29 2022-12-09 河海大学常州校区 Manufacturing workshop dynamic intelligent scheduling method and device oriented to limited transportation resource scene
CN115640898A (en) * 2022-10-27 2023-01-24 西南交通大学 Large-scale flexible job shop scheduling method based on DDQN algorithm

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Multi-resource scheduling in a discrete manufacturing workshop considering material handling; 肖蒙; China Master's Theses Full-text Database (Electronic Journal), Engineering Science and Technology II; 2023-01-31; pp. C029-765 *
Research on key technologies of a manufacturing execution system for large petroleum equipment; 白凯建; China Master's Theses Full-text Database (Electronic Journal), Engineering Science and Technology I; 2022-03-31; pp. B019-646 *

Also Published As

Publication number Publication date
CN116500994A (en) 2023-07-28

Similar Documents

Publication Publication Date Title
CN112734172B (en) Hybrid flow shop scheduling method based on time sequence difference
CN112884239B (en) Space detonator production scheduling method based on deep reinforcement learning
CN110609531B (en) Workshop scheduling method based on digital twin
CN112149987A (en) Multi-target flexible job shop scheduling method and device based on deep reinforcement learning
CN113792924A (en) Single-piece job shop scheduling method based on Deep reinforcement learning of Deep Q-network
CN115454005A (en) Manufacturing workshop dynamic intelligent scheduling method and device oriented to limited transportation resource scene
CN116542445A (en) Intelligent scheduling method and system for equipment manufacturing workshop based on deep reinforcement learning
CN111160755B (en) Real-time scheduling method for aircraft overhaul workshop based on DQN
CN111353646B (en) Steelmaking flexible scheduling optimization method, system, medium and equipment with switching time
CN114707881A (en) Job shop adaptive scheduling method based on deep reinforcement learning
CN110264079A (en) Hot-rolled product qualitative forecasting method based on CNN algorithm and Lasso regression model
CN113406939A (en) Unrelated parallel machine dynamic hybrid flow shop scheduling method based on deep Q network
CN114565247A (en) Workshop scheduling method, device and system based on deep reinforcement learning
CN107357267B (en) The method for solving mixed production line scheduling problem based on discrete flower pollination algorithm
Li et al. A novel milling parameter optimization method based on improved deep reinforcement learning considering machining cost
Hosseinian et al. An energy-efficient mathematical model for the resource-constrained project scheduling problem: an evolutionary algorithm
Zhao et al. A drl-based reactive scheduling policy for flexible job shops with random job arrivals
CN116500994B (en) Dynamic multi-target scheduling method for low-carbon distributed flexible job shop
CN112686693A (en) Method, system, equipment and storage medium for predicting marginal electricity price of electric power spot market
CN116644902A (en) Multi-target dynamic flexible job shop scheduling method related to energy consumption based on deep reinforcement learning
Yuan et al. A multi-agent double Deep-Q-network based on state machine and event stream for flexible job shop scheduling problem
CN117314055A (en) Intelligent manufacturing workshop production-transportation joint scheduling method based on reinforcement learning
CN115860435A (en) Power equipment preventive maintenance dynamic flexible scheduling method and system with AGV
CN114219274A (en) Workshop scheduling method adapting to machine state based on deep reinforcement learning
CN114004065A (en) Transformer substation engineering multi-objective optimization method based on intelligent algorithm and environmental constraints

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant