CN114154821A - Intelligent scheduling dynamic scheduling method based on deep reinforcement learning - Google Patents

Intelligent scheduling dynamic scheduling method based on deep reinforcement learning

Info

Publication number
CN114154821A
CN114154821A CN202111390067.7A CN202111390067A
Authority
CN
China
Prior art keywords
time
intelligent
reinforcement learning
production line
deep reinforcement
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111390067.7A
Other languages
Chinese (zh)
Inventor
宇文东方
万光华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Shenfuzhi Technology Co ltd
Original Assignee
Xiamen Shenfuzhi Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen Shenfuzhi Technology Co ltd filed Critical Xiamen Shenfuzhi Technology Co ltd
Priority to CN202111390067.7A priority Critical patent/CN114154821A/en
Publication of CN114154821A publication Critical patent/CN114154821A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0631 Resource planning, allocation, distributing or scheduling for enterprises or organisations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00 Computer-aided design [CAD]
    • G06F30/20 Design optimisation, verification or simulation
    • G06F30/27 Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0631 Resource planning, allocation, distributing or scheduling for enterprises or organisations
    • G06Q10/06316 Sequencing of tasks or work
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/04 Manufacturing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2111/00 Details relating to CAD techniques
    • G06F2111/04 Constraint-based CAD
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Landscapes

  • Business, Economics & Management (AREA)
  • Engineering & Computer Science (AREA)
  • Human Resources & Organizations (AREA)
  • Theoretical Computer Science (AREA)
  • Economics (AREA)
  • Strategic Management (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Tourism & Hospitality (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Quality & Reliability (AREA)
  • Operations Research (AREA)
  • Educational Administration (AREA)
  • Development Economics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Manufacturing & Machinery (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • Geometry (AREA)
  • General Engineering & Computer Science (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to the technical field of intelligent scheduling and discloses an intelligent scheduling dynamic scheduling method based on deep reinforcement learning, which comprises the following steps: 1) reading information; 2) processing data; 3) building a deep reinforcement learning framework; 4) considering the starting time and ending time of each process; 5) splitting the order ending time into the individual processes. The deep reinforcement learning framework uses an Asynchronous Advantage Actor Critic (A3C) model, which requires not only that the reward value be maximized but also that the entropy of each selected action be maximized; this randomizes the strategy so that the output probability of each action is dispersed as much as possible rather than concentrated on one action. With the A3C deep learning framework, the solving speed is high, which can support a factory's requirement of running intelligent scheduling twice a day.

Description

Intelligent scheduling dynamic scheduling method based on deep reinforcement learning
Technical Field
The invention relates to the technical field of intelligent scheduling, in particular to an intelligent scheduling dynamic scheduling method based on deep reinforcement learning.
Background
In the prior art, intelligent scheduling dynamic scheduling methods are mostly based on exact optimization methods and approximation/heuristic algorithms. In recent years, many scholars have also begun to use deep reinforcement learning to solve various dynamic scheduling problems, including the intelligent production dynamic scheduling problem. The exact optimization methods mainly include mixed integer linear programming (MILP), branch-and-bound, Lagrangian relaxation, and the like; approximation/heuristic methods were originally introduced because of their small computational cost and ease of implementation, and mainly include priority dispatching rules (PDR), neural networks (NN), and neighborhood search (NS), where neighborhood search covers approximate optimization methods that can be called meta-heuristics, such as tabu search (TS), genetic algorithms (GA), and simulated annealing (SA). The exact optimization methods are mainly limited by problem scale: since an n×m intelligent scheduling problem has up to (n!)^m possible solutions, computing an exact solution is computationally infeasible for large-scale problems.
At present, research on deep reinforcement learning (DRL) models for the intelligent scheduling dynamic scheduling problem has developed rapidly, and deep reinforcement learning is widely applied to solving various dynamic scheduling problems. Compared with traditional heuristic priority dispatching rules, such models are more flexible: a reinforcement learning environment can model stochastic decisions and flexible problems, such as non-deterministic operation re-entry, serial-parallel ordering among processes, alternative production lines for a process, and alternative machines on a production line. However, most of these methods are still at the theoretical research stage; they cannot model the complex constraints of real factory requirements, and cannot provide an intelligent production scheduling dynamic scheduling method that meets real factory requirements in the face of random machine shutdowns, stochastic processing times, order deadlines, and the like. In addition, real factory requirements generally call for Advanced Planning and Scheduling (APS), in which short-term plans and medium-term plans are modeled separately while ensuring both the accuracy of the short-term plan and the fast solution of the long-term plan, which is also a field that current mainstream deep reinforcement learning models cannot cover.
Therefore, an intelligent scheduling dynamic scheduling method based on deep reinforcement learning is provided.
Disclosure of Invention
The invention aims to provide an intelligent scheduling dynamic scheduling method based on deep reinforcement learning, so as to address the difficulty, noted in the background art, of achieving real-time, autonomous and unmanned intelligent scheduling in a factory.
In order to achieve this purpose, the invention provides the following technical scheme: an intelligent scheduling dynamic scheduling method based on deep reinforcement learning, comprising the following steps:
S1: reading the orders received by the factory at the current moment, the material quantities, the worker shift calendars, and the production calendars of the production lines;
S2: processing the read raw data, and distinguishing short-term plans from long-term plans according to the order delivery dates and the availability of the required materials;
S3: building a deep reinforcement learning framework, and inputting and training production line, process and capacity feature vectors to obtain the target policy network of the target intelligent agent;
S4: considering the starting time and ending time of each process and the time calendar of each production line and machine;
S5: splitting the order ending time into the individual processes.
Further, in S1, the order data includes the quantity of each required product and the product delivery deadline; each product needs to go through several processes, the processes have a certain serial-parallel order, switching machines or materials on a production line requires a certain equipment changeover time, and there is usually a minimum waiting time or maximum waiting time constraint between serial processes.
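By way of a non-limiting illustration only (the field names below are assumptions, not part of the disclosure), the order and process data read in S1 could be organized roughly as follows:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Process:
    process_id: str
    duration_min: int                                        # processing time in minutes
    predecessors: List[str] = field(default_factory=list)    # serial-parallel order among processes
    changeover_min: int = 0                                   # equipment changeover time when switching machine/material
    min_wait_min: Optional[int] = None                        # minimum waiting time after the preceding process
    max_wait_min: Optional[int] = None                        # maximum waiting time after the preceding process

@dataclass
class Order:
    order_id: str
    product: str
    quantity: int
    due_time_min: int                                         # product delivery deadline (minutes from schedule start)
    urgent: bool = False                                      # static attribute used later when designing the reward
    processes: List[Process] = field(default_factory=list)
```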
Further, in S2, the short-term plan requires fine scheduling in units of minutes, and all processes of this part of the orders are fully assigned to production lines; the orders of the long-term plan only need an evaluation of resources such as material quantities, production lines, machines and capacity, with an early warning given when a resource bottleneck exists and a rough scheduling result in units of days provided when there is no bottleneck. The numbers of workers, machines and production lines at each time node are then calculated, and a resource time axis in units of minutes is generated by combining the worker shift calendars and the production-line production calendars.
Further, in S3, an Asynchronous Advantage Actor Critic (A3C) model is used; in addition to requiring the reward value to be maximized, the entropy of each selected action is also required to be maximized. This randomizes the strategy so that the output probability of each action is dispersed as much as possible rather than concentrated on one action.
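As an illustrative sketch only (PyTorch is assumed; the network sizes, coefficient values and variable names are not specified by the disclosure), an entropy-regularized actor-critic objective of the kind described above could look roughly like this:

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.policy_head = nn.Linear(hidden, n_actions)   # action logits
        self.value_head = nn.Linear(hidden, 1)             # state value

    def forward(self, state):
        h = self.shared(state)
        return self.policy_head(h), self.value_head(h)

def a3c_loss(model, state, action, ret, entropy_coef=0.01, value_coef=0.5):
    """Policy-gradient loss with a value (critic) term and an entropy bonus
    that keeps the action probabilities dispersed rather than concentrated."""
    logits, value = model(state)
    dist = torch.distributions.Categorical(logits=logits)
    advantage = ret - value.squeeze(-1)
    policy_loss = -(dist.log_prob(action) * advantage.detach()).mean()
    value_loss = advantage.pow(2).mean()          # squared error between return and value
    entropy = dist.entropy().mean()               # maximized entropy of the selected actions
    return policy_loss + value_coef * value_loss - entropy_coef * entropy

# example: batch of 4 states with 10 features and 5 candidate actions
model = ActorCritic(state_dim=10, n_actions=5)
loss = a3c_loss(model, torch.randn(4, 10), torch.randint(0, 5, (4,)), torch.randn(4))
loss.backward()
```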
Further, in S3, the scheduling target of the deep neural network in the target policy network at the current moment is obtained: the production line, process and capacity state feature vectors are processed and then fed into a classification function to obtain the selection probability corresponding to each optimization target.
Further, for the constraints described in S1, the earliest start time and the expected end time of each process are introduced, and all constraints are converted into time axes on the processes and production lines for unified control and updating.
Further, S4 comprises the following steps (an illustrative sketch follows the list):
S41: first, for processes with preceding-process requirements, the start time is initialized to a large value;
S42: when all preceding processes are completed, the start time is updated to the maximum ending time of all the preceding processes;
S43: second, for cases with a minimum waiting time or maximum waiting time constraint, after the preceding process finishes, the start time is updated to the ending time of the preceding process plus the minimum/maximum waiting time.
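A minimal sketch of the time-axis update rules S41-S43, assuming each process knows how many predecessors it has and which waiting-time constraints apply (all names are illustrative, not from the disclosure):

```python
INF = 10**9  # "a large value" used to initialize constrained start times (S41)

def earliest_start(n_predecessors, pred_end_times, min_wait=None):
    """Earliest start time of a process, following S41/S42/S43 (illustrative only)."""
    if n_predecessors == 0:
        return 0                              # no preceding process: may start at once
    if len(pred_end_times) < n_predecessors:
        return INF                            # S41: some preceding process not finished yet
    start = max(pred_end_times)               # S42: maximum ending time of all preceding processes
    if min_wait is not None:
        start += min_wait                     # S43: add the minimum waiting time
    return start

def latest_start(pred_end_times, max_wait):
    """Latest allowed start when a maximum waiting time constraint exists (S43)."""
    return max(pred_end_times) + max_wait

# example: two predecessors ending at minute 120 and 150, minimum wait of 10 minutes
print(earliest_start(2, [120, 150], min_wait=10))   # -> 160
```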
Further, in S5, since the deep reinforcement learning model needs to update the reward value repeatedly, each order should be completed before its delivery date as far as possible, and the importance of the order must also be taken into account, the total available time is divided into an available time for each process according to a certain rule, and the reward function is designed according to static attributes such as whether the order is urgent.
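The splitting of an order's deadline into per-process due times and an urgency-weighted reward could be sketched as follows; the proportional-to-duration rule and the weights are assumptions, since the disclosure only says "according to a certain rule":

```python
def split_due_times(order_due, durations):
    """Split the total available time into a due time for each process,
    here proportionally to processing time (one possible 'certain rule')."""
    total = sum(durations)
    due_times, elapsed = [], 0
    for d in durations:
        elapsed += order_due * d / total
        due_times.append(elapsed)
    return due_times

def step_reward(finish_time, process_due, urgent, late_penalty=1.0, urgent_factor=2.0):
    """Penalize finishing a process after its split due time, weighting
    urgent orders more heavily (illustrative static-attribute design)."""
    lateness = max(0.0, finish_time - process_due)
    weight = urgent_factor if urgent else 1.0
    return -weight * late_penalty * lateness

# example: a 480-minute order with three processes of 60, 120 and 60 minutes
print(split_due_times(480, [60, 120, 60]))   # -> [120.0, 360.0, 480.0]
```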
Further, the offline training comprises the following steps:
S01: treating each production line as an intelligent agent and generating its target policy network;
S02: updating the reward function network used for the reward value;
S03: storing the state feature vector of each intermediate state, and initializing the parameters of each network.
In each training cycle, a new training environment is randomly generated and A3C is used to pre-train all intelligent agents offline. An optimal process-production line allocation scheme is generated from the target policy network of each agent; for each production line, target decision states such as the latest end time, the idle-time ratio, and whether each process's end time is later than expected are considered to generate the reward function network; the target state value network and the state feature vector of the target agent are updated through a mean squared error (MSE) loss function; and this process continues until the allocation schemes of all processes finally meet the use requirements.
Further, in the intelligent scheduling process, the scheduling feature vector of the production lines at the current moment is read first, and the currently executable process vectors are screened according to the preceding processes and material availability. The production line and process vectors are then taken as input and trained in the deep reinforcement learning agent network to obtain the process-production line assignment rule at the current moment. Next, it is judged whether all processes have been allocated to a production line: if not, the time is updated according to the time-axis movement rule, the agent reward network is updated according to the reward value, and the production lines and processes are updated according to the completed tasks before entering a new process-production line assignment round; if so, it is further judged whether the maximum number of iterations has been reached or the objective function has converged: if yes, the deep reinforcement learning intelligent scheduling result is output; if not, the production line scheduling feature vector is read again, until the deep reinforcement learning intelligent scheduling result is output.
Compared with the prior art, the invention has the beneficial effects that:
1. according to the intelligent scheduling method based on the deep reinforcement learning, a deep reinforcement learning frame is built, an Asynchronous Advantage Actor critical (A3C) model is used, the maximum reward value is required, the maximum entropy output by selecting an action each time is also required, the strategy is randomized through the method, the probability of each output action is dispersed as far as possible instead of being concentrated on one action, the solving speed is high by using the deep learning frame of A3C, and the use requirement that a factory does intelligent scheduling twice a day can be supported.
2. The intelligent scheduling dynamic scheduling method based on deep reinforcement learning considers the starting time and ending time of each process and the time calendar of each production line and machine, introduces the earliest start time and expected end time of each process, and converts all constraints into time axes on the processes and production lines for unified control and updating. First, for processes with preceding-process requirements, the start time is initialized to a large value, and when all preceding processes are completed, the start time is updated to the maximum ending time of all the preceding processes; second, for cases with a minimum or maximum waiting time constraint, after the preceding process finishes, the start time is updated to the ending time of the preceding process plus the minimum/maximum waiting time. In this way the real requirements of a factory can be fully considered: there is a certain serial-parallel order among processes, switching machines or materials on a production line requires a certain equipment changeover time, serial processes usually have minimum or maximum waiting time constraints, and each order has a deadline and a priority.
3. The intelligent scheduling dynamic scheduling method based on deep reinforcement learning processes the read raw data, first distinguishing short-term plans from long-term plans according to the order delivery dates and the availability of required materials. The short-term plan requires fine scheduling in units of minutes, and all processes of this part of the orders are fully assigned to production lines; the orders of the long-term plan only need an evaluation of resources such as material quantities, production lines, machines and capacity, with an early warning given when a resource bottleneck exists and a rough scheduling result in units of days provided otherwise. The numbers of workers, machines and production lines at each time node are then calculated, and a resource time axis in units of minutes is generated by combining the worker shift calendars and the production-line production calendars. Orders can thus be distinguished in line with Advanced Planning and Scheduling (APS), with short-term and medium-term plans modeled separately while guaranteeing both the accuracy of the short-term plan and the fast solution of the long-term plan, which can greatly shorten the time needed to solve the factory's intelligent scheduling problem.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, serve to provide a further understanding of the application and to enable other features, objects, and advantages of the application to be more apparent. The drawings and their description illustrate the embodiments of the invention and do not limit it. In the drawings:
FIG. 1 is an overall flowchart of the intelligent scheduling method based on deep reinforcement learning according to the present invention;
FIG. 2 is a flow chart of the intelligent scheduling method based on deep reinforcement learning according to the present invention;
FIG. 3 is a flowchart of the offline training of the intelligent scheduling method based on deep reinforcement learning according to the present invention;
FIG. 4 is a flowchart of a time control method of the intelligent scheduling dynamic scheduling method based on deep reinforcement learning according to the present invention.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application; obviously, the described embodiments are only some embodiments of the present application, rather than all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments given herein without creative effort shall fall within the protection scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It should be understood that the data so used may be interchanged under appropriate circumstances so that the embodiments of the application described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In this application, the terms "upper", "lower", "left", "right", "front", "rear", "top", "bottom", "inner", "outer", "middle", "vertical", "horizontal", "lateral", "longitudinal", and the like indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings. These terms are used primarily to better describe the present application and its embodiments, and are not used to limit the indicated devices, elements or components to a particular orientation or to be constructed and operated in a particular orientation.
Moreover, some of the above terms may be used to indicate other meanings besides the orientation or positional relationship, for example, the term "on" may also be used to indicate some kind of attachment or connection relationship in some cases. The specific meaning of these terms in this application will be understood by those of ordinary skill in the art as appropriate.
In addition, the term "plurality" shall mean two or more.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
Referring to fig. 1, the intelligent scheduling dynamic scheduling method based on deep reinforcement learning comprises the following steps:
S1: reading the orders received by the factory at the current moment, the material quantities, the worker shift calendars, and the production calendars of the production lines;
S2: processing the read raw data, and distinguishing short-term plans from long-term plans according to the order delivery dates and the availability of the required materials;
S3: building a deep reinforcement learning framework, and inputting and training production line, process and capacity feature vectors to obtain the target policy network of the target intelligent agent;
S4: considering the starting time and ending time of each process and the time calendar of each production line and machine;
S5: splitting the order ending time into the individual processes.
In S1, the order data includes the quantity of each required product and the product delivery deadline; each product needs to go through several processes, the processes have a certain serial-parallel order, switching machines or materials on a production line requires a certain equipment changeover time, and there is usually a minimum waiting time or maximum waiting time constraint between serial processes.
In S2, the short-term plan requires fine scheduling in units of minutes, and all processes of this part of the orders are fully assigned to production lines; the orders of the long-term plan only need an evaluation of resources such as material quantities, production lines, machines and capacity, with an early warning given when a resource bottleneck exists and a rough scheduling result in units of days provided when there is no bottleneck. The numbers of workers, machines and production lines at each time node are then calculated, and a resource time axis in units of minutes is generated by combining the worker shift calendars and the production-line production calendars.
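A minimal sketch of generating a minute-level resource time axis from a worker shift calendar and a production-line production calendar, as described for S2; the calendar format, combination rule and function names are assumptions, not part of the disclosure:

```python
def minutes_available(calendar, horizon_min):
    """Expand a calendar given as (start_min, end_min) shifts into a per-minute
    0/1 availability array over the scheduling horizon."""
    available = [0] * horizon_min
    for start, end in calendar:
        for t in range(max(0, start), min(horizon_min, end)):
            available[t] = 1
    return available

def resource_time_axis(worker_shifts, line_calendar, horizon_min):
    """Per-minute capacity of a production line: the line must be open and at
    least one worker on shift (a simplified combination rule)."""
    workers = minutes_available(worker_shifts, horizon_min)
    line_open = minutes_available(line_calendar, horizon_min)
    return [w * l for w, l in zip(workers, line_open)]

# example: one 8-hour shift (minutes 0-480) on a line that closes for 30 minutes at noon
axis = resource_time_axis([(0, 480)], [(0, 240), (270, 480)], horizon_min=480)
print(sum(axis))   # -> 450 available minutes
```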
In S3, an Asynchronous Advantage Actor Critic (A3C) model is used, which requires the maximum reward value and the maximum entropy of each selected action, so that the strategy is randomized and the output probability of each action is dispersed as much as possible rather than concentrated on one action; the scheduling target of the deep neural network in the target policy network at the current moment is obtained by processing the production line, process and capacity state feature vectors and feeding them into a classification function, which yields the selection probability corresponding to each optimization target.
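One common way to realize the "classification function" above is a softmax over scores for each optimization target; the sketch below is an assumption about that choice, with illustrative feature dimensions and a single linear scoring layer standing in for the trained network:

```python
import numpy as np

def softmax(x):
    z = x - np.max(x)               # numerical stability
    e = np.exp(z)
    return e / e.sum()

def target_probabilities(line_feat, process_feat, capacity_feat, W, b):
    """Score each optimization target (e.g. makespan, tardiness, idle time) from the
    concatenated state features and normalize the scores into selection probabilities."""
    state = np.concatenate([line_feat, process_feat, capacity_feat])
    scores = W @ state + b
    return softmax(scores)

# example with 3 candidate optimization targets and a 6-dimensional state
rng = np.random.default_rng(0)
probs = target_probabilities(rng.normal(size=2), rng.normal(size=2), rng.normal(size=2),
                             W=rng.normal(size=(3, 6)), b=np.zeros(3))
print(probs, probs.sum())           # probabilities over the 3 targets, summing to 1
```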
Referring to fig. 1 and 4, for the constraints described in S1, the earliest start time and the expected end time of each process are introduced, and all constraints are converted into time axes on the processes and production lines for unified control and updating.
S4 comprises the following steps:
S41: first, for processes with preceding-process requirements, the start time is initialized to a large value;
S42: when all preceding processes are completed, the start time is updated to the maximum ending time of all the preceding processes;
S43: second, for cases with a minimum waiting time or maximum waiting time constraint, after the preceding process finishes, the start time is updated to the ending time of the preceding process plus the minimum/maximum waiting time.
In S5, since the deep reinforcement learning model needs to update the reward value repeatedly, each order should be completed before its delivery date as far as possible, and the importance of the order must also be taken into account, the total available time is divided into an available time for each process according to a certain rule, and the reward function is designed according to static attributes such as whether the order is urgent.
Referring to fig. 1 and 3, the off-line training includes the steps of:
S01: treating each production line as an intelligent agent and generating its target policy network;
S02: updating the reward function network used for the reward value;
S03: storing the state feature vector of each intermediate state, and initializing the parameters of each network.
In each training cycle, a new training environment is randomly generated and A3C is used to pre-train all intelligent agents offline. An optimal process-production line allocation scheme is generated from the target policy network of each agent; for each production line, target decision states such as the latest end time, the idle-time ratio, and whether each process's end time is later than expected are considered to generate the reward function network; the target state value network and the state feature vector of the target agent are updated through a mean squared error (MSE) loss function; and this process continues until the allocation schemes of all processes finally meet the use requirements.
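The MSE update of the target state-value network from stored intermediate-state feature vectors could look roughly as follows; PyTorch, the network sizes and the variable names are assumptions, not part of the disclosure:

```python
import torch
import torch.nn as nn

# value network of one production-line agent (sizes are illustrative)
value_net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)
mse = nn.MSELoss()

def update_value_network(state_features, observed_returns):
    """One mean-squared-error update of the target state-value network from a batch
    of stored intermediate-state feature vectors and their observed returns."""
    predicted = value_net(state_features).squeeze(-1)
    loss = mse(predicted, observed_returns)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# example batch: 16 stored state feature vectors (dimension 8) and their returns
states = torch.randn(16, 8)
returns = torch.randn(16)
print(update_value_network(states, returns))
```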
Referring to fig. 2, in the intelligent scheduling process, the scheduling feature vector of the production lines at the current moment is read first, and the currently executable process vectors are screened according to the preceding processes and material availability. The production line and process vectors are then taken as input and trained in the deep reinforcement learning agent network to obtain the process-production line assignment rule at the current moment. Next, it is judged whether all processes have been allocated to a production line: if not, the time is updated according to the time-axis movement rule, the agent reward network is updated according to the reward value, and the production lines and processes are updated according to the completed tasks before entering a new process-production line assignment round; if so, it is further judged whether the maximum number of iterations has been reached or the objective function has converged: if yes, the deep reinforcement learning intelligent scheduling result is output; if not, the production line scheduling feature vector is read again, until the deep reinforcement learning intelligent scheduling result is output.
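A highly simplified skeleton of this dispatch loop is sketched below; a trivial earliest-available-line rule stands in for the trained agent network, and all data fields and names are illustrative assumptions:

```python
def dispatch(processes, n_lines, max_iters=1000):
    """Skeleton of the fig. 2 loop: repeatedly screen executable processes,
    assign one to a production line, advance time, and stop when all are assigned."""
    line_free_at = [0] * n_lines                 # per-line time axis
    finished = {}                                # process id -> end time
    remaining = {p["id"]: p for p in processes}
    for _ in range(max_iters):
        if not remaining:                        # all processes allocated to a line
            break
        # screen currently executable processes (all predecessors finished)
        ready = [p for p in remaining.values()
                 if all(q in finished for q in p["pred"])]
        if not ready:
            break                                # the full method would advance time here
        p = ready[0]                             # the agent's assignment rule goes here
        line = min(range(n_lines), key=lambda i: line_free_at[i])
        start = max(line_free_at[line],
                    max([finished[q] for q in p["pred"]], default=0))
        end = start + p["dur"]
        line_free_at[line] = end
        finished[p["id"]] = end
        del remaining[p["id"]]
    return finished

jobs = [{"id": "A", "dur": 30, "pred": []},
        {"id": "B", "dur": 20, "pred": ["A"]},
        {"id": "C", "dur": 25, "pred": []}]
print(dispatch(jobs, n_lines=2))   # -> {'A': 30, 'B': 50, 'C': 55}
```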
In summary, the invention provides an intelligent production scheduling dynamic scheduling method based on deep reinforcement learning, comprising the following steps: S1: reading the orders received by the factory at the current moment, the material quantities, the worker shift calendars, and the production calendars of the production lines; S2: processing the read raw data, and distinguishing short-term plans from long-term plans according to the order delivery dates and the availability of the required materials; S3: building a deep reinforcement learning framework, and inputting and training production line, process and capacity feature vectors to obtain the target policy network of the target intelligent agent; S4: considering the starting time and ending time of each process and the time calendar of each production line and machine; S5: splitting the order ending time into the individual processes. The deep reinforcement learning framework uses an Asynchronous Advantage Actor Critic (A3C) model, which requires not only the maximum reward value but also the maximum entropy of each selected action; this randomizes the strategy so that the output probability of each action is dispersed as much as possible rather than concentrated on one action, the solving speed of the A3C deep learning framework is high, and the use requirement of a factory performing intelligent scheduling twice a day can be supported. The method considers the starting time and ending time of each process and the time calendar of each production line and machine, introduces the earliest start time and expected end time of each process, and converts all constraints into time axes on the processes and production lines for unified control and updating: first, for processes with preceding-process requirements, the start time is initialized to a large value, and when all preceding processes are completed, the start time is updated to the maximum ending time of all the preceding processes; second, for cases with a minimum or maximum waiting time constraint, after the preceding process finishes, the start time is updated to the ending time of the preceding process plus the minimum/maximum waiting time. In this way the real requirements of a factory can be fully considered: there is a certain serial-parallel order among processes, switching machines or materials on a production line requires a certain equipment changeover time, serial processes usually have minimum or maximum waiting time constraints, and each order has a deadline and a priority. Finally, the read raw data is processed, and short-term plans are distinguished from long-term plans according to the order delivery dates and the availability of required materials: the short-term plan requires fine scheduling in units of minutes, and all processes of this part of the orders are fully assigned to production lines; the orders of the long-term plan only need an evaluation of resources such as material quantities, production lines, machines and capacity, with an early warning given when a resource bottleneck exists and a rough scheduling result in units of days provided otherwise; the numbers of workers, machines and production lines at each time node are then calculated, and a resource time axis in units of minutes is generated by combining the worker shift calendars and the production-line production calendars. Orders can thus be distinguished in line with Advanced Planning and Scheduling (APS), with short-term and medium-term plans modeled separately while guaranteeing both the accuracy of the short-term plan and the fast solution of the long-term plan, which can greatly shorten the time needed to solve the factory's intelligent scheduling problem.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (10)

1. The intelligent scheduling dynamic scheduling method based on deep reinforcement learning is characterized by comprising the following steps:
S1: reading the orders received by the factory at the current moment, the material quantities, the worker shift calendars, and the production calendars of the production lines;
S2: processing the read raw data, and distinguishing short-term plans from long-term plans according to the order delivery dates and the availability of the required materials;
S3: building a deep reinforcement learning framework, and inputting and training production line, process and capacity feature vectors to obtain the target policy network of the target intelligent agent;
S4: considering the starting time and ending time of each process and the time calendar of each production line and machine;
S5: splitting the order ending time into the individual processes.
2. The intelligent production scheduling dynamic scheduling method based on deep reinforcement learning of claim 1, wherein: in S1, the order data includes the quantity of each required product and the product delivery deadline; each product needs to go through several processes, the processes have a certain serial-parallel order, switching machines or materials on a production line requires a certain equipment changeover time, and there is usually a minimum waiting time or maximum waiting time constraint between serial processes.
3. The intelligent production scheduling dynamic scheduling method based on deep reinforcement learning of claim 2, wherein: in S2, the short-term plan requires fine scheduling in units of minutes, and all processes of this part of the orders are fully assigned to production lines; the orders of the long-term plan only need an evaluation of material quantities, production lines, machines and capacity resources, with an early warning given when a resource bottleneck exists and a rough scheduling result in units of days provided when there is no bottleneck; the numbers of workers, machines and production lines at each time node are calculated, and a resource time axis in units of minutes is generated by combining the worker shift calendars and the production-line production calendars.
4. The intelligent production scheduling dynamic scheduling method based on deep reinforcement learning of claim 3, wherein: in S3, an Asynchronous Advantage Actor Critic (A3C) model is used; in addition to requiring the maximum reward value, the maximum entropy of each selected action is also required, so that the strategy is randomized and the output probability of each action is dispersed as much as possible rather than concentrated on one action.
5. The intelligent production scheduling dynamic scheduling method based on deep reinforcement learning of claim 4, wherein: in S3, the scheduling target of the deep neural network in the target policy network at the current moment is obtained, and the production line, process and capacity state feature vectors are processed and then fed into a classification function to obtain the selection probability corresponding to each optimization target.
6. The intelligent production scheduling dynamic scheduling method based on deep reinforcement learning of claim 5, wherein: for the constraints described in S1, the earliest start time and the expected end time of each process are introduced, and all constraints are converted into time axes on the processes and production lines for unified control and updating.
7. The intelligent production scheduling dynamic scheduling method based on deep reinforcement learning of claim 6, wherein S4 comprises the following steps:
S41: first, for processes with preceding-process requirements, the start time is initialized to a large value;
S42: when all preceding processes are completed, the start time is updated to the maximum ending time of all the preceding processes;
S43: second, for cases with a minimum waiting time or maximum waiting time constraint, after the preceding process finishes, the start time is updated to the ending time of the preceding process plus the minimum/maximum waiting time.
8. The intelligent production scheduling dynamic scheduling method based on deep reinforcement learning of claim 7, wherein: in S5, since the deep reinforcement learning model needs to update the reward value repeatedly, each order should be completed before its delivery date as far as possible, and the importance of the order must also be taken into account, the total available time is divided into an available time for each process according to a certain rule, and the reward function is designed according to static attributes such as whether the order is urgent.
9. The intelligent production scheduling dynamic scheduling method based on deep reinforcement learning of claim 8, wherein: the off-line training comprises the following steps:
S01: treating each production line as an intelligent agent and generating its target policy network;
S02: updating the reward function network used for the reward value;
S03: storing the state feature vector of each intermediate state, and initializing the parameters of each network.
In each training cycle, a new training environment is randomly generated and A3C is used to pre-train all intelligent agents offline; an optimal process-production line allocation scheme is generated from the target policy network of each agent; a reward function network is generated according to target decision states considered for each production line, such as the latest end time, the idle-time ratio and whether the process end time is later than expected; the target state value network and the state feature vector of the target agent are updated through a mean squared error (MSE) loss function; and this process continues until the allocation schemes of all processes finally meet the use requirements.
10. The intelligent production scheduling dynamic scheduling method based on deep reinforcement learning of claim 9, wherein: in the intelligent scheduling process, the scheduling feature vector of the production lines at the current moment is read first, and the currently executable process vectors are screened according to the preceding processes and material availability; the production line and process vectors are then taken as input and trained in the deep reinforcement learning agent network to obtain the process-production line assignment rule at the current moment; it is then judged whether all processes have been allocated to a production line: if not, the time is updated according to the time-axis movement rule, the agent reward network is updated according to the reward value, and the production lines and processes are updated according to the completed tasks before entering a new process-production line assignment round; if so, it is further judged whether the maximum number of iterations has been reached or the objective function has converged: if yes, the deep reinforcement learning intelligent scheduling result is output; if not, the production line scheduling feature vector is read again, until the deep reinforcement learning intelligent scheduling result is output.
CN202111390067.7A 2021-11-22 2021-11-22 Intelligent scheduling dynamic scheduling method based on deep reinforcement learning Pending CN114154821A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111390067.7A CN114154821A (en) 2021-11-22 2021-11-22 Intelligent scheduling dynamic scheduling method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111390067.7A CN114154821A (en) 2021-11-22 2021-11-22 Intelligent scheduling dynamic scheduling method based on deep reinforcement learning

Publications (1)

Publication Number Publication Date
CN114154821A (en) 2022-03-08

Family

ID=80457290

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111390067.7A Pending CN114154821A (en) 2021-11-22 2021-11-22 Intelligent scheduling dynamic scheduling method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN114154821A (en)


Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115167136A (en) * 2022-07-21 2022-10-11 中国人民解放军国防科技大学 Intelligent agent control method based on deep reinforcement learning and conditional entropy bottleneck
CN116307251A (en) * 2023-04-12 2023-06-23 哈尔滨理工大学 Work schedule optimization method based on reinforcement learning
CN116307251B (en) * 2023-04-12 2023-09-19 哈尔滨理工大学 Work schedule optimization method based on reinforcement learning
CN116151599A (en) * 2023-04-21 2023-05-23 湖南维胜科技有限公司 Scheduling data processing method based on deep reinforcement learning
CN117391423A (en) * 2023-12-11 2024-01-12 东北大学 Multi-constraint automatic scheduling method for chip high multilayer ceramic package substrate production line
CN117391423B (en) * 2023-12-11 2024-03-22 东北大学 Multi-constraint automatic scheduling method for chip high multilayer ceramic package substrate production line
CN117634859A (en) * 2024-01-26 2024-03-01 清云小筑(北京)创新技术有限公司 Resource balance construction scheduling method, device and equipment based on deep reinforcement learning
CN117631633A (en) * 2024-01-26 2024-03-01 四川琪达实业集团有限公司 Flexible control system and method for clothing customization production line
CN117634859B (en) * 2024-01-26 2024-04-12 清云小筑(北京)创新技术有限公司 Resource balance construction scheduling method, device and equipment based on deep reinforcement learning
CN117631633B (en) * 2024-01-26 2024-04-19 四川琪达实业集团有限公司 Flexible control system and method for clothing customization production line

Similar Documents

Publication Publication Date Title
CN114154821A (en) Intelligent scheduling dynamic scheduling method based on deep reinforcement learning
Nakasuka et al. Dynamic scheduling system utilizing machine learning as a knowledge acquisition tool
Wang et al. Application of reinforcement learning for agent-based production scheduling
Jones et al. Survey of job shop scheduling techniques
US20210278825A1 (en) Real-Time Production Scheduling with Deep Reinforcement Learning and Monte Carlo Tree Research
O'GRADY et al. An intelligent cell control system for automated manufacturing
Chiu et al. A learning-based methodology for dynamic scheduling in distributed manufacturing systems
CN109270904A (en) A kind of flexible job shop batch dynamic dispatching optimization method
Littman et al. Reinforcement learning: A survey
Qu et al. A centralized reinforcement learning approach for proactive scheduling in manufacturing
CN114503038A (en) Method and apparatus for self-learning manufacturing schedule for flexible manufacturing system using state matrix
McAllister et al. Rescheduling penalties for economic model predictive control and closed-loop scheduling
Palacio et al. A Q-Learning algorithm for flexible job shop scheduling in a real-world manufacturing scenario
CN112488542B (en) Intelligent material scheduling method and system for intelligent building site based on machine learning
Rovithakis et al. Application of a neural-network scheduler on a real manufacturing system
Eberts et al. Distributed planning of collaborative production
Varghese et al. Dynamic spatial block arrangement scheduling in shipbuilding industry using genetic algorithm
Asadi-Zonouz et al. A hybrid unconscious search algorithm for mixed-model assembly line balancing problem with SDST, parallel workstation and learning effect
Libosvar Hierarchies in production management and control: A survey
Michelini et al. Integrated management of concurrent shopfloor operations
Kádár Intelligent approaches to manage changes and disturbances in manufacturing systems
Martinez Solving batch process scheduling/planning tasks using reinforcement learning
Workneh et al. Deep Q Network Method for Dynamic Job Shop Scheduling Problem
WO2024028485A1 (en) Artificial intelligence control and optimization of agent tasks in a warehouse
Sanoff et al. Integrated information processing for production scheduling and control

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination