WO2023046258A1 - Method for generating an optimized production scheduling plan in a flexible manufacturing system - Google Patents

Method for generating an optimized production scheduling plan in a flexible manufacturing system

Info

Publication number
WO2023046258A1
Authority
WO
WIPO (PCT)
Prior art keywords
flexible manufacturing
reinforcement learning
manufacturing system
training
scheduling
Prior art date
Application number
PCT/EP2021/075879
Other languages
French (fr)
Inventor
Schirin BÄR
Jörn PESCHKE
Original Assignee
Siemens Aktiengesellschaft
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Siemens Aktiengesellschaft filed Critical Siemens Aktiengesellschaft
Priority to PCT/EP2021/075879 priority Critical patent/WO2023046258A1/en
Publication of WO2023046258A1 publication Critical patent/WO2023046258A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0633Workflow analysis


Abstract

The invention relates to a method for generating an optimized production scheduling plan for processing of workpieces in a Flexible Manufacturing System. The basic idea of the invention is to observe the behavior of the scheduling system that is currently used within the production system, learn from the observations, and optimize where necessary. This will be done while training a Deep Reinforcement Learning (RL) System for an online scheduling system. The solution supports different local and global optimization goals, e.g. makespan minimization or capacity utilization, which can be adapted independently for each manufactured product.

Description

Method for generating an optimized production scheduling plan in a Flexible Manufacturing System
The invention relates to a method for generating an optimized production scheduling plan for processing of workpieces in a Flexible Manufacturing System.
Scheduling is the process of arranging, controlling and optimizing work and workloads in a production process or manufacturing process. Scheduling is used to allocate plant and machinery resources (in the following also referred to as modules), plan human resources, plan production processes and purchase materials.
It is an important tool for manufacturing and engineering, where it can have a major impact on the productivity of a process. In manufacturing, the purpose of scheduling is to minimize the production time and costs, by telling a production facility what to make, when, with which staff, and on which equipment. Production scheduling aims to maximize the efficiency of the operation and reduce costs.
The key goals in manufacturing are, in a nutshell, makespan minimization, optimal capacity utilization, and finalizing orders on time.
Conventional (offline) scheduling approaches normally generate a long-term plan (e.g. for a week), which is then executed. In case of larger deviations during execution, it is required to trigger a new calculation of the schedule. This calculation is often time-consuming and therefore not an appropriate measure to react to dynamic changes. Thus, offline schedulers typically generate a high-level plan, which serves as input for the detailed execution. The evaluation of the constraints to which the choice of machines and the precise ordering of the production operations are subject is far too complex to be carried out by a human planner. Standard scheduling software is also not prepared to deal with a complex web of flexibility-related degrees of freedom and constraints.
A flexible Product Manufacturing Planning can react faster and more flexibly to deviations, but if the changes are too big (e.g. changes in operation sequences), a rescheduling is required here as well.
One solution known to the professional uses a specific combination of skill matching and constraint solving techniques with a scheduler but is rather complex in execution.
A rescheduling is typically required for different changes in a production system, some examples are:
- Dynamic changes of availability of production resources caused by failure of machines or non-availability of tools or material,
- Failure of (parts of) transportation systems resulting in unattainability of certain production resources, or
- Delay of operations in a way that the planned assignment of other operations is no longer feasible.
If a recalculation of the schedule takes more time than the required response time for dispatching and control and/or routing decisions, this also leads to delays and standstills of the production.
Even new approaches for dynamic production control have limitations with respect to fast adaptation while ensuring an optimized production flow. While ad-hoc dispatching approaches ensure a flexible reaction to changes, they have no forecast capabilities for the actual execution flow and therefore cannot avoid resulting drawbacks such as bottleneck situations in resource usage, nor can they give estimations for the finalization of a certain order. Furthermore, the logic for ad-hoc decisions is configured and implemented in the systems and cannot be adapted to different situations without reconfiguration of the system.
Specific scheduling approaches can provide a solution to ensure the computability of schedules for flexible production systems. Nevertheless, the limitations for rescheduling, as described before, restrict the applicability, if delays in production shall be avoided. This is specifically true for the combination of large production systems with high flexibility (in terms of redundant equipment) and required fast reaction times in production control.
Decentralized approaches (e.g. agent-based systems) have similar limitations: while a fast reaction to changes in the production environment is possible, ensuring an overall optimal solution typically requires coordination or negotiations between agents, which are time-consuming procedures that also limit the possible reaction time.
Another problem of all the above-mentioned approaches is the fact that the reaction of the system to specific situations has to be engineered either explicitly (which is not flexible) or formulated as a mathematical expression that serves as input for a runtime system which then solves an optimization problem. The latter is often not possible (or only with high effort) for a production planning expert.
It is the task of this invention to provide a solution for the above-cited tasks that is able to react in time and still uses the given flexibility of the production system to meet the overall goals for the production orders and the production system.
The named task is solved by the features of independent patent claim 1.
The method to generate an optimized production scheduling plan for processing of a workpiece in a Flexible Manufacturing System, by training a reinforcement learning system, wherein the Flexible Manufacturing System consists of interconnected modules for process-steps of machining the workpiece, wherein at least some of the modules represent decision-making points for the scheduling plan, consists of the following steps:
a) in a first phase of training, training data is used that is derived from the actual Flexible Manufacturing System,
b) in a second phase of training, training data is used that is derived from a simulation model of the Flexible Manufacturing System, and
c) in a third phase of training, further training data is used from the actual Flexible Manufacturing System, wherein the further training data used in the third phase is data of critical situations that occurred in the Flexible Manufacturing System.
Further advantageous embodiments of the invention are described in the dependent claims.
A self-learning approach, based on machine learning techniques as described in the following, avoids these problems.
With the claimed method and system, a scheduling solution is proposed that can be used within flexible manufacturing systems (FMS): a self-learning solution that outperforms any scheduling solution currently used within a plant.
The basic idea is to observe the behavior of the scheduling system that is currently used within the production system, which we call the "conventional" scheduling solution in the following, learn from the observations, and optimize where necessary. This is done while training a Deep Reinforcement Learning (RL) system for an online scheduling system, which offers various advantages.
Reinforcement learning (RL) is an area of machine learning concerned with how intelligent agents ought to take actions in an environment.
A basic reinforcement learning agent interacts with its environment in discrete time steps. At each time t, the agent receives the current state s_t and reward r_t. It then chooses an action a_t from the set of available actions, which is subsequently sent to the environment. The environment moves to a new state s_{t+1} and the reward r_{t+1} associated with the transition (s_t, a_t, s_{t+1}) is determined.
The goal of a reinforcement learning agent is to learn a policy that maximizes the expected cumulative reward. Reinforcement learning differs from supervised learning in not needing labelled input/output pairs to be presented, and in not needing sub-optimal actions to be explicitly corrected. Instead, the focus is on finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge).
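Purely as an illustration of this interaction loop, and not as part of the claimed method, the following Python sketch mimics the agent-environment cycle described above; the Env and Agent classes, their methods and the toy reward logic are hypothetical placeholders.

```python
# Generic agent-environment loop as described above (toy sketch; the Env and Agent classes,
# their methods and the reward logic are placeholders and not part of the application).
import random

class Env:
    def reset(self):
        return 0                                        # initial state s_0
    def step(self, action):
        next_state = random.randint(0, 9)               # new state s_{t+1}
        reward = 1.0 if action == next_state % 2 else 0.0
        done = next_state == 9                          # episode ends in a terminal state
        return next_state, reward, done

class Agent:
    def __init__(self, actions):
        self.actions = actions
    def act(self, state):
        return random.choice(self.actions)              # placeholder policy
    def observe(self, s, a, r, s_next):
        pass                                            # a learning update would go here

env, agent = Env(), Agent(actions=[0, 1])
state, done = env.reset(), False
while not done:
    action = agent.act(state)                           # choose a_t from the available actions
    next_state, reward, done = env.step(action)         # environment returns s_{t+1} and r_{t+1}
    agent.observe(state, action, reward, next_state)
    state = next_state
```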
The advantage of the proposed approach, in comparison to the usage of standard machine learning solutions to provide a schedule for production control, is the shorter training phase in general and the higher certainty that the results provided by the machine learning-based scheduling will be better than or at least equal to those of conventional scheduling solutions.
The environment is typically stated in the form of a Markov decision process (MDP), because many reinforcement learning algorithms for this context use dynamic programming techniques. Reinforcement learning addresses MDPs where the transition probabilities or rewards are unknown: it can solve Markov decision processes without an explicit specification of the transition probabilities, whereas the values of the transition probabilities are needed in value and policy iteration. In reinforcement learning, instead of an explicit specification of the transition probabilities, the transition probabilities are accessed through a simulator that is typically restarted many times from a uniformly random initial state.
The main difference between the classical dynamic programming methods and reinforcement learning algorithms is that the latter do not assume knowledge of an exact mathematical model of the Markov decision process (MDP) and that they target large MDPs where exact methods become infeasible.
Deep reinforcement learning extends reinforcement learning by using a deep neural network and without explicitly designing the state space.
The great advantage of using deep Reinforcement Learning (RL), i.e. using Reinforcement Learning methods together with neural function approximators such as neural networks, is that the action we expect the Reinforcement Learning agent to perform for product control, and therefore for scheduling, is received immediately because of the short inference time of the neural network.
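As a hedged illustration of why inference is fast, the sketch below evaluates a small feed-forward policy network with a single forward pass; the layer sizes, random weights and the two-action output are assumptions made only for this example.

```python
# Small feed-forward policy evaluated with one fast forward pass (sizes and weights are
# random placeholders; 32 state features and 2 actions are assumptions for this sketch).
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(32, 16)), np.zeros(16)   # input layer: state vector -> hidden
W2, b2 = rng.normal(size=(16, 2)), np.zeros(2)     # output layer: hidden -> action logits

def policy(state_vector):
    hidden = np.maximum(state_vector @ W1 + b1, 0.0)   # ReLU activation
    logits = hidden @ W2 + b2
    return int(np.argmax(logits))                      # scheduling action, available immediately

action = policy(rng.normal(size=32))                   # e.g. 0 = stay on belt, 1 = enter module
```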
In one embodiment of the invention, to use Deep Reinforcement Learning methods for controlling products through the plant, including transportation decisions and dispatching to modules, the Markov decision process can be defined as follows:
The state of the system includes information about the module topology of the Flexible Manufacturing System FMS, the position of all products currently produced, as well as their job specification (available modules and characteristic per operation) and progress.
The action of the Reinforcement Learning agent is to choose the direction of the product. In the described embodiment, the workpiece to be machined either stays on the conveyor belt or goes into a module to undergo a machining operation.
An agent is called at each decision-making point, e.g. each junction within the Flexible Manufacturing System FMS, to perform the direction decision including the module assignment. The Reinforcement Learning agent is trained as known from the classical methodology: observing the state input at each time step, selecting and performing an action, and receiving a reward that is used to adapt the strategy if needed.
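The following sketch is one possible, purely illustrative way to encode such a per-junction decision; the state fields, the two actions and the decide() helper are assumptions and do not represent the application's actual data model.

```python
# Hypothetical encoding of the per-junction decision described above (illustration only;
# the field names and the two actions are assumptions, not the application's data model).
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List

class Action(Enum):
    STAY_ON_BELT = 0    # the workpiece continues on the conveyor belt
    ENTER_MODULE = 1    # the workpiece is dispatched into the adjacent module

@dataclass
class JunctionState:
    module_topology: Dict[str, List[str]]   # junction id -> modules reachable from it
    product_positions: Dict[str, str]       # product id -> current junction or module
    job_spec: Dict[str, List[str]]          # product id -> modules capable of its next operation
    progress: Dict[str, int]                # product id -> number of completed operations

def decide(agent, state: JunctionState, product_id: str) -> Action:
    """The agent is called at a decision-making point (junction) and returns a direction."""
    return agent.act(state, product_id)     # placeholder call to a trained RL agent
```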
In another embodiment of the invention, the Markov decision process (MDP) can also be defined in a different way. In the first case, Reinforcement Learning (RL) agents control the products (or workpieces deemed to become products), but there are also approaches to control the machines. For a better understanding, the invention is also depicted in the figures, which show:
Figure 1 the system and method during the training phases,
Figure 2 the application of the Reinforcement Learning agent to the real Flexible Manufacturing System FMS, and
Figure 3 the three training phases of the proposed method.
The procedure consists of three training phases that are depicted in Figures 1, 2 and 3.
In the first phase 1, the Reinforcement Learning agent 12, which is the key component of the (Deep) Reinforcement Learning scheduling system, is trained to become as good as the conventional scheduling; then, in training phase 2, it is fine-tuned to become better; and lastly, in training phase 3, critical situations are explicitly given to the Reinforcement Learning agent so that it is able to cope with them during execution.
In the worst case, such critical situations in a Flexible Manufacturing System lead to a situation where the automation system does not know the exact current state of the production services, which can lead to unpredictable process behavior.
In a machining station, for example, once the orientation of a workpiece has been checked, a gripper or any robot arm is responsible for picking the workpiece and placing it in a conveyor system. Any interruption during the execution of this task implies a loss of reference of the current state and is therefore considered a critical situation.
One example is a loss of communication with the robot arm, in which the PLC that holds the corresponding production service does not know the position of the robot; the connection to the robot needs to be recovered and the execution of the code resumed. In another example, where multiple actuators can perform their operations in parallel on a workpiece, as they are allocated and arranged in a suitable manner, a critical situation can arise due to a collision of actuators, which leads to the need to interrupt the execution.
Within the three training phases 1 to 3, we observe the behavior of the conventional scheduling system 41 to collect data 11 that can be used to train the Reinforcement Learning agent 12, by observing every product within the Flexible Manufacturing System FMS 31 and storing the state information every time a product passes a decision-making point, which we defined in one advantageous embodiment of the invention as the junctions of the Flexible Manufacturing System.
This data in the proposed Reinforcement Learning system consists of the tuple
[s, a, s', r, o], where s is the current state within the Flexible Manufacturing System FMS (e.g. at each junction, including the information mentioned in the description of the state), a is the action that was performed at the according junction, i.e. whether the product goes into the next module or stays on the conveyor belt, s' is the following state with the according information, r is the reward to be received after reaching the next state s', and o is the optimization objective the considered product is currently optimized for.
By collecting these tuples, the behavior of the conventional scheduling system 41 is translated and mapped to the data 11 that can be used to train a Reinforcement Learning agent. Usually, this data would be gained by applying the Reinforcement Learning agent directly to the real environment or a simulation of the environment and, e.g., also be stored within a replay memory 11. From this replay memory, data is sampled and used for updating the policy of the Reinforcement Learning agent.
For each optimization objective, either sub-policies can be trained, which are then used accordingly during runtime, or this information is also given to the Reinforcement Learning agent as an input. With this initial training phase, we aim to bring the Reinforcement Learning agent to the same level as the conventional scheduling system.
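A minimal, non-authoritative sketch of how the observed [s, a, s', r, o] tuples could be held in a replay memory 11 and sampled for policy updates is given below; the ReplayMemory class, its capacity and the example entries are assumptions made for illustration.

```python
# Illustrative replay memory holding the observed [s, a, s', r, o] tuples (sketch only).
import random
from collections import deque, namedtuple

Transition = namedtuple("Transition", ["s", "a", "s_next", "r", "o"])

class ReplayMemory:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, s_next, r, o):
        """Store one observed decision of the conventional scheduler (or of the RL agent)."""
        self.buffer.append(Transition(s, a, s_next, r, o))

    def sample(self, batch_size=32):
        """Draw a mini-batch used to update the policy of the RL agent."""
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))

# Phase 1: the conventional scheduler is observed at every decision-making point and its
# behavior is translated into training data. The objective o can either select a trained
# sub-policy at runtime or be fed to the agent as an additional input (assumed values below).
memory = ReplayMemory()
memory.add(s="junction_3", a="ENTER_MODULE", s_next="module_M2", r=0.5, o="makespan")
batch = memory.sample()
```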
At the end of the training, a scheduling execution order 42 is created, which can control the Flexible Manufacturing System via the machine interface 33 of the Flexible Manufacturing System.
In the second phase 2 of the depicted method, the Reinforcement Learning scheduling system will be optimized. To this end, situations are identified in which the performance of the conventional scheduling was poor. In a next step, the Reinforcement Learning scheduling system will be trained for exactly these situations, i.e. in a simulation environment with a high exploration rate, which is a hyperparameter within the Reinforcement Learning training that can be set.
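The following sketch illustrates, under assumed interfaces, what such phase-2 fine-tuning with a high exploration rate could look like in an epsilon-greedy setting; the simulation environment, its methods and the chosen epsilon are placeholders, not the actual training procedure.

```python
# Sketch of phase-2 fine-tuning in simulation with a deliberately high exploration rate.
# The simulation environment, the agent interface and epsilon=0.8 are assumptions.
import random

def train_on_poor_situations(agent, sim_env, situations, epsilon=0.8, episodes=100):
    """Replay situations in which the conventional scheduling performed poorly,
    exploring aggressively (high epsilon) to find better strategies."""
    for _ in range(episodes):
        state = sim_env.reset(random.choice(situations))   # start from an identified situation
        done = False
        while not done:
            if random.random() < epsilon:
                action = random.choice(sim_env.available_actions(state))  # explore
            else:
                action = agent.act(state)                                 # exploit
            next_state, reward, done = sim_env.step(action)
            agent.observe(state, action, reward, next_state)              # policy update hook
            state = next_state
```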
Within the next phase 3, situations that are known to be very complex, exotic or difficult to express in a mathematical way (such as complex constraints that are hard for the user to describe) will be explicitly learned, trained and optimized. The user knows these situations and can actively create such situations within the Flexible Manufacturing System FMS. The same procedure as described in phase 1 is applied to collect the initial data from the real plant. This data is then added to the memory 11 to train the Reinforcement Learning agent with. In this phase, there is no need to describe the constraints and add them to the reward function; instead, the situations are advantageously added directly to the replay memory.
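A short, hypothetical sketch of how such user-created critical situations could be appended directly to the replay memory, instead of being encoded in the reward function, follows; the plant observation helper is an assumed interface, not part of the application.

```python
# Hypothetical phase-3 data collection: the user actively creates a critical situation in
# the real plant and the resulting transitions are added to the replay memory as-is, so the
# constraints never have to be expressed in the reward function (plant.observe is assumed).
def record_critical_situation(memory, plant, created_situation, objective):
    """Observe the plant while the user-created critical situation unfolds."""
    for s, a, s_next, r in plant.observe(created_situation):
        memory.add(s, a, s_next, r, objective)
```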
Finally, when applying the Reinforcement Learning agent to the real Flexible Manufacturing System FMS, there can be a continuous training phase that utilizes a simulation model, also known as a digital shadow, of the applied Reinforcement Learning agent, which is shown in Figure 2. The presentation in Figure 2 corresponds in most parts to Figure 1; only training phase 3 is omitted. Identical reference signs denote identical parts of the procedure.
The data of the applied Reinforcement Learning agent is collected again in phase 1, as described above for Figure 1, stored in the replay memory 11 and used to train and improve the "digital shadow" Reinforcement Learning agent by giving an action 22 and collecting the state and reward 23.
The tracking of the performance of the (digital shadow) Reinforcement Learning agent can easily be done by comparing the reward received after its actions with the reward received by the applied Reinforcement Learning agent. When the digital shadow becomes remarkably better, it can be deployed to the real system. By this, we achieve a continuous improvement of the online scheduling system, including exotic and challenging situations that are simply experienced over time. During runtime, the Reinforcement Learning instance is applied for as many products as should be produced. Another advantageous embodiment is to train multiple Reinforcement Learning agents and deploy them. At each decision-making point, the agent controlling the considered product is called. The current state is given to it as an input and, as an action, the direction decision is returned, resulting in an online scheduling approach that controls all products at runtime.
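By way of illustration only, the sketch below compares the average rewards of the applied agent and its digital shadow and derives a deployment decision; the additive margin and the example reward values are assumptions.

```python
# Illustrative reward-based comparison between the applied agent and its digital shadow;
# the additive margin and the example reward lists are assumptions.
def shadow_outperforms(applied_rewards, shadow_rewards, margin=0.05):
    """Deploy the shadow when its average reward is remarkably better than the applied agent's."""
    avg_applied = sum(applied_rewards) / len(applied_rewards)
    avg_shadow = sum(shadow_rewards) / len(shadow_rewards)
    return avg_shadow > avg_applied + margin

if shadow_outperforms([0.80, 0.70, 0.90], [0.92, 0.85, 0.95]):
    print("deploy the digital-shadow agent to the real system")
```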
The described invention provides a number of advantages compared to conventional scheduling approaches:
For any kind of conventional scheduling system that is applied, this invention will improve the performance of the overall scheduling system, specifically in terms of reaction to changes during execution time and to unforeseen situations.
The strategy the RL agent has learned is applicable to unseen situations, as this is one of the main characteristics of neural networks: generalization. This means that the RL agent does not learn the state-action mapping by heart, but rather learns to interpret the situation and how to act within it.
By training a shadow scheduling system with the data observed during runtime, continuous improvement will be reached.
Fast adaptation of the schedule in case of deviations that require a rescheduling, without dependency on calculation time.
Exotic situations can be solved without the need to describe constraints mathematically. The solution supports different local and global optimization goals (e.g. makespan minimization, capacity utilization), which can be adapted independently for each product.

Claims

Patent claims
1. Method to generate an optimized production scheduling plan (41) for processing of a workpiece in a Flexible Manufacturing System (31), by training a reinforcement learning system, the Flexible Manufacturing System (31) consisting of interconnected modules (M1, ... M5) for process-steps of machining the workpiece, wherein at least some of the modules represent decision-making points for the production scheduling plan, with the following steps: a) in a first phase of training, training data (11) is used that is derived from the behavior of the actual Flexible Manufacturing System (31), b) in a second phase of training, training data is used, that is derived from a simulation model (21) of the Flexible Manufacturing System, and c) in a third phase (3) of training, further training data is used from the actual Flexible Manufacturing System (31), characterized in that the further training data used in the third phase (3) is data of critical situations (32) that occurred in the Flexible Manufacturing System (FMS).
2. Method according to claim 1, characterized in that the reinforcement learning system used is a Deep Reinforcement Learning System.
3. Method according to claim 1 or 2, characterized in that the Markov Decision process used in the Reinforcement Learning System is used for Scheduling for process-steps of machining the workpiece in order to produce a product, taking in consideration at least information regarding product history, transportation decisions or dispatching of the workpiece to a module.
4. Method according to claim 1 or 2, characterized in that the Markov Decision process used in the Reinforcement Learning System is used for scheduling of process-steps for a module of the Flexible Manufacturing System, taking in consideration at least information on the topology of the Flexible Manufacturing System (31) or job specification of modules.
5. Method according to one of the preceding claims, characterized in that the Reinforcement Learning System works with a tuple of values [s, a, s', r, o], wherein s is the current state within the Flexible Manufacturing System FMS, a is the action that was performed, s' is the following state, r is the reward to be received after reaching the next state s' and o is the optimization objective the considered product is currently optimized for.
6. Method according to claim 5, characterized in that situations during the production, in which the performance of the scheduling did not meet the optimization objective are identified and the training of the generation of the production scheduling plan is optimized by training the Reinforcement Learning System with the identified situations .
7. Method according to one of the preceding claims, characterized in that for the Flexible Manufacturing System (31) multiple reinforcement learning systems are trained, one for each workpiece that is manufactured or one for each module that is used in the Flexible Manufacturing System.
PCT/EP2021/075879 2021-09-21 2021-09-21 Method for generating an optimized production scheduling plan in a flexible manufacturing system WO2023046258A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/EP2021/075879 WO2023046258A1 (en) 2021-09-21 2021-09-21 Method for generating an optimized production scheduling plan in a flexible manufacturing system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2021/075879 WO2023046258A1 (en) 2021-09-21 2021-09-21 Method for generating an optimized production scheduling plan in a flexible manufacturing system

Publications (1)

Publication Number Publication Date
WO2023046258A1 true WO2023046258A1 (en) 2023-03-30

Family

ID=78078176

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2021/075879 WO2023046258A1 (en) 2021-09-21 2021-09-21 Method for generating an optimized production scheduling plan in a flexible manufacturing system

Country Status (1)

Country Link
WO (1) WO2023046258A1 (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210247744A1 (en) * 2018-08-09 2021-08-12 Siemens Aktiengesellschaft Manufacturing process control using constrained reinforcement machine learning
WO2021052589A1 (en) * 2019-09-19 2021-03-25 Siemens Aktiengesellschaft Method for self-learning manufacturing scheduling for a flexible manufacturing system and device

Similar Documents

Publication Publication Date Title
Zäh et al. The cognitive factory
Dittrich et al. Cooperative multi-agent system for production control using reinforcement learning
Lou et al. Multi-agent-based proactive–reactive scheduling for a job shop
Kück et al. Potential of data-driven simulation-based optimization for adaptive scheduling and control of dynamic manufacturing systems
Gen et al. Evolutionary techniques for optimization problems in integrated manufacturing system: State-of-the-art-survey
Liao et al. Daily scheduling for R&D semiconductor fabrication
JP2000077289A (en) Production predicting control system
Brettel et al. Enablers for self-optimizing production systems in the context of industrie 4.0
CN114154821A (en) Intelligent scheduling dynamic scheduling method based on deep reinforcement learning
Zaeh et al. Adaptive job control in the cognitive factory
Berger et al. Approach for an event-driven production control for cyber-physical production systems
Berger et al. Towards a data-driven predictive-reactive production scheduling approach based on inventory availability
Latif et al. A simulation algorithm of a digital twin for manual assembly process
Schwung et al. An application of reinforcement learning algorithms to industrial multi-robot stations for cooperative handling operation
Snyman et al. Real-time scheduling in a sensorised factory using cloud-based simulation with mobile device access
Bhatta et al. An integrated control strategy for simultaneous robot assignment, tool change and preventive maintenance scheduling using Heterogeneous Graph Neural Network
WO2023046258A1 (en) Method for generating an optimized production scheduling plan in a flexible manufacturing system
Lepuschitz et al. Evaluation of a multi-agent approach for a real transportation system
Li et al. Human-system cooperative hybrid augmented intelligence based dynamic dispatching framework of semiconductor wafer fabrication facility
Giner et al. Demonstrating reinforcement learning for maintenance scheduling in a production environment
Milosavljevic et al. Time-optimal path-following operation in the presence of uncertainty
Arinez et al. Gantry scheduling for two-machine one-buffer composite work cell by reinforcement learning
Fang et al. An adaptive job shop scheduling mechanism for disturbances by running reinforcement learning in digital twin environment
Wang et al. Research on Disturbance for Job-shops Based on Production Data
Tang et al. A Model Predictive Control for Lot Sizing and Scheduling Optimization in the Process Industry under Bidirectional Uncertainty of Production Ability and Market Demand

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21786339

Country of ref document: EP

Kind code of ref document: A1

DPE1 Request for preliminary examination filed after expiration of 19th month from priority date (pct application filed from 20040101)