EP4139849A1

EP4139849A1 - Method for configuring components in a system by means of multi-agent reinforcement learning, computer-readable storage medium, and system

Info

Publication number: EP4139849A1
Application number: EP20735060.4A
Authority: EP
Inventors: Michael Wieczorek; Schirin BÄR; Jörn PESCHKE
Original assignee: Siemens AG
Current assignee: Siemens AG
Priority date: 2020-06-08
Filing date: 2020-06-08
Publication date: 2023-03-01
Also published as: CN115699030A; WO2021249616A1; US20230259073A1

Abstract

Software systems consisting of a plurality of components often require said components to be configured so that said components can perform their task in an optimal manner for a particular application. The invention relates to a method for configuring a software system which consists of a plurality of components. To this end, two different alternatives are provided: a) mode 1, i.e. with offensive training, for quickly learning new situations: the range of values and the step size of the parameters are restricted to such an extent that only non-critical changes are possible with one action. Alternatively, b) mode 2 is used, i.e. defensive training, with continuous learning: the range of values and the step size of the parameters are restricted so that the changes do not significantly worsen the target variables; the Epsilon-Greedy value ε is set to a lower value.

Description

Method for configuring components in a system with the aid of multi-agent reinforcement learning, computer-readable storage medium and system

Software systems that consist of several components often require these components to be configured so that each component can perform its task optimally for a particular application. In simpler cases, this can be done manually or with the help of a control loop.

Examples of such configurations are the distribution of the computing load over several processor cores, the size of the shared memory or the maximum number of possible communication packets.

If the influencing factors (manipulated variables, disturbance variables, controlled variables, ...) become more numerous and the relationships more complex, finding an optimum is very difficult and may only be possible with the help of empirical optimization approaches or an AI model adapted / trained by means of machine learning.

Generally speaking, machine learning is divided into unsupervised machine learning, supervised machine learning and reinforcement learning, which focuses on finding intelligent solutions to complex control problems.

It becomes even more difficult if the component in question undergoes changes during runtime and no training data is or was available for these cases. The dynamic addition of further components with new parameters and influences also increases the complexity of the task. In addition, cross-component boundary conditions must also be observed. These can also change over the lifetime of the component.

For aspects that only occur at runtime, such as changes within a component, adding further components or changes to the superordinate boundary conditions, it is usually necessary to adapt the configuration of the computer system. In the case of an AI-based solution, post-training of the AI model is then necessary while the overall system is running. It must be ensured that no changes are made during an exploration that lead to undesired behavior in the productive system.

The adjustments serve, for example, the following goals:

- increase in productivity,

- improvement of quality,

- increase in data throughput,

- ensuring or increasing stability,

- increase in reserves for maximum capacity utilization,

- cushioning of power peaks, and

- early detection of instabilities (storage, network, communication, ...).

An example of such a system is the central component of Siemens HMI Operate, the Control Access Point (CAP), together with the components involved (COS task, NCK, ...), whose interaction is today subject to a static configuration / parameterization during runtime and which can therefore react only inadequately or not at all to different load scenarios. In particular for future applications in the area of OPC-UA, Big Data, Smart Data, Edge, modular operating concepts or production / machine-specific applications, the current performance of the more or less statically interacting components will no longer be sufficient, since a higher data throughput must be achieved than before while stability and reserves for potential load peaks must be guaranteed at the same time.

Today, complex industrial control systems such as CNC machines, each with cooperating components, are often configured or optimized separately. Adjustments to the trained system in a changing environment, especially during runtime, are carried out manually, if at all.

For example, a few manipulated variables of the system (e.g. the number of threads depending on the number of cores) are changed manually based on empirical values (possibly also application-specific) before a restart of the operating program in order to parameterize HMI Operate for a particular scenario. Only a few parameters can be adjusted while the system is running in order to ensure a higher throughput or better stability.

Today's solutions do not take into account the fact that adjustments during runtime can only be carried out in a secure framework in order to prevent undesired behavior during productive operation.

The document US 2019/0244099 A1 already describes a reinforcement learning system that trains a system during its runtime, for example also in an industrial environment, for controlling robots to complete a specific task.

The invention is therefore based on the problem of specifying a method that is intended to enable the parameterization / configuration of complex systems to be adapted at runtime.

The problem is solved by the method according to the features of patent claim 1. Furthermore, the problem is solved by a computer-readable storage medium according to the features of claim 8 and a system according to the features of claim 15.

Machine learning concepts are used for the solution.

The method according to the invention is used to configure components in a system with factors that influence the components, the components being in an operative relationship to one another, the status of the components being determinable by collecting internal measured variables and the status of the system being determinable by collecting measured variables of the overall system, using the means of a reinforcement learning system. It is based on at least one agent and information on the associated environment. During the runtime of the system, the components in the system are first placed in a training mode in order to start a training phase in a first state. The method has the following steps:

a) at least one agent of an associated component is called,

b) after the agent's first training action, the status of the components and / or the overall system is re-evaluated by again collecting the measured variables, in order to then carry out one of the following steps, depending on the result:

c1) if the measured variables are constant or improved: carry out the next action,

c2) if the measured variables deteriorate: set the epsilon-greedy value for the training to zero, carry out the next action,

c3) in the event of a critical deterioration in the measured variables, especially with real-time behavior: the training is interrupted and the system is returned to the initial state, continue with step a),

d) in cases c1 and c2, repeat steps b) and c) until completion of a reinforcement learning episode,

e) update the agent's strategy (policy),

f) repeat steps a) to e) for the next agent.

Advantageous exemplary embodiments are specified in the subclaims.

The invention is illustrated below with reference to figures, in which:

FIG. 1 shows an architecture overview of the system, with a reinforcement learning engine,

FIG. 2 shows a reinforcement learning agent, and

FIG. 3 shows a schematic sequence of defensive online training.

An already pre-trained system and model, provided for example by the manufacturer, can also be used as a starting point. This is also the case when building on an existing system. However, the agent can also be trained in the real application right from the start if this is not possible otherwise, for example if a simulation would be too complex or too imprecise. This so-called offensive online training can lead to a long training duration if no pre-trained model was used. In order to then adapt these models / agents to new, changed or changing environments (possibly even during production), i.e. to retrain the configuration parameters, what is known as defensive online training is used, in which there is advantageously a separate agent for each component (multi-agent system).

Multi-agent systems are known from the following publications, among others:

Lowe R., Harb J., Wu Y., Abbeel P., Tamar A., Mordatch I.: Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments, arXiv:1706.02275v3, or Rashid T., Samvelyan M., Schroeder de Witt C., Farquhar G., Foerster J., Whiteson S.: QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning, arXiv:1803.11485v2, or Kaiqing Zhang, Zhuoran Yang, Tamer Başar: Multi-Agent Reinforcement Learning: A Selective Overview of Theories and Algorithms, arXiv:1911.10635.

In the current state of the art, there are two common forms of a multi-agent reinforcement learning system:

The following description can also be understood by way of example with reference to FIG. 1. In the first form, the status of each agent A, A1, A2 includes the status of its own component and the status of the overall system, in particular the measured variables (e.g. CPU load or network load) that characterize the properties of the overall system to be optimized.

Or, in the alternative form, it only includes the partial status of the system to be optimized, but then includes a downstream network that receives the actions of the other (dependent) agents as input and thus indirect information about the overall status.

The possible actions of an agent A, A1, A2 relate to the change in the configuration parameters of the individual component. All agents are advantageously combined in an AI-RL engine, which can lie outside the components or is connected to instances of agents within the components.

The agents are configured for individual components as shown in more detail in Figure 2:

- The actions a, a_i: changes in component-specific control variables 113, 116 within a restricted value range, for example by specifying min-max values or a specified set of values (e.g. allocation of additional memory only in 1024 kByte steps).

- The status s1, s2, s3, s_{i+1}: contains measured variables that describe the status of the overall system, for example diagnostic data, internal measured variables and more.

- The environment E, E1, E2: corresponds to the respective component, e.g. Control Access Point CAP, Numeric Control Kernel NCK, ...

- The reward r, r_i, r_{i+1}: is calculated from the respective measured variables.
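The restricted action space from the list above can be sketched as follows. This is only an illustration under stated assumptions, not the implementation of the application: the class name `RestrictedParameter`, its fields and the memory limits are hypothetical; only the 1024 kByte step size is taken from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RestrictedParameter:
    """A control variable that may only change in fixed steps within [minimum, maximum]."""
    name: str
    minimum: int
    maximum: int
    step: int

    def allowed_actions(self, current: int) -> list[int]:
        """Values reachable with one action: one step down, unchanged, one step up."""
        candidates = (current - self.step, current, current + self.step)
        return [v for v in candidates if self.minimum <= v <= self.maximum]


# Hypothetical limits; only the 1024 kByte step size comes from the text.
shared_memory = RestrictedParameter("shared_memory_kbyte", minimum=4096, maximum=16384, step=1024)
print(shared_memory.allowed_actions(4096))  # at the lower bound no step down is offered
print(shared_memory.allowed_actions(8192))
```

Because the agent can only ever pick from `allowed_actions`, a single action cannot leave the predetermined value range, whatever the policy proposes.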

With this type of learning in the production environment, it is important that the constant change ("trial & error") of the control variables is controlled and carried out carefully, so that the adaptation always stays within a certain, predetermined, possibly multidimensional parameter space. This ensures that neither production (workpieces) nor the safety of the system and of any people moving in it is endangered.

A distinction is made between two modes for adapting the system during runtime, each of which is implemented by the corresponding modes of the AI-RL engine.

In the first mode, in order to train the system after a major change or for a new application, an explicit training phase is carried out during runtime, in which the user initiates typical runs (with CNC machines e.g. the production of a typical product) with the aim of training the system for the new conditions. This training phase comprises a predefined period of time or a number of parts and is typically carried out at the end user in the production environment.

In the course of the training, the agent selects the configuration parameters of the system from its defined scope of action. This consists of discrete parameters that are restricted in such a way that, for example, damage to products or the machine as well as an excessive drop in the performance of the system are avoided.

A so-called "greedy" algorithm is used, in which the subsequent state that promises the greatest gain or the best result at the time of selection is chosen step by step. The exploration-exploitation rate (exploration vs. exploitation) is defined in such a way that the agent often makes randomized decisions in order to try out many new things (exploration) and thus to enable a quick adaptation to the changed system.
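The interplay of greedy selection and random exploration described above is commonly realized as epsilon-greedy action selection. The following sketch assumes that the agent holds a value estimate per action; the function name `select_action`, the action names and the numbers are illustrative and do not appear in the application.

```python
import random


def select_action(q_values: dict[str, float], epsilon: float, rng: random.Random) -> str:
    """Pick a random action with probability epsilon, otherwise the best-valued one."""
    if rng.random() < epsilon:
        return rng.choice(sorted(q_values))   # exploration: try something new
    return max(q_values, key=q_values.get)    # exploitation: greedy choice


# Hypothetical value estimates for three actions on one configuration parameter.
estimates = {"decrease": -0.2, "keep": 0.1, "increase": 0.4}
rng = random.Random(0)
print(select_action(estimates, epsilon=0.0, rng=rng))  # increase
```

A high `epsilon` corresponds to the frequent randomized decisions of the offensive training; with `epsilon=0.0` the agent never explores.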

The second mode is intended to make continuous adjustments with minor changes during operation, i.e. the production phase. Even with random changes by the agents, the target variables to be optimized for the overall system should not deteriorate so much that, for example, the quality of the workpiece to be manufactured drops below a certain limit value.

In an advantageous embodiment of the invention, this second mode is used during normal productive operation, i.e. the target values may only deteriorate to the extent that this is directly acceptable for the resulting properties of production, such as throughput, quality and wear. In practice, for example, this could be implemented in such a way that the fluctuations observed in normal operation (i.e. without optimization) plus a certain mark-up would be tolerated.
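One possible reading of "fluctuations observed in normal operation plus a certain mark-up" is sketched below. The application does not give a formula; the function `tolerance_limit`, the use of the largest deviation from the mean as the fluctuation, the sample data and the 50% mark-up are all assumptions, and the target variable is assumed to be one where higher is better.

```python
from statistics import mean


def tolerance_limit(baseline: list[float], markup: float) -> float:
    """Lowest acceptable target value: baseline mean minus the widened fluctuation."""
    centre = mean(baseline)
    fluctuation = max(abs(x - centre) for x in baseline)
    return centre - fluctuation * (1.0 + markup)


# Hypothetical throughput samples (parts per hour) from normal operation without optimization.
baseline = [100.0, 98.0, 102.0, 100.0]
limit = tolerance_limit(baseline, markup=0.5)
print(limit)  # 97.0
```

During defensive training, a measured throughput below `limit` would count as an unacceptable deterioration.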

This is achieved in that, on the one hand, the discrete steps in the changes to the configuration parameters are so small that a single action cannot cause a defined limit value (e.g. performance value, load, speed or temperature) to be violated. In addition, the proportion of random changes made by the agent for exploration is relatively small; for example, the epsilon-greedy value is set to ε = 10%.
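The difference between the two modes can be expressed as a small configuration record. In this sketch only the 10% epsilon of the defensive mode comes from the text; the names `TrainingMode` and `clamp_action` and all other numbers are placeholders chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingMode:
    name: str
    epsilon: float                 # share of random (exploratory) actions
    step_size: int                 # largest parameter change per action
    value_range: tuple[int, int]   # permitted minimum and maximum


# Placeholder numbers, except the 10% epsilon of the defensive mode.
OFFENSIVE = TrainingMode("offensive", epsilon=0.5, step_size=4, value_range=(1, 64))
DEFENSIVE = TrainingMode("defensive", epsilon=0.1, step_size=1, value_range=(8, 32))


def clamp_action(mode: TrainingMode, current: int, requested: int) -> int:
    """Limit a requested parameter value to the mode's step size and value range."""
    low, high = mode.value_range
    delta = max(-mode.step_size, min(mode.step_size, requested - current))
    return max(low, min(high, current + delta))


print(clamp_action(DEFENSIVE, current=16, requested=30))  # 17
print(clamp_action(OFFENSIVE, current=16, requested=30))  # 20
```

Switching modes then only means swapping the record that the agent consults before each action.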

Figure 3 shows schematically how defensive online training is implemented. At the beginning of the training, the agent is in state s1 and selects a random action a1 (see Fig. 2). The real system adopts the configuration parameter changes selected by the action and continues the process. When the system has stabilized after the changes, the measured data define the subsequent state s2. If the state s2 represents a deterioration in the target variables compared to the previous state s1, the epsilon-greedy value ε is set directly to zero in order not to allow any further exploration. The agent should then use its existing knowledge to bring the system back to the starting state s1. After a defined episode length (e.g. a maximum of 10 steps), the agent's strategy (also called policy) is updated. The strategy describes the relationship between the state of the system and the action that an agent carries out based on it. Through this adaptation the agent can react better and better; this is where the actual learning process takes place. In real-time-critical systems, if there is a deterioration from one state to the next (e.g. if the timing specifications are expected to be missed), the episode is ended immediately and the system is reset directly to the saved configuration parameters of state s1. The agent's strategy (policy) is updated immediately after the episode is aborted. This avoids a deterioration beyond the specified limits as a result of changes selected randomly one after the other.
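A single defensive episode as described above might look like the following sketch. It is a simplified illustration under several assumptions: the configuration is one integer, `measure` stands in for reading the target variable of the real system (higher is better), `greedy_step` stands in for the agent's current policy, and a value below `critical_limit` models the real-time-critical case. None of these names come from the application, and the policy update itself is omitted.

```python
import random
from typing import Callable, Optional


def run_defensive_episode(
    config: int,
    measure: Callable[[int], float],
    greedy_step: Callable[[int], int],
    critical_limit: float,
    epsilon: float = 0.1,
    max_steps: int = 10,
    rng: Optional[random.Random] = None,
) -> tuple[int, bool]:
    """Run one episode and return (final configuration, aborted?)."""
    rng = rng or random.Random()
    saved_config = config              # configuration of state s1, kept for a reset
    previous = measure(config)
    for _ in range(max_steps):
        if rng.random() < epsilon:
            delta = rng.choice([-1, 1])    # exploration: random small change
        else:
            delta = greedy_step(config)    # exploitation: follow the policy
        config += delta
        current = measure(config)
        if current < critical_limit:
            return saved_config, True      # critical: abort and restore s1
        if current < previous:
            epsilon = 0.0                  # deterioration: no further exploration
        previous = current
    return config, False                   # the policy update would follow here


# Toy stand-in for the real system: the target variable peaks at configuration 5.
def measure(c: int) -> float:
    return -abs(c - 5)


def towards_optimum(c: int) -> int:
    return (c < 5) - (c > 5)


print(run_defensive_episode(2, measure, towards_optimum, critical_limit=-10.0, epsilon=0.0))
```

With `epsilon=0.0` the toy run is deterministic and ends at the optimum, `(5, False)`; a policy that keeps moving away from the optimum triggers the abort branch and returns the saved configuration.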

The interaction of the individual agents can in principle follow known methods, such as those described in the publications mentioned above. However, a special procedure can also be used here to limit the effects of adjustments. The agents are executed in a specific sequence, with components with a high potential for change (i.e. with the greatest impact on themselves and on the overall system) being called first and those with low impact last. So that the overall system does not end up in an undesirable state due to mutually reinforcing negative changes, the execution of the steps of the first agent is repeated if the target variables have deteriorated. The same applies to the subsequent agents.
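The calling sequence described above can be sketched as a loop over agents sorted by their estimated influence. The influence scores, the `run_episode` callback and the cap `max_repeats` are assumptions for illustration; the text neither quantifies influence nor limits the number of repetitions.

```python
from typing import Callable


def run_agents_in_order(
    influence: dict[str, float],
    run_episode: Callable[[str], float],
    baseline: float,
    max_repeats: int = 3,
) -> list[str]:
    """Call agents from highest to lowest influence; repeat an agent whose episode worsened the target."""
    call_log: list[str] = []
    for name in sorted(influence, key=influence.get, reverse=True):
        for _ in range(max_repeats):
            call_log.append(name)
            result = run_episode(name)     # target variable after the episode (higher is better)
            if result >= baseline:
                baseline = result          # accept and move on to the next agent
                break
    return call_log


# Hypothetical influence scores and episode outcomes for three components.
influence = {"NCK": 0.3, "CAP": 0.9, "COS": 0.5}
outcomes = {"CAP": iter([0.8, 1.1]), "COS": iter([1.2]), "NCK": iter([1.2])}
print(run_agents_in_order(influence, lambda name: next(outcomes[name]), baseline=1.0))
# ['CAP', 'CAP', 'COS', 'NCK']
```

In the toy run the first episode of the CAP agent worsens the target variable, so that agent is called a second time before the lower-impact agents get their turn.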

Overall, the method thus comprises the following steps:

1. The system is configured according to the training mode. Two different alternatives are offered for this: a) Mode 1, i.e. offensive training, for quickly learning new situations: the range of values and the step size of the parameters are limited so that only uncritical changes are possible with one action. The specification for this is made explicitly by the user or analogously using pre-trained models. The epsilon-greedy value ε is set to a higher value, which results in a desired (larger) exploration.

Or b) mode 2 is used, a defensive training with continuous learning: the range of values and the step size of the parameters are limited in such a way that the changes do not significantly worsen the target values, and the epsilon-greedy value ε is set to a lower value, e.g. 10%.

2. The agent A, A1, A2 of the component with the (presumably) greatest influence is called first, starting from the initial state s1. If no information about the influence of the components is available, the components can be called in a defined sequence. This is then determined, for example, based on empirical values or results from earlier training phases or episodes. An example of this would be the use of fewer CPU cores, which for single-core applications has a smaller impact than reducing the main memory.

3. After the first action a_i of the agent A, A1, A2, the changes in the measured variables G1, G2, ..., I1, I2, ... are evaluated in the new state s2.

A distinction is made between three cases: a) Improvement of the values: carry out the next action a_i (30). b) Worsening of the values: the epsilon-greedy value ε is set to zero, then the next action is carried out until the end of the episode in the final state sn (40). c) Critical deterioration, usually with a negative influence on real-time behavior: abort and transfer of the system to the initial state s1, continue with step 2 (50).

4. In the first two cases (3a and 3b) the actions are carried out up to the end of the episode. The strategy (policy) of the first agent is then updated.

5. Steps 2-4 are then carried out for all agents.
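The case distinction in step 3 can be illustrated with a small classification function. This sketch assumes measured variables for which lower values are better (e.g. load or latency) and a separate table of critical limits; the names `Outcome` and `classify` and all variable names and numbers are hypothetical.

```python
from enum import Enum


class Outcome(Enum):
    CONTINUE = "3a"         # constant or improved: carry out the next action
    STOP_EXPLORING = "3b"   # worsened: set epsilon to zero, finish the episode
    ABORT = "3c"            # critical: restore the initial state s1


def classify(
    previous: dict[str, float],
    current: dict[str, float],
    critical: dict[str, float],
) -> Outcome:
    """Compare measured variables before and after an action (lower is better here)."""
    if any(current[key] > critical[key] for key in critical):
        return Outcome.ABORT
    if any(current[key] > previous[key] for key in previous):
        return Outcome.STOP_EXPLORING
    return Outcome.CONTINUE


# Hypothetical overall (G) and internal (I) measured variables.
before = {"G1_cpu_load": 0.60, "I1_latency_ms": 4.0}
limits = {"I1_latency_ms": 10.0}
print(classify(before, {"G1_cpu_load": 0.55, "I1_latency_ms": 4.0}, limits))   # Outcome.CONTINUE
print(classify(before, {"G1_cpu_load": 0.70, "I1_latency_ms": 4.0}, limits))   # Outcome.STOP_EXPLORING
print(classify(before, {"G1_cpu_load": 0.50, "I1_latency_ms": 12.0}, limits))  # Outcome.ABORT
```

The critical check comes first so that a violated real-time limit always leads to an abort, even if other variables improved.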

The special method described above advantageously makes it possible to improve, through an online training process, the behavior of systems that consist of several (software) components and, for example, control (production) processes, and thus to adapt them to changed requirements or applications without significantly impairing production or even damaging machines or workpieces. This is achieved through the special modification of the reinforcement learning process.

The proposed online training of reinforcement learning agents is carried out in a (more or less) defensive mode so that it is possible in the real system. This defensive training strategy ensures that the machines and / or products concerned are not exposed to any negative influences.

Uncertain influences can be taken into account (e.g. temperature fluctuations), which are often neglected in a simulation. Furthermore, it is also not necessary to create a complex simulation beforehand for training the system, which then deviates further and further from the system to be simulated during the course of the training. The use of training data can therefore also be dispensed with, since the actual system data can be used during the training unit.

The initiation of the training phase and the provision of the new strategy (policy) can be carried out automatically; no manual triggering by the user is necessary. The switch to online training can take place automatically, for example by adjusting the epsilon greedy value.

Another advantage of the method is that the agent can be adapted to a changing environment during operation. So far, this has meant that the simulation has to be adjusted at great expense and, if necessary, the training has to be restarted.

The proposed method advantageously offers two training modes: either fast learning with frequently suboptimal (but always uncritical) settings or slow learning with seldom suboptimal (but uncritical) settings. The use of machine learning processes (here specifically reinforcement learning) enables the dynamic adaptation of numerous parameters or a better allocation of resources in order to be prepared for the future requirements of the digitization of industrial production.

In particular, the specific runtime adjustments that take place on site, depending on the specific machine tool, the respective product and the respective phase of production, enable the customer to increase productivity, to identify problems in the system (communication, network, ...) more quickly and thus to achieve generally better control of the entire manufacturing process.

Claims

1. Method for configuring components (CAP, NCK) in a system with factors influencing the components, wherein the components are in an operative relationship to one another, and wherein the status of the components can be determined by collecting internal measured variables (I1, I2, I3) and the state of the system can be determined by collecting measured variables of the overall system (G1, G2, G3, Gi), using the means of a reinforcement learning system based on at least one agent (A, A1, A2) and information on the associated environment (E, E1, E2, ..), at runtime of the system, the components in the system initially being set in a training mode in order to start a training phase consisting of episodes in a first state (s1), with the following steps: a) at least one agent (A, A1, A2) of an associated component (CAP, NCK) with a strategy is called, b) after the first training action of the agent (A, A1, A2), the status of the components and / or the overall system is reassessed by again collecting the measured variables (G1, G2, G3, Gi, I1, I2, I3), in order to then carry out one of the following steps depending on the result: c1) if the measured variables are constant or improved: carry out the next action (30), c2) if the measured variables deteriorate: set the epsilon-greedy value for the training to zero, carry out the next action (40), c3) in the event of a critical deterioration in the measured variables, the training is aborted and the system is transferred to the starting state (s1), continue with step a (50), d) in cases c1 and c2, repeat steps b) and c) (s2, s3, sn) until the episode is completed, e) update the strategy of the agent (A, A1, A2), f) repeat steps a to e for the same or a next agent (A, A1, A2).

2. A method for configuring components according to patent claim 1, characterized in that the agents (A, A1, A2) work together as a multi-agent reinforcement learning system.

3. A method for configuring components according to patent claim 1 or 2, characterized in that at least some agents (A, A1, A2) present in the system are combined in an AI-RL engine (ARE), which is located outside the system.

4. A method for configuring components according to one of the preceding claims, characterized in that the components (CAP, NCK) are already preconfigured by means of another method before the method is carried out.

5. The method according to any one of the preceding claims, characterized in that the action of step b has two different characteristics:

- In a first version, a strong restriction of the value range and step size of the parameters means that only uncritical changes are possible with one action, and the epsilon-greedy value ε is set to a value >= 10%,

- In a second version, the range of values and the step size of the parameters are restricted so that changes do not significantly worsen the target values, and the epsilon-greedy value ε is set to a value <= 10%.

6. The method according to any one of the preceding claims, characterized in that in step a) of the method the agent (A, A1, A2) with the greatest influence is called first.

7. The method according to any one of the preceding claims, characterized in that the agent (A, A1, A2) selects the parameters, taking into account the associated environmental values (E, E1, E2), from its defined area of action, taking into account the restriction to avoid damage.

8. Computer-readable storage medium which has stored instructions which, when executed on at least one computer, serve to configure components (CAP, NCK) in a system with factors influencing the components, the components being in an operative relationship to one another, wherein the state of the components is determined by collecting internal measured variables (I1, I2, I3), and the state of the system is determined by collecting measured variables of the overall system (G1, G2, G3, Gi), using the means of a reinforcement learning system based on at least one agent (A, A1, A2) and information on the associated environment (E, E1, E2, ..), at runtime of the system, the components in the system initially being set in a training mode in order to start a training phase consisting of episodes in a first state (s1), the computer being caused to carry out the following steps: a) at least one agent (A, A1, A2) of an associated component (CAP, NCK) with a strategy is called, b) after the first training action of the agent (A, A1, A2), the status of the components and / or the overall system is reassessed by again collecting the measured variables (G1, G2, G3, Gi, I1, I2, I3), in order to then carry out one of the following steps depending on the result: c1) if the measured variables are constant or improved: carry out the next action (30), c2) in the event of a worsening of the measured variables: set the epsilon-greedy value for the training to zero, carry out the next action (40), c3) in the event of a critical worsening of the measured variables, especially with real-time behavior, the training is terminated and the system is returned to its initial state (s1), continue with step a (50), d) in cases c1 and c2, repeat steps b) and c) (s2, s3, sn) until the episode is complete, e) update the strategy of the agent (A, A1, A2), f) repeat steps a to e for the next agent (A, A1, A2).

9. Computer-readable storage medium according to claim 8, characterized in that the agents (A, A1, A2) work together as a multi-agent reinforcement learning system.

10. Computer-readable storage medium according to claim 8 or 9, characterized in that all agents (A, A1, A2) present in the system are combined in an AI-RL engine (ARE) which is located outside the system.

11. Computer-readable storage medium according to one of the preceding claims 8 to 10, characterized in that the components (CAP, NCK) are already preconfigured by means of another method before the method is carried out.

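12. Computer-readable storage medium according to one of the preceding claims 8 to 11, characterized in that when executed by a computer, the action of step b has two different characteristics:

- In a first version, a strong restriction of the value range and step size of the parameters means that only uncritical changes are possible with one action, and the epsilon-greedy value ε is set to a value >= 10%,

- In a second version, the range of values and the step size of the parameters are restricted so that changes do not significantly worsen the target values, and the epsilon-greedy value ε is set to a value <= 10%.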

13. Computer-readable storage medium according to one of the preceding claims 8 to 12, characterized in that in step a) of the method the agent (A, A1, A2) with the greatest influence is called first.

14. The method according to any one of the preceding claims 8 to 13, characterized in that the agent (A, A1, A2) selects the parameters, taking into account the associated environmental values (E, E1, E2), from its defined range of action, taking into account the restriction to avoid damage.

15. System consisting of at least one computer for the configuration of components (CAP, NCK) with factors influencing the components, wherein the components are in an operative relationship to one another, the status of the components can be determined by collecting internal measured variables (I1, I2, I3), and the state of the system can be determined, using the means of a reinforcement learning system based on at least one agent (A, A1, A2) and information on the associated environment (E, E1, E2, ..), at the runtime of the system, the components in the system initially being set in a training mode in order to start a training phase in a first state (s1), with the following steps: a) at least one agent (A, A1, A2) of an associated component (CAP, NCK) is called, b) after the first training action of the agent (A, A1, A2), the status of the components and / or the overall system is re-evaluated by again collecting the measured variables (G1, G2, G3, Gi, I1, I2, I3), in order to then carry out one of the following steps depending on the result: c1) if the measured variables are constant or improved: carry out the next action (30), c2) in the event of a deterioration in the measured variables: set the epsilon-greedy value for the training to zero, carry out the next action (40), c3) in the event of a critical deterioration in the measured variables, especially with real-time behavior, the training is aborted and the system is transferred to the initial state (s1), continue with step a (50), d) in cases c1 and c2, repeat steps b) and c) (s2, s3, sn) until the episode is completed, e) update the strategy of the agent (A, A1, A2), f) repeat steps a to e for the same or a next agent (A, A1, A2).