US11628562B2 - Method, device and computer program for producing a strategy for a robot - Google Patents

Method, device and computer program for producing a strategy for a robot

Info

Publication number
US11628562B2
Authority
US
United States
Prior art keywords
strategy
robot
neural network
episode
loop
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active, expires
Application number
US16/921,906
Other versions
US20210008718A1 (en)
Inventor
Frank Hutter
Lior Fuks
Marius Lindauer
Noor Awad
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Robert Bosch GmbH
Original Assignee
Robert Bosch GmbH
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Robert Bosch GmbH filed Critical Robert Bosch GmbH
Publication of US20210008718A1
Assigned to ROBERT BOSCH GMBH. Assignment of assignors interest (see document for details). Assignors: Lindauer, Marius; Hutter, Frank; Fuks, Lior; Awad, Noor
Application granted
Publication of US11628562B2
Legal status: Active
Adjusted expiration

Classifications

    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00Programme-controlled manipulators
    • B25J9/16Programme controls
    • B25J9/1628Programme controls characterised by the control loop
    • B25J9/163Programme controls characterised by the control loop learning, adaptive, model based, rule based expert control
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00Programme-controlled manipulators
    • B25J9/16Programme controls
    • B25J9/1602Programme controls characterised by the control system, structure, architecture
    • B25J9/161Hardware, e.g. neural networks, fuzzy logic, interfaces, processor
    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05BCONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
    • G05B13/0265Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric the criterion being a learning criterion
    • G05B13/027Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric the criterion being a learning criterion using neural networks only
    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05BCONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B17/00Systems involving the use of models or simulators of said systems
    • G05B17/02Systems involving the use of models or simulators of said systems electric
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/086Learning methods using evolutionary algorithms, e.g. genetic algorithms or genetic programming
    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05BCONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B2219/00Program-control systems
    • G05B2219/30Nc systems
    • G05B2219/40Robotics, robotics mapping to robotics vision
    • G05B2219/40499Reinforcement learning algorithm

Abstract

A method for producing a strategy for a robot. The method includes the following steps: initializing the strategy and an episode length; repeatedly executing a loop including the following steps: producing a plurality of further strategies as a function of the strategy; applying the plurality of further strategies for episodes having the episode length; ascertaining, for each further strategy, a cumulative reward obtained when applying that strategy; and updating the strategy as a function of a second plurality of the further strategies that obtained the greatest cumulative rewards. After each execution of the loop, the episode length is increased. A computer program, a device for carrying out the method, and a machine-readable memory element on which the computer program is stored are also described.

Description

CROSS REFERENCE
The present application claims the benefit under 35 U.S.C. § 119 of German Patent Application No. DE 102019210372.3 filed on Jul. 12, 2019, which is expressly incorporated herein by reference in its entirety.
FIELD
The present invention relates to a method for producing a strategy so that a specifiable goal is achieved when a robot, in a particular situation, performs actions on the basis of the strategy. The present invention also relates to a device and to a computer program, which are designed to implement the method.
BACKGROUND INFORMATION
In their paper “Back to Basics: Benchmarking Canonical Evolution Strategies for Playing Atari,” arXiv preprint arXiv:1802.08842 (2018), Chrabaszcz et al. describe an evolution strategy (ES) as an alternative to reinforcement learning.
SUMMARY
It was observed that some strategies enable agents, in particular robots, to solve complex tasks, yet fail on simple subtasks. So that robots are reliably controlled in all cases, a method is presented below which makes it possible to produce, in a simple manner, a strategy that may be used for reliably controlling robots. Furthermore, the strategy may be extended to more complex tasks in a simple manner.
In a first aspect of the present invention, a method, in particular a computer-implemented method, for producing a strategy (i.e., a policy) is provided so that, if an agent, in particular a robot, performs actions on the basis of the strategy in a particular situation, a specifiable goal is achieved or a task is performed. The method begins with an initialization of the strategy θ0 and of an episode length E. This is followed by a repeated execution of a loop, expediently a (computer) program loop, including the steps explained below. A loop is a control structure in a programming language which repeats an instruction block for as long as a loop condition remains valid or until an abort condition is fulfilled.
The loop begins with a production of a plurality of further strategies as a function of the strategy θ0. The further strategies may be produced by applying a randomly chosen variable to the strategy, e.g., random noise added to its parameters. This is followed by an application of each of the further strategies for at least one episode having the episode length E. If the strategy or the environment of the agent has probabilistic properties, then the further strategies may be applied for multiple episodes. This is followed by an ascertainment of one cumulative reward F_E per further strategy, which is obtained when applying that strategy, and by an update of the strategy θ0 as a function of a second plurality of the further strategies that attained the greatest cumulative rewards. The second plurality is a specifiable number, the specifiable number being smaller than the number of all further strategies. After each execution of all steps of the loop, the episode length E is increased.
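Purely as an illustration, the following Python sketch shows one possible reading of this loop. The Gaussian perturbation, the callable name evaluate_policy, the mean-of-the-best update and the doubling of the episode length after every pass are assumptions chosen for the example, not the pseudocode of the figures.

    import numpy as np

    def produce_strategy(theta_0, evaluate_policy, episode_length,
                         num_further=16, num_top=4, sigma=0.05, num_loop_passes=5):
        # theta_0         : initial strategy parameters as a 1-D array
        # evaluate_policy : callable (theta, episode_length) -> cumulative reward F_E
        theta = np.asarray(theta_0, dtype=float)
        for _ in range(num_loop_passes):
            # Produce a plurality of further strategies by perturbing the strategy.
            further = [theta + sigma * np.random.randn(theta.size)
                       for _ in range(num_further)]
            # Apply each further strategy for one episode of the current length
            # and ascertain its cumulative reward F_E.
            rewards = [evaluate_policy(candidate, episode_length) for candidate in further]
            # Update the strategy from the num_top strategies with the greatest
            # rewards (here simply their mean).
            best = np.argsort(rewards)[::-1][:num_top]
            theta = np.mean([further[i] for i in best], axis=0)
            # Increase the episode length after each execution of the loop.
            episode_length *= 2
        return theta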
An application of the strategy may be understood as this strategy being used by an agent, in particular the robot, which performs actions as a function of the strategy, e.g., in order to explore its environment, or to achieve its goal. When applying the strategy, an action of the agent is ascertained on the basis of the strategy as a function of a current state of the environment of the agent.
The performance of the action by the agent results in a modification of the environment. This modification may be tied to a reward. Alternatively or additionally, the reward may be a function of the action. The cumulative reward is then the sum of the rewards of all actions within an episode. The episode is a sequence of actions and the episode length is a number of the actions in this episode.
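Written out with an assumed shorthand r_t for the reward received after the t-th action, the cumulative reward of an episode of length E is:

    F_E = r_1 + r_2 + … + r_E = Σ_{t=1}^{E} r_t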
An advantage is that the strategy first learns to solve short and simple tasks, from which initial knowledge is acquired. This knowledge is then used to solve more demanding tasks as the episode length increases. A transfer of the knowledge gained on simple tasks to more complex tasks is thereby achieved. Another advantage of focusing on simpler and shorter tasks at the beginning of the method is that a more stable and quicker optimization of the strategy is achieved. Furthermore, because of the shortened episodes at the beginning, only a segment of the environment is explored. This allows a simple strategy to be learned, which may also be applied with promising results to the entire environment, ultimately resulting in a better generalization of the strategy. Furthermore, the shortened episodes make it possible to evaluate multiple strategies within a specifiable time budget, which allows for quicker learning.
The present invention provides for the episode length E to be initially set to a value smaller than the expected number of actions for reaching the specifiable goal. The episode length E may furthermore be set to a value such that a reward may be received or a partial goal may be reached for the first time. It is also possible for the number of actions to be set as a function of the maximally obtainable reward, and in particular as a function of the individual rewards obtainable through the actions. The expected number of actions is preferably divided by a specifiable constant, whereby a more aggressive exploration may be set.
It is further provided that the expected number of actions is ascertained by a Monte Carlo simulation. A Monte Carlo simulation is to be understood here as the agent being controlled, in turn, by several randomly initialized strategies. The episode length may then be selected as a function of whether the goal is reached and/or as a function of the progress of the agent and/or as a function of the cumulative reward obtained.
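A minimal sketch of such a Monte Carlo estimate is given below, assuming a hypothetical callable run_random_strategy that controls the agent with a randomly initialized strategy and returns the number of actions it needed to reach the goal (capped at max_steps if the goal was not reached):

    import numpy as np

    def initial_episode_length(run_random_strategy, num_rollouts=20,
                               max_steps=1000, divisor=2):
        # Control the agent with several randomly initialized strategies and
        # record how many actions each one needed.
        steps = [run_random_strategy(max_steps) for _ in range(num_rollouts)]
        expected_actions = float(np.mean(steps))
        # Start below the expected number of actions; dividing by a specifiable
        # constant yields a more aggressive exploration.
        return max(1, int(expected_actions // divisor))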
It is furthermore provided that a time budget T is additionally initialized. The loop iterations are executed only for as long as time remains in the time budget T. The time budget T may either be constant across all loop passes or be increased, in particular doubled, after each loop pass. The time budget is the time that is available for applying the further strategies and for updating the initialized strategy; it is thus a possible abort condition of the loop. The time budget is a physical time, which may be measured, e.g., by a stopwatch. Additionally or alternatively, the time budget may be specified by a timer, which is preferably integrated in a processing unit on which the method is carried out.
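As a small illustration, the abort condition based on a physical time budget may be checked as follows; the callable one_loop_pass is an assumed stand-in for one pass of the loop described above:

    import time

    def run_with_time_budget(one_loop_pass, budget_seconds):
        # Execute loop passes only for as long as time remains in the budget T.
        start = time.monotonic()
        while time.monotonic() - start < budget_seconds:
            one_loop_pass()  # apply the further strategies and update the strategy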
It is furthermore provided that the current state of the robot and/or a current state of the environment of the robot are detected by a sensor and that the produced strategy is used to ascertain a control variable for the robot as a function of the sensor value.
It should be noted that the strategy may be produced and used not only for controlling the robot, but also for controlling an at least partially autonomous machine, an at least partially autonomous vehicle, a tool, a machine tool or a flying object such as a drone.
A further aspect of the present invention provides for the use of a trained neural network in order to provide a control signal for controlling the robot as a function of an ascertained output signal, the strategy produced according to the first aspect being implemented by the neural network. The output signal corresponds to the action ascertained by the produced strategy. The produced strategy preferably characterizes the parameterization of the neural network.
In another aspect of the present invention, a computer program is provided. The computer program is designed to carry out one of the above-mentioned methods. The computer program comprises instructions that prompt a computer to carry out one of these indicated methods including all its steps when the computer program is running on the computer. Furthermore, a machine-readable memory module is provided, on which the computer program is stored. Furthermore, a device is provided that is designed to carry out one of the methods.
Exemplary embodiments of the above-mentioned aspects are illustrated in the figures and are explained in greater detail below.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 shows a schematic representation of a robot.
FIG. 2 shows a schematic representation of a first pseudocode.
FIG. 3 shows a schematic representation of a second pseudocode.
FIG. 4 shows a schematic representation of a device for executing the pseudocode.
DETAILED DESCRIPTION OF EXAMPLE EMBODIMENTS
FIG. 1 shows a schematic representation of a robot (10). The robot (10) is designed to autonomously learn a strategy (i.e., a policy) by exploring, expediently by interacting with, its environment (11). Depending on the strategy and a detected sensor variable (x), a decision module (14), comprising the strategy, ascertains an optimal action (a). In one exemplary embodiment, the strategy is stored in a memory P in the form of parameters (θ) of a neural network. The decision module (14) comprises this neural network, which ascertains the action (a) as a function of the detected sensor variable (x). The architecture of this neural network may be, for example, the architecture described in the related art document cited at the outset. The sensor variable (x) is detected by a sensor (13). For this purpose, the sensor detects a state (12) of the environment (11) of the robot (10). An actuator (15) of the robot (10) may be controlled on the basis of the action (a). As a result of the actuator (15) performing the action (a), the state (16) of the environment (11) changes. The performance of the action (a) may serve to explore the environment (11), to solve the specifiable task, or to reach a specifiable goal.
The robot (10) further comprises a processing unit (17) and a machine-readable memory element (18). A computer program may be stored on the memory element (18), comprising commands which, when executed on the processing unit (17), prompt the processing unit (17) to operate the robot (10).
It should be noted that the robot (10) may also be an at least partially autonomous vehicle, a drone or a production/machine tool.
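For illustration, one deployment step of the control loop sketched in FIG. 1 could look as follows; the callables policy_network, read_sensor and actuate are assumed placeholders for the decision module (14), the sensor (13) and the actuator (15):

    def control_step(policy_network, read_sensor, actuate):
        # One control step of the robot (10); all names are illustrative.
        x = read_sensor()      # sensor (13) detects the state (12) of the environment (11)
        a = policy_network(x)  # decision module (14) ascertains the action from the strategy
        actuate(a)             # actuator (15) performs the action; the state (16) changes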
FIG. 2 shows in exemplary fashion a pseudocode of a method “canonical evolution strategy (ES)” for producing the strategy for the robot (10).
At the beginning of the pseudocode, it is necessary to specify an initial strategy θ0, a time budget T, a maximum episode length E, a population variable λ, a parent population variable μ, a mutation step variable σ and a cumulative reward function F(⋅). The initial strategy θ0 is preferably a variable comprising the parameters of the neural network. The initial strategy may be initialized randomly.
At the beginning of the pseudocode, in lines 1 and 2, a first loop is executed over the parent population variable μ in order to ascertain the constants w_j.
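FIG. 2 itself is not reproduced here, so the exact definition of the constants w_j is not visible; a common choice in canonical-ES implementations, and therefore only an assumption for illustration, is a normalized log-rank weighting:

    import numpy as np

    def rank_weights(mu):
        # Assumed log-rank weights w_1 >= ... >= w_mu, normalized to sum to one.
        w = np.log(mu + 0.5) - np.log(np.arange(1, mu + 1))
        return w / w.sum()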
Subsequently, the strategy is optimized by a second loop in lines 4 through 11.
The second loop is executed until the time budget T is depleted. In the second loop, the initialized strategy θ0 is mutated by applying, e.g., random noise. Thereupon, in line 7, the performance of the mutated strategies is evaluated using the cumulative reward function F. The cumulative reward function F may be a cumulative reward over an episode having an episode length E.
In line 9, the strategies are then arranged in descending order according to their obtained cumulative reward s_i. In the subsequent line 10, the strategy is updated as a function of the top μ strategies, which are respectively weighted with the constants w_j.
The updated strategy may thereupon be output as the final strategy, or it may be used to execute the second loop anew. The renewed execution of the second loop may be repeated as often as necessary until a specifiable abort criterion is fulfilled. The specifiable abort criterion may be, for example, that a change of the strategy is smaller than a specifiable threshold value.
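The following Python sketch mirrors the loop just described (mutation, evaluation with F, sorting in descending order, weighted update with the top μ strategies). It is an illustrative reconstruction under the assumptions stated above, not the pseudocode of FIG. 2 itself:

    import time
    import numpy as np

    def canonical_es(theta_0, F, T, E, w, lam=16, sigma=0.05):
        # F : callable (theta, E) -> cumulative reward over an episode of length E
        # T : time budget in seconds; w : weights w_j for the top strategies
        theta = np.asarray(theta_0, dtype=float)
        mu = len(w)
        start = time.monotonic()
        while time.monotonic() - start < T:         # until the time budget T is depleted
            noise = [np.random.randn(theta.size) for _ in range(lam)]
            candidates = [theta + sigma * eps for eps in noise]  # mutate the strategy
            scores = [F(c, E) for c in candidates]               # evaluate with F
            order = np.argsort(scores)[::-1]                     # descending order
            # update with the top mu strategies, weighted with the constants w_j
            theta = theta + sigma * sum(w[j] * noise[order[j]] for j in range(mu))
        return theta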
FIG. 3 shows by way of example a pseudocode of a method to adapt time budget T and episode length E dynamically during the implementation of the ES.
For this purpose, an episode scheduler, a time scheduler and a number of iterations N are initially provided.
In line 1 of the second pseudoalgorithm, the strategy θ0 is initialized by sampling from a normal distribution. Subsequently, a loop is executed beginning at line 2 through line 6 over the number of iterations N. First, the maximum episode length E is ascertained by the episode scheduler and optionally the maximum time budget T is ascertained by the time scheduler as a function of the current iteration n. Subsequently, the method ES is carried out using these two ascertained variables E and/or T.
Following each executed loop pass, the episode scheduler may double the episode length E: E(n) = 2^n · E(0). The initial episode length E(0) may be a value smaller than an expected number of steps required for reaching the goal. Alternatively, the initial episode length E(0) may be obtained by dividing this expected number of steps by a specifiable value, for example 2. Alternatively, the initial episode length E(0) may be ascertained by a Monte Carlo simulation.
The time scheduler may increase the time budget T incrementally with the increasing number of executed loop passes, for example: T(n) = 2^n · κ. The value κ may correspond to 20 minutes, for example. Alternatively, the time scheduler may keep the time budget T constant for every loop pass, it being possible for T to equal 1 hour, for example.
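A sketch of the FIG. 3 wrapper under these doubling schedules might look as follows; run_es is an assumed callable standing in for the ES routine run with the given time budget and episode length (for example, a wrapper around the ES sketch above with F and the weights fixed):

    def scheduled_es(run_es, theta_0, E0, kappa_seconds, N):
        # run_es : callable (theta, T, E) -> updated theta
        theta = theta_0
        for n in range(N):
            E = (2 ** n) * E0             # episode scheduler: E(n) = 2^n * E(0)
            T = (2 ** n) * kappa_seconds  # time scheduler:    T(n) = 2^n * kappa
            theta = run_es(theta, T, E)
        return theta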
The advantage of the episode scheduler and/or of the time scheduler is that a strategy is first learned in short episodes, which is subsequently used to solve more complex tasks more effectively in longer episodes. This is because the knowledge of the strategy learned in the short episodes may be reused for solving the longer episodes. The advantage of the time scheduler is that an available total time budget may be efficiently divided into partial times for the individual episode lengths.
FIG. 4 shows a schematic representation of a device (40) for training the decision module (14), in particular for executing the pseudocode in accordance with FIG. 2 or FIG. 3. The device (40) comprises a training module (41), which simulates, e.g., the environment (11) and outputs the cumulative reward F. The adaptation module (43) then updates the strategy and stores the updated strategy in the memory P.

Claims (17)

What is claimed is:
1. A method of training a neural network, comprising:
producing parameters of the neural network representing a strategy to control a robot so that a specifiable goal is reached when the robot performs actions based on the strategy, depending on a respective situation, the producing including:
initializing the strategy and an episode length;
repeatedly executing a loop including the steps:
producing a plurality of further strategies as a function of the strategy;
applying the plurality of the further strategies for a respective at least one episode having the episode length;
ascertaining, for each of the further strategies, a respective cumulative reward which is obtained when applying the respective further strategy;
updating the strategy as a function of a specifiable number of the further strategies that obtained the greatest respective cumulative rewards;
wherein the episode length is increased following each execution of the loop; and
storing the parameters representing the strategy in a memory connected to the neural network.
2. The method as recited in claim 1, wherein a time budget is initialized, the loop being executed only for as long as time remains in the time budget.
3. The method as recited in claim 2, wherein the time budget is increased following every execution of the loop.
4. The method as recited in claim 1, wherein the episode length is initially set to a value smaller than an expected number of actions for reaching the specifiable goal.
5. The method as recited in claim 4, wherein the expected number of actions is ascertained by a Monte Carlo simulation.
6. The method as recited in claim 1, wherein the further strategies are sorted in descending order according to the respective cumulative reward and are respectively weighted using a second specifiable value assigned to a respective position in the order.
7. The method as recited in claim 1, wherein a current state of the robot, and/or a current state of an environment of the robot is detected using a sensor, a control variable being provided for an actuator of the robot, as a function of the sensor value using the updated strategy.
8. A method, comprising:
producing parameters of a neural network representing a strategy for controlling a robot so that a specifiable goal is reached when the robot performs actions based on the strategy, depending on a respective situation, the parameters of the neural network being produced by:
initializing the strategy and an episode length;
repeatedly executing a loop including the steps:
producing a plurality of further strategies as a function of the strategy;
applying the plurality of the further strategies for a respective at least one episode having the episode length;
ascertaining, for each of the further strategies, a respective cumulative reward which is obtained when applying the respective further strategy;
updating the strategy as a function of a specifiable number of the further strategies that obtained the greatest respective cumulative rewards;
wherein the episode length is increased following each execution of the loop;
storing the parameters representing the strategy in a memory connected to the neural network; and
operating the robot using the neural network to activate an actuator of the robot to provide an action corresponding to the produced strategy as a function of a current state of the robot and/or a current state of an environment of the robot sensed by a sensor and provided to the neural network, the produced strategy being implemented by the neural network in that the neural network provides the action corresponding to the produced strategy from a state provided to the neural network.
9. A non-transitory machine-readable memory element on which is stored a computer program, which when executed by a computer causes the computer to perform a method of training a neural network, the method comprising:
producing parameters of the neural network representing a strategy to control a robot so that a specifiable goal is reached when the robot performs actions based on the strategy, depending on a respective situation, the producing including:
initializing the strategy and an episode length;
repeatedly executing a loop including the steps:
producing a plurality of further strategies as a function of the strategy;
applying the plurality of the further strategies for a respective at least one episode having the episode length;
ascertaining, for each of the further strategies, a respective cumulative reward which is obtained when applying the respective further strategy;
updating the strategy as a function of a specifiable number of the further strategies that obtained the greatest respective cumulative rewards;
wherein the episode length is increased following each execution of the loop; and
storing the parameters representing the strategy in a memory connected to the neural network.
10. A device, the device comprising:
a processing unit configured to execute computer program instructions to control a method of training a neural network, the method including:
producing parameters of the neural network representing a strategy to control a robot so that a specifiable goal is reached when the robot performs actions based on the strategy, depending on a respective situation, the producing including:
initializing the strategy and an episode length;
repeatedly executing a loop including:
producing a plurality of further strategies as a function of the strategy;
applying the plurality of the further strategies for a respective at least one episode having the episode length;
ascertaining, for each of the further strategies, a respective cumulative reward which is obtained when applying the respective further strategy; and
updating the strategy as a function of a specifiable number of the further strategies that obtained the greatest respective cumulative rewards;
wherein the episode length is increased following each execution of the loop; and
storing the parameters representing the strategy in a memory connected to the neural network.
11. The method as recited in claim 8, wherein a time budget is initialized, the loop being executed only for as long as time remains in the time budget.
12. The method as recited in claim 11, wherein the time budget is increased following every execution of the loop.
13. The method as recited in claim 8, wherein the episode length is initially set to a value smaller than an expected number of actions for reaching the specifiable goal.
14. The method as recited in claim 13, wherein the expected number of actions is ascertained by a Monte Carlo simulation.
15. The method as recited in claim 8, wherein the further strategies are sorted in descending order according to the respective cumulative reward and are respectively weighted using a second specifiable value assigned to a respective position in the order.
16. The method as recited in claim 8, wherein the robot includes at least one of: an at least partially autonomous vehicle, a drone, a production tool, or a machine tool.
17. The method as recited in claim 1, wherein the robot includes at least one of: an at least partially autonomous vehicle, a drone, a production tool, or a machine tool.
US16/921,906 2019-07-12 2020-07-06 Method, device and computer program for producing a strategy for a robot Active 2041-07-16 US11628562B2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
DE102019210372.3A DE102019210372A1 (en) 2019-07-12 2019-07-12 Method, device and computer program for creating a strategy for a robot
DE102019210372.3 2019-07-12

Publications (2)

Publication Number Publication Date
US20210008718A1 US20210008718A1 (en) 2021-01-14
US11628562B2 true US11628562B2 (en) 2023-04-18

Family

ID=74059266

Family Applications (1)

Application Number Title Priority Date Filing Date
US16/921,906 Active 2041-07-16 US11628562B2 (en) 2019-07-12 2020-07-06 Method, device and computer program for producing a strategy for a robot

Country Status (3)

Country Link
US (1) US11628562B2 (en)
CN (1) CN112215363A (en)
DE (1) DE102019210372A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220266145A1 (en) * 2021-02-23 2022-08-25 Electronic Arts Inc. Adversarial Reinforcement Learning for Procedural Content Generation and Improved Generalization

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023004593A1 (en) * 2021-07-27 2023-02-02 华为技术有限公司 Method for simulating circuit, medium, program product, and electronic device

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100114807A1 (en) * 2008-11-04 2010-05-06 Honda Motor Co., Ltd. Reinforcement learning system
US20150100530A1 (en) * 2013-10-08 2015-04-09 Google Inc. Methods and apparatus for reinforcement learning
US20170228662A1 (en) * 2016-02-09 2017-08-10 Google Inc. Reinforcement learning using advantage estimates
US20180165603A1 (en) * 2016-12-14 2018-06-14 Microsoft Technology Licensing, Llc Hybrid reward architecture for reinforcement learning
US20190244099A1 (en) * 2018-02-05 2019-08-08 Deepmind Technologies Limited Continual reinforcement learning with a multi-task agent
US20200143206A1 (en) * 2018-11-05 2020-05-07 Royal Bank Of Canada System and method for deep reinforcement learning
US20200320435A1 (en) * 2019-04-08 2020-10-08 Sri International Multi-level introspection framework for explainable reinforcement learning agents
US20200372410A1 (en) * 2019-05-23 2020-11-26 Uber Technologies, Inc. Model based reinforcement learning based on generalized hidden parameter markov decision processes
US20210237266A1 (en) * 2018-06-15 2021-08-05 Google Llc Deep reinforcement learning for robotic manipulation
US20210319362A1 (en) * 2018-07-31 2021-10-14 Secondmind Limited Incentive control for multi-agent systems
US20210383273A1 (en) * 2018-10-12 2021-12-09 Imec Vzw Exploring an unexplored domain by parallel reinforcement
WO2021249616A1 (en) * 2020-06-08 2021-12-16 Siemens Aktiengesellschaft Method for configuring components in a system by means of multi-agent reinforcement learning, computer-readable storage medium, and system
US11222262B2 (en) * 2017-05-30 2022-01-11 Xerox Corporation Non-Markovian control with gated end-to-end memory policy networks

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100114807A1 (en) * 2008-11-04 2010-05-06 Honda Motor Co., Ltd. Reinforcement learning system
US11049008B2 (en) * 2013-10-08 2021-06-29 Deepmind Technologies Limited Reinforcement learning using target neural networks
US20150100530A1 (en) * 2013-10-08 2015-04-09 Google Inc. Methods and apparatus for reinforcement learning
US20170228662A1 (en) * 2016-02-09 2017-08-10 Google Inc. Reinforcement learning using advantage estimates
US20180165603A1 (en) * 2016-12-14 2018-06-14 Microsoft Technology Licensing, Llc Hybrid reward architecture for reinforcement learning
US11222262B2 (en) * 2017-05-30 2022-01-11 Xerox Corporation Non-Markovian control with gated end-to-end memory policy networks
US20190244099A1 (en) * 2018-02-05 2019-08-08 Deepmind Technologies Limited Continual reinforcement learning with a multi-task agent
US20210237266A1 (en) * 2018-06-15 2021-08-05 Google Llc Deep reinforcement learning for robotic manipulation
US20210319362A1 (en) * 2018-07-31 2021-10-14 Secondmind Limited Incentive control for multi-agent systems
US20210383273A1 (en) * 2018-10-12 2021-12-09 Imec Vzw Exploring an unexplored domain by parallel reinforcement
US20200143206A1 (en) * 2018-11-05 2020-05-07 Royal Bank Of Canada System and method for deep reinforcement learning
US20200320435A1 (en) * 2019-04-08 2020-10-08 Sri International Multi-level introspection framework for explainable reinforcement learning agents
US20200372410A1 (en) * 2019-05-23 2020-11-26 Uber Technologies, Inc. Model based reinforcement learning based on generalized hidden parameter markov decision processes
WO2021249616A1 (en) * 2020-06-08 2021-12-16 Siemens Aktiengesellschaft Method for configuring components in a system by means of multi-agent reinforcement learning, computer-readable storage medium, and system

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
Chrabaszcz et al., "Back To Basics: Benchmarking Canonical Evolution Strategies for Playing Atari," Cornell University, 2018, pp. 1-8. https://arxiv.org/pdf/1802.08842.
Florensa, C. et al; "Automatic Goal Generation for Reinforcement Learning Agents", 2018, PMLR, Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, vol. 80, pp. 1515-1528 (Year: 2018). *
Plappert, Matthias, et al. "Multi-Goal Reinforcement Learning: Challenging Robotics Environments and Request for Research", arXiv, Mar. 2018, Machine Learning-Artificial Intelligence-Robotics, https://arxiv.org/abs/1802.09464 (Year: 2018). *
Resnick, et al.: "Backplay: ‘Man muss immer umkehren’", Under Review Conference Paper at ICLR 2019; URL: https://arxiv.org/pdf/1807.06919, Dec. 31, 2018, pp. 1-18.
Salimans and Chen: "Learning Montezuma's Revenge from a Single Demonstration", URL: https://arxiv.org/abs/1812.0338, Dec. 8, 2018, pp. 1-10.
Stulp and Sigaud: "Robot Skill Learning: From Reinforcement Learning to Evolution Strategies", PALADYN Journal of Behavioral Robotics, 4(1) (2013), pp. 49-61.

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220266145A1 (en) * 2021-02-23 2022-08-25 Electronic Arts Inc. Adversarial Reinforcement Learning for Procedural Content Generation and Improved Generalization
US11883746B2 (en) * 2021-02-23 2024-01-30 Electronic Arts Inc. Adversarial reinforcement learning for procedural content generation and improved generalization

Also Published As

Publication number Publication date
US20210008718A1 (en) 2021-01-14
DE102019210372A1 (en) 2021-01-14
CN112215363A (en) 2021-01-12

Similar Documents

Publication Publication Date Title
CN112668235B (en) Robot control method based on off-line model pre-training learning DDPG algorithm
Rajeswaran et al. Epopt: Learning robust neural network policies using model ensembles
US11521056B2 (en) System and methods for intrinsic reward reinforcement learning
JP6824382B2 (en) Training machine learning models for multiple machine learning tasks
Van Seijen et al. True online temporal-difference learning
JP5448841B2 (en) Method for computer-aided closed-loop control and / or open-loop control of technical systems, in particular gas turbines
CN111144580B (en) Hierarchical reinforcement learning training method and device based on imitation learning
US11628562B2 (en) Method, device and computer program for producing a strategy for a robot
Shani et al. Model-based online learning of POMDPs
Yang et al. Control of nonaffine nonlinear discrete-time systems using reinforcement-learning-based linearly parameterized neural networks
US20220176554A1 (en) Method and device for controlling a robot
Yamaguchi et al. Differential dynamic programming with temporally decomposed dynamics
JP2021501433A (en) Generation of control system for target system
Wan et al. Bayesian generational population-based training
Vien et al. Reinforcement learning combined with human feedback in continuous state and action spaces
Kobayashi Adaptive and multiple time-scale eligibility traces for online deep reinforcement learning
EP3992856A1 (en) Method and system for operating a device by using hierarchical reinforcement learning
Battiti et al. An investigation of reinforcement learning for reactive search optimization
Ermis et al. A3DQN: Adaptive Anderson acceleration for deep Q-networks
JP2005078516A (en) Device, method and program for parallel learning
Grigsby et al. Towards automatic actor-critic solutions to continuous control
Morales Deep Reinforcement Learning
Buffet et al. FF+ FPG: Guiding a Policy-Gradient Planner.
JP2021192141A (en) Learning device, learning method, and learning program
CN115668215A (en) Apparatus and method for training parameterized strategy

Legal Events

Date Code Title Description
FEPP Fee payment procedure

Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

AS Assignment

Owner name: ROBERT BOSCH GMBH, GERMANY

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:HUTTER, FRANK;FUKS, LIOR;LINDAUER, MARIUS;AND OTHERS;SIGNING DATES FROM 20210224 TO 20210620;REEL/FRAME:056735/0534

STPP Information on status: patent application and granting procedure in general

Free format text: NON FINAL ACTION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER

STCF Information on status: patent grant

Free format text: PATENTED CASE