EP3625731A1 - Hybrid reward architecture for reinforcement learning - Google Patents

Hybrid reward architecture for reinforcement learning

Info

Publication number
EP3625731A1
EP3625731A1 EP18723249.1A EP18723249A EP3625731A1 EP 3625731 A1 EP3625731 A1 EP 3625731A1 EP 18723249 A EP18723249 A EP 18723249A EP 3625731 A1 EP3625731 A1 EP 3625731A1
Authority
EP
European Patent Office
Prior art keywords
agent
reward
action
agents
learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP18723249.1A
Other languages
German (de)
English (en)
Inventor
Harm Hendrik Van Seijen
Seyed Mehdi FATEMI BOOSHEHRI
Romain Michel Henri Laroche
Joshua Samuel Romoff
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Microsoft Technology Licensing LLC
Original Assignee
Microsoft Technology Licensing LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US15/634,914 (US10977551B2)
Application filed by Microsoft Technology Licensing LLC
Publication of EP3625731A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/004 Artificial life, i.e. computing arrangements simulating life
    • G06N3/006 Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00 Computing arrangements based on specific mathematical models
    • G06N7/01 Probabilistic graphical models, e.g. probabilistic networks

Definitions

  • RL reinforcement learning
  • MDP Markov decision processes
  • a challenge in RL is generalization. In traditional deep RL methods this is achieved by approximating the optimal value function with a low-dimensional representation using a deep network. While this approach works well in some domains, in domains where the optimal value function cannot easily be reduced to a low-dimensional representation, learning can be very slow and unstable.
  • a framework for solving a single-agent task by using multiple agents, each focusing on different aspects of the task, is provided.
  • This approach has at least the following advantages: 1) it allows for specialized agents for different parts of the task, and 2) it provides a new way to transfer knowledge, by transferring trained agents.
  • the framework generalizes the traditional hierarchical decomposition, in which, at any moment in time, a single agent has control until it has solved its particular subtask.
  • a framework is provided for communicating agents that aims to generalize the traditional hierarchical decomposition and allow for more flexible task decompositions.
  • decompositions where multiple subtasks have to be solved in parallel, or in cases where a subtask does not have a well-defined end but rather is a continuing process that needs constant adjustment e.g., walking through a crowded street.
  • This framework can be referred to as a separation-of-concerns framework.
  • a reward function for a specific agent is provided that not only has a component depending on the environment state, but also a component depending on the communication actions of the other agents.
  • Depending on the specific mixture of these components, agents have different degrees of independence.
  • Because the reward in general is state-specific, an agent can show different levels of dependence in different parts of the state-space.
  • In areas with high environment-reward, an agent will act independently of the communication actions of other agents, while in areas with low environment-reward, an agent’s policy will depend strongly on the communication actions of other agents.
  • the framework can be seen as a sequential multi-agent decision making system with non-cooperative agents. This is a challenging setting, because from the perspective of one agent, the environment is non-stationary due to the learning of other agents. This challenge is addressed by defining trainer agents with a fixed policy. Learning with these trainer agents can occur, for example, by pre-training agents and then freezing their policy, or by learning in parallel using off-policy learning.
  • Disclosed embodiments further relate to improvements to machine learning and, in particular, reinforcement learning.
  • a hybrid reward architecture that takes as input a decomposed reward function and learns a separate value function for each component reward function. Because each component typically depends only on a subset of all features, the overall value function is much smoother and can be more easily approximated by a low-dimensional representation, enabling more effective learning.
  • FIG.1 illustrates an example scenario involving a robot reaching pieces of fruit scattered across a grid.
  • FIG.2 illustrates an example separation of concern model for two agents.
  • FIG.3 illustrates an example generalized decomposition of a single-agent task using n agents.
  • FIG. 4 illustrates subclasses of agents, including fully independent agents, agents with an acyclic relationship, agents with a cyclic relationship, and an acyclic relationship with trainer agents to break cycles in cyclic dependency graphs.
  • FIG.5 illustrates a falling fruit example scenario.
  • FIG.6 illustrates an example application of a separation of concerns model on a tabular domain.
  • FIG. 7 illustrates learning behavior for tasks with different levels of complexity.
  • FIG.8 illustrates an average return over 4,000 episodes for a different number of no-op actions.
  • FIG. 9 illustrates a network used for the flat agent and the high level agent versus a network used for a low-level agent.
  • FIG. 10A illustrates a learning speed comparison between a separation of concerns model and a flat agent for a 24 × 24 grid.
  • FIG. 10B illustrates a learning speed comparison between a separation of concerns model and a flat agent for a 48 × 48 grid.
  • FIG. 10C illustrates a learning speed comparison between a separation of concerns model and a flat agent on an 84 × 84 grid.
  • FIG. 11 illustrates the effect of varying communication reward on the final performance of a separation of concerns system on a 24 × 24 game of catch.
  • FIG. 12 illustrates the effect of different action selection intervals (asi) for the high-level agent of the separation of concerns system on 84 × 84 catch.
  • FIG. 13 illustrates the effect of penalizing communication for the high-level agent on the final performance of a separation of concerns system on a 24 × 24 catch game.
  • FIG. 14A shows the learning speed of a separation of concerns model compared to baselines for average score over a number of epochs.
  • FIG. 14B shows the learning speed of a separation of concerns model compared to baselines for average number of steps over a number of epochs.
  • FIG. 15A shows separation of concern agent results for average score over a number of epochs with and without pre-training on Pac-Boy.
  • FIG. 15B shows separation of concern agent results for average number of steps over a number of epochs with and without pre-training on Pac-Boy.
  • FIG.16 illustrates an architecture of an example aggregator.
  • FIG.17 illustrates an example attractor.
  • FIG.18 illustrates an example three-pellet attractor in Pac-Boy.
  • FIG.19 illustrates an example situation in Pac-Boy without a no-op action.
  • FIG. 20A illustrates average scores of a multi-advisor model in Pac-Boy against baselines.
  • FIG. 20B illustrates average episode length of a multi-advisor model in Pac-Boy against baselines.
  • FIG.20C illustrates average scores for different methods in Pac-Boy.
  • FIG.21 illustrates average performance for this experiment with noisy rewards.
  • FIG.22 illustrates an example single-head architecture.
  • FIG.23 illustrates an example Hybrid Reward Architecture (HRA).
  • HRA Hybrid Reward Architecture
  • FIG. 24 illustrates example DQN, HRA, and HRA with pseudo-rewards architectures.
  • FIG.25A illustrates example average steps over episodes of the fruit collection task.
  • FIG.25B illustrates example average steps over episodes of the fruit collection task.
  • FIGS. 26A–D illustrate four different maps in the ATARI 2600 game MS. PAC-MAN.
  • FIG. 27 illustrates training curves for incremental head additions to the HRA architecture.
  • FIG. 28 compares training curves of HRA with the Asynchronous Advantage Actor-Critic (A3C) baselines.
  • FIG. 29 illustrates a training curve for HRA in the game MS. PAC-MAN smoothed over 100 episodes for a level passing experiment.
  • FIG. 30 illustrates training curves for HRA in the game MS. PAC-MAN over various γ values without executive memory.
  • FIG. 31 illustrates training curves for HRA in the game MS. PAC-MAN for various γ values with executive memory.
  • FIG. 32 illustrates an example process for taking an action with respect to a task using separation of concerns.
  • FIG. 33 illustrates an example separation of concerns engine implementing a process for completing a task using separation of concerns.
  • FIG.34 illustrates an example hybrid reward engine.
  • FIG. 35 illustrates physical components of a computing device with which aspects of the disclosure may be practiced.
  • FIG.36A illustrates an example mobile computing device.
  • FIG. 36B illustrates the architecture of one aspect of a mobile computing device.
  • FIG. 37 illustrates an aspect of an architecture of a system for processing data received at a computing system from a remote source, such as a general computing device, tablet computing device, or mobile computing device.
  • Hierarchical learning decomposes a value function in a hierarchical way.
  • Options are temporally extended actions consisting of an initialization set, an option policy and a termination condition. Effectively, applying options to a Markov decision process (MDP) changes it into a semi-MDP, which may provide a mechanism for skill discovery.
  • For option discovery in the tabular setting, useful sub-goal states can be identified, for example, by using heuristics based on the visitation frequency, by using graph partitioning techniques, or by using the frequency with which state variables change. However, with function approximation, finding good sub-goals becomes significantly more challenging. In some cases, sub-goal states are identified so that only the option policy is learned. Option discovery may also be performed by identifying ‘purposes’ at the edge of a random agent’s visitation area. Learning options towards such edge-purposes brings the agent quickly to a new region where it can continue exploration. An architecture is provided that may learn the policy over options, the options themselves, as well as their respective termination conditions. This is accomplished without defining any particular sub-goal and requires only that the number of options be known beforehand.
  • Hierarchical reinforcement learning in the context of deep reinforcement learning is also described.
  • a high-level controller may specify a goal for a low-level controller. Once the goal is accomplished, the top-level controller selects a new goal for the low-level controller.
  • the system can be trained in two phases: in the first phase the low-level controller is trained on a set of different goals; and in the second phase the high-level and low-level controllers are trained in parallel.
  • the high-level controller can send a modulation signal to the low-level controller to affect the policy of the low-level controller.
  • An example multi-agent RL configuration includes multiple agents which are simultaneously acting on an environment and which receive rewards individually based on the joint actions. Such an example can be modelled as a stochastic game.
  • Multi-agent systems can be divided into fully cooperative, fully competitive or mixed tasks (neither cooperative nor competitive). For a fully cooperative task, all agents share the same reward function.
  • ILS Integrated Learning System
  • heterogeneous learning agents such as search-based and knowledge-based
  • LCW Learning by Watching
  • Separation of Concerns (SoC) improves multi-agent frameworks. For instance, SoC splits a single-agent problem into multiple parallel, communicating agents with simpler and more focused, but different objectives (e.g., skills). An introductory example is detailed below with reference to FIG. 1.
  • FIG. 1 illustrates an example layout 100 for this introductory example, including three pieces of fruit 102 and the robot 104 with arrows 106 indicating potential directions of movement within a grid of possible positions 108.
  • the goal of the robot 104 is to reach each piece of fruit 102 scattered across the possible positions 108 as quickly as possible (e.g., in the fewest possible actions).
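  • an agent controlling the robot 104 aims to maximize a return, $G_t$, which is the expected discounted sum of rewards: $G_t := \sum_{k=1}^{\infty} \gamma^{k-1} R_{t+k}$, where $\gamma$ is the discount factor and $R_{t+k}$ is the reward received at time step $t+k$.
  • the robot 104 receives a reward of “+1” once all of the pieces of fruit 102 are reached, otherwise the reward is 0.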
  • the fruit 102 can be placed randomly at different positions 108 at the start of each episode.
  • the optimal policy uses a minimal number of actions to reach all of the fruit 102.
  • In deep reinforcement learning, a task can often be mapped to some low-dimensional representation that can accurately represent the optimal policy.
  • NP-complete nondeterministic polynomial time complete
  • NP-hard i.e., at least as hard as the hardest problem in NP
  • each piece of fruit 102 may be assigned to a specific agent whose only learning objective is to estimate the optimal action-value function for reaching that piece of fruit 102.
  • This agent sees a reward of +1 only if its assigned fruit 102 is reached and otherwise sees no reward.
  • the state-space for this agent can ignore all other fruit 102 because they are irrelevant for its value function.
  • An aggregator can then make the final action selection from among the agents assigned to the pieces of fruit 102. Therefore, a single state-space is replaced by n smaller state-spaces, one per piece of fruit 102.
  • the aggregator can, for example, use a voting scheme, select its action based on the summed action-values, or select its action according to the agent with the highest action-value. This last form of action selection could result in greedy behavior, with the agent always taking an action toward reaching the closest piece of fruit 102, which correlates well with the performance metric. Other domains, however, might require a different aggregator.
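The three aggregation schemes named above (voting, summed action-values, highest single action-value) can be sketched as follows. This is a minimal illustration rather than the disclosed implementation: the action names, the dictionary representation of each agent's action-values, the tie-breaking rule, and the example numbers are all assumptions made for the sketch.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical action names; the source does not fix an action encoding.
ACTIONS = ["north", "south", "east", "west"]

QValues = Dict[str, float]  # maps action -> one agent's action-value estimate


def greedy_action(q: QValues) -> str:
    """Action with the highest value for a single agent (first one wins ties)."""
    return max(ACTIONS, key=lambda a: q[a])


def aggregate_by_vote(agent_qs: List[QValues]) -> str:
    """Each agent votes for its greedy action; the most-voted action is taken."""
    votes = Counter(greedy_action(q) for q in agent_qs)
    return max(ACTIONS, key=lambda a: votes[a])


def aggregate_by_sum(agent_qs: List[QValues]) -> str:
    """Take the action whose action-values, summed over all agents, are highest."""
    return max(ACTIONS, key=lambda a: sum(q[a] for q in agent_qs))


def aggregate_by_max(agent_qs: List[QValues]) -> str:
    """Follow the single agent that reports the highest action-value."""
    return max(ACTIONS, key=lambda a: max(q[a] for q in agent_qs))


# One agent per piece of fruit; the numbers are made up for illustration.
example_qs = [
    {"north": 0.9, "south": 0.1, "east": 0.2, "west": 0.0},
    {"north": 0.1, "south": 0.3, "east": 0.5, "west": 0.2},
    {"north": 0.0, "south": 0.2, "east": 0.6, "west": 0.1},
]
```

With these made-up values, voting and summing both pick "east", while the highest-value rule follows the first agent and picks "north", which is the greedy behavior toward a single piece of fruit described above.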
  • Disclosed embodiments include agent configurations that decompose tasks in different ways. These agent configurations can reduce an overall state space and allow for improved machine learning performance by increasing a convergence speed, reducing the amount of processing and memory resources consumed, among other improvements to computer technology.
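  • A single-agent task is defined by a Markov decision process (MDP), including the tuple $\langle X, A, p, r, \gamma \rangle$, where $X$ is the set of states; $A$ is the set of actions; $p(x' \mid x, a)$ indicates the probability of a transition to state $x' \in X$ when action $a \in A$ is taken in state $x \in X$; $r(x, a, x')$ indicates the reward for a transition from state $x$ to state $x'$ under action $a$; and $\gamma$ is the discount factor.
  • A flat agent can be defined by an MDP including the tuple $\langle X^{\text{flat}}, A^{\text{flat}}, p^{\text{flat}}, r^{\text{flat}}, \gamma^{\text{flat}} \rangle$. A performance objective of a SoC model can be to maximize a flat return defined by $r^{\text{flat}}$ and $\gamma^{\text{flat}}$.
  • Each policy $\pi$ has a corresponding action-value function $q_\pi(x, a)$, which gives the expected value of the return $G_t$ conditioned on the state $x$ and action $a$.
  • A goal is to maximize the discounted sum of rewards, also referred to as the return: $G_t := \sum_{k=1}^{\infty} \gamma^{k-1} R_{t+k}$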
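As a small worked example of the return, the sketch below computes a discounted sum of rewards for two hypothetical episodes of the fruit task, in which the only non-zero reward is the +1 received when the last piece of fruit is reached. The discount factor of 0.9, the episode lengths, and the convention that the first list entry is the first reward received are assumptions chosen for illustration.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Return sum_k gamma**k * rewards[k], accumulated from the last reward back."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Fruit example: reward +1 only on the step where the last fruit is reached.
fast_episode = [0.0, 0.0, 1.0]            # all fruit reached after 3 actions
slow_episode = [0.0, 0.0, 0.0, 0.0, 1.0]  # all fruit reached after 5 actions

fast_return = discounted_return(fast_episode, gamma=0.9)  # 0.9 ** 2
slow_return = discounted_return(slow_episode, gamma=0.9)  # 0.9 ** 4
```

Because the discount shrinks later rewards, the shorter episode has the larger return, which is why maximizing the return favors reaching the fruit in as few actions as possible.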
  • FIG. 2 illustrates an example SoC model for taking actions with respect to an environment (illustrated as Environment).
  • the SoC model can act no differently from a flat agent: the model takes an action A (as illustrated, A) with respect to the environment and can receive a state X (as illustrated, X) of the environment.
  • the illustrated SoC model includes two agents illustrated as Agent 1 and Agent 2.
  • An example task can be expanded into a system of communicating agents as follows. For each agent i (as illustrated, Agent 1 and Agent 2), an environment action-set $B^i$ is defined (as illustrated, $B^1$ and $B^2$), as well as a communication action-set $C^i$ (as illustrated, $C^1$ and $C^2$), and a learning objective.
  • the learning objective can be defined by a reward function, $r^i$, plus a discount factor, $\gamma^i$.
  • the agents share a common state-space Y (as illustrated, the dashed ellipse marked with Y) including the state-space of the flat agent plus the joint communication actions: $Y := X^{\text{flat}} \times C^1 \times \cdots \times C^n$, where $X^{\text{flat}}$ denotes the state-space of the flat agent.
  • each agent i observes state $Y_t \in Y$.
  • At each time step $t$, each agent i can also select an environment action $B^i_t \in B^i$ and a communication action $C^i_t \in C^i$.
  • The resulting action is applied to the environment, which responds with an updated state $X_{t+1}$.
  • the environment also produces a reward $R_{t+1}$.
  • this reward is only used to measure the overall performance of the SoC model.
  • For learning, each agent i uses its own reward function, $r^i$, to compute its reward, $R^i_{t+1}$.
  • a property of the SoC model can include that the reward function of a particular agent depends on the communication actions of the other agents. This can provide an incentive for an agent to react in response to communication, even in the case of full observability. For example, agent A can ‘ask’ agent B to behave in a certain way via a communication action that rewards agent B for this behavior.
  • FIG.3 illustrates an example generalized way to decompose a single-agent task using n agents (as illustrated, Agent 1 through Agent n).
  • At each time step, an agent i chooses an action, which includes an environment action and a communication action.
  • The environment actions of the agents can be fed into an aggregator function f (as illustrated, f).
  • the aggregator function f maps the environment actions of the n agents to an action on the environment.
  • the communication actions of the agents are combined into a set of communication actions. That set is subsequently combined with the flat state to form the input space of the agents.
  • the input space of an agent (illustrated as set y) can be based on communication actions from previous time steps and an updated flat state.
  • communication signals can be regarded as the environment of a meta-MDP.
  • a single time step delay of communication actions can be used for a general setting where all agents communicate in parallel.
  • an agent may be partially observable or have limited visibility such that the agent does not see a full flat state-space or all communication actions.
  • each agent can receive a subset of the input space. Formally, the state space $Y^i$ of an agent i is a projection of $Y$ onto a subspace of $Y$, such as: $Y^i = \sigma^i(Y)$
  • each agent can have its own reward function
  • each implementation of the general SoC model can be divided into different categories. These categories can be based on the relation between the different agents.
  • the sequence of random variables is a Markov chain. This can be formalized by letting define a set of stationary policies for all agents, and ⁇ be
  • agent i when all agents except agent i use a stationary policy, the task for agent i becomes Markov. This trivially holds if agent i is not partially observable (e.g., if
  • agent i can be defined as independent of agent j if the policy of agent j does not affect the transition dynamics of agent i in any way.
  • definitions with to be a set of stationary policies to each agent except agent i and j, and to be the space of all such sets. Then, agent i
  • Agent i is dependent on agent j if it is not independent of agent j.
  • dependency relations of SoC agents can be shown using a dependency graph.
  • FIG.4 illustrates subclasses of agents, including fully independent agents 402, agents with an acyclic relationship 404, agents with a cyclic relationship 406, and agents with an acyclic relationship 408 that uses trainer agents to break cycles in cyclic dependency graphs.
  • An arrow pointing from an agent j (e.g., illustrated agents 1 and/or 2) to an agent i (e.g., illustrated as agents 1, 2 and/or 3) means that agent i depends on agent j.
  • Circles represent regular agents (e.g., agents 1, 2, 3) and diamonds represent trainer agents (e.g., trainer agents 1′ and/or 2′).
  • a dependency graph can be acyclic (containing no directed cycles) or cyclic (containing directed cycles).
  • agents are fully independent (e.g., as shown by relationship 402 in FIG.4)
  • the nine actions of an agent controlling the robot can be split into a horizontal action set (e.g., west movement, east movement, and no-op actions) and a vertical action set (e.g., north movement, south movement, and no-op actions) such that the flat action set is the Cartesian product of the two sets (3 × 3 = 9 actions). The task can then be decomposed into two kinds of agents: horizontal agents and vertical agents.
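The split of the nine flat actions into a horizontal and a vertical action set can be sketched as below. The action labels and the tuple encoding of a flat action are assumptions for illustration; the source only names the movements and the no-op.

```python
from itertools import product

# Hypothetical labels for the two action sets described above.
HORIZONTAL = ("west", "east", "no-op")
VERTICAL = ("north", "south", "no-op")

# Every flat action is one horizontal choice paired with one vertical choice.
FLAT_ACTIONS = list(product(HORIZONTAL, VERTICAL))


def combine(horizontal_action: str, vertical_action: str) -> tuple:
    """Aggregator: pair the two independently chosen actions into one flat action."""
    if horizontal_action not in HORIZONTAL or vertical_action not in VERTICAL:
        raise ValueError("unknown action")
    return (horizontal_action, vertical_action)
```

Each agent then chooses among three actions instead of nine, and the pairing step recovers a flat action such as `("west", "north")`, i.e. a diagonal move.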
  • the horizontal agents can see the state and receive a
  • a vertical agent can be defined similarly for a vertical direction. With these agents being fully independent, it follows that the agents converge independently of each other. Hence, stable parallel learning occurs.

Agents with Acyclic Dependencies

  • When a dependency graph is acyclic (e.g., as shown by relationship 404 in FIG. 4), some of the agents depend on other agents, while some agents are fully independent.
  • An example of such a relationship is shown in FIG. 5.
  • FIG. 5 illustrates a falling fruit example scenario exhibiting an acyclic dependency graph.
  • a robot 102 catches falling fruit 104 with a basket 106 to receive a reward.
  • the basket 106 is attached to the robot's body 108 with an arm 110 that can be moved relative to the body 108.
  • the robot 102 can move horizontally. Independent of that motion, the robot 102 can move the basket 106 a limited distance to the left or right.
  • the agent for the body 108 can control the body 108 by observing the horizontal position of the piece of fruit 104, the vertical position of the piece of fruit 104, and the horizontal position of the robot 102.
  • the agent for the arm 110 can control the arm 110 and observe horizontal position of the piece of fruit 104, the vertical position of the piece of fruit 104, the horizontal position of the robot 102, and the horizontal position of the basket 106.
  • the agent for the arm 110 can receive a reward if the piece of fruit 104 is caught.
  • the agent for the body 108 is fully independent while the agent for the arm 110 depends on the agent for the body 108.
  • An acyclic graph contains some fully independent agents that have policies that will converge independent of other agents. Once these policies have converged, the agents that only depend on these agents will converge, and so on, until all agents have converged. Here too stable parallel learning occurs.
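The convergence argument above (independent agents first, then the agents that depend only on them, and so on) corresponds to a topological ordering of the dependency graph. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later) to compute such an order and to detect a cyclic graph. The agent names and the example graphs are assumptions; in particular, the graph with trainer agents 1′ and 2′ is a hypothetical arrangement, not a transcription of FIG. 4.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Optional, Set


def training_order(depends_on: Dict[str, Set[str]]) -> Optional[List[str]]:
    """Order agents so that each comes after every agent it depends on.

    `depends_on` maps an agent to the set of agents it depends on.
    Returns None when the dependency graph contains a directed cycle.
    """
    try:
        return list(TopologicalSorter(depends_on).static_order())
    except CycleError:
        return None


# Falling-fruit example: the arm agent depends on the body agent.
falling_fruit = {"body": set(), "arm": {"body"}}

# Three agents that all depend on each other (a cyclic relationship).
cyclic = {"1": {"2", "3"}, "2": {"1", "3"}, "3": {"1", "2"}}

# Hypothetical repair: agents 1 and 2 depend only on fixed trainer agents.
with_trainers = {"1": {"1'"}, "2": {"2'"}, "3": {"1", "2"}}
```

For the falling-fruit graph the order is body first, then arm. The fully cyclic graph has no valid order, while the hypothetical graph with trainer agents does.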
  • FIG. 4 also illustrates a relationship 406 exhibiting a cyclic dependency.
  • the behavior of agent 1 depends on the behavior of agents 2 and 3
  • the behavior of agent 2 depends on the behavior of agents 1 and 3
  • the behavior of agent 3 depends on the behavior of agents 1 and 2.
  • both agents see the full state-space and the agents receive a reward when the fruit 104 is caught. Now both agents depend on each other, forming a cyclic dependency.
  • the approach of pre-training a low-level agent with some fixed policy, then freezing its weights and training a high-level policy using the pre-trained agent, may be a more general update strategy.
  • Relationship 408 in FIG. 4 illustrates an acyclic relationship formed by transforming a cyclic graph into an acyclic graph using trainer agents.
  • a trainer agent for an agent i defines fixed behavior for the agents that agent i depends on to ensure stable learning. It is to be appreciated with the benefit of this description that if the dependency graph is an acyclic graph, using single-agent Q-learning to train the different agents is straightforward.
  • the trainer agent assigned to a particular agent i can be a fixed-policy agent that generates behavior for the agents on which agent i depends, such that their effect on agent i is replaced by the effect of the trainer agent.
  • The trainer agent implicitly defines a stationary MDP for agent i with a corresponding optimal policy that can be learned.
  • agent i only depends on the trainer agent.
  • the trainer agent itself is an independent agent.
  • trainer agents can be used to break cycles in dependency graphs.
  • a cyclic graph can be transformed into an acyclic one in different ways. In practice, which agents are assigned trainer agents is a design choice that depends on how easy it is to define effective trainer behavior. In the simplest case, a trainer agent can just be a random or semi-random policy.
  • agent 1 depended on the behavior of agents 2 and 3.
  • Learning with trainer agents can occur in two ways.
  • a first way is to pre-train agents with their respective trainer agents and then freeze their weights and train the rest of the agents.
  • a second way is to train all agents in parallel with the agents that are connected to a trainer agent using off-policy learning to learn values that correspond to the policy of the trainer agent, while the behavior policy is generated by the regular agents.
  • Off-policy learning can be achieved by importance sampling, which corrects for the frequency at which a particular sample is observed under the behavior policy versus the frequency at which it is observed under the target policy. For example, consider an agent i that depends on an agent j.
  • Further, agent i has a trainer agent i′ attached to it, mimicking behavior for agent j.
  • In other words, trainer agent i′ has the same actions as agent j.
  • agent j is generated by agents i and j. If at time agent j selects action while the selection
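The importance-sampling correction described above can be sketched as a ratio between the selection probability of an action under the target policy (here, the trainer agent) and under the behavior policy (here, agent j). This is a generic sketch and not the update rule of the disclosure: the action names `b1` and `b2`, the probabilities, and the form of the weighted update are assumptions for illustration.

```python
from typing import Dict


def importance_weight(action: str,
                      target_probs: Dict[str, float],
                      behavior_probs: Dict[str, float]) -> float:
    """Ratio of selection probabilities: target policy over behavior policy."""
    return target_probs[action] / behavior_probs[action]


def weighted_update(q: float, td_target: float, rho: float,
                    step_size: float = 0.1) -> float:
    """Move q toward td_target, scaling the step by the importance weight rho."""
    return q + step_size * rho * (td_target - q)


# Hypothetical policies over the two actions of agent j.
behavior = {"b1": 0.8, "b2": 0.2}  # actual behavior, generated by agent j
trainer = {"b1": 0.5, "b2": 0.5}   # fixed policy of the trainer agent

# Made-up rewards, used only to check that the correction is unbiased.
reward = {"b1": 1.0, "b2": 0.0}
expected_under_trainer = sum(trainer[a] * reward[a] for a in reward)
corrected_under_behavior = sum(
    behavior[a] * importance_weight(a, trainer, behavior) * reward[a]
    for a in reward
)
```

Samples of `b1` are over-represented in the behavior relative to the trainer policy, so they are down-weighted (weight 0.625), while samples of `b2` are up-weighted (weight 2.5). Averaged over the behavior policy, the weighted rewards match the expectation under the trainer policy.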
  • recursive optimality can be defined as a type of local optimality, in which the policy for each subtask is optimal given the policies of its child-subtasks.
  • a recursive optimal policy is an overall policy that includes the combination of all locally-optimal policies. The recursive optimal policy is generally less desirable than the optimal policy for a flat agent, but can be easier to determine.
  • a similar form of optimality can be defined for a SoC model. If the dependency graph of a SoC model is acyclic (with or without added trainer agents), then a recursive optimal SoC policy can be defined as the policy including all locally optimal policies. In other words, policy is optimal for agent i, given the policies of the agents on which agent i depends.
  • Ensemble learning includes the use of a number of weak learners to build a strong learner. Weak learning can be difficult to use due to difficulties in framing RL problems into smaller problems. In some examples, there can be a combination of strong RL algorithms with policy voting or value function averaging to build an even stronger algorithm.
  • SoC allows for ensemble learning in RL with weak learners through local state space and local reward definitions.
  • SoC agents can train their policies on the flat action space on the basis of a local state space and reward
  • the agents may instead inform
  • the SoC agents can be trained off-policy based on the actions taken by the aggregator because the aggregator is the controller of the SoC system.
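  • stable (off-policy) learning occurs if the state-space of each agent is Markov. That is, stable (off-policy) learning occurs if for all agents i: $P(Y^i_{t+1} \mid Y^i_t, A_t) = P(Y^i_{t+1} \mid Y^i_0, A_0, \ldots, Y^i_t, A_t)$, where $Y^i_t$ is the state observed by agent i at time $t$ and $A_t$ is the action taken by the aggregator at time $t$.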
  • agents can be organized in a way that decomposes a task hierarchically. For instance, there can be three agents where Agent 0 is a top-level agent, and Agent 1 and Agent 2 are each bottom-level agents. The top-level agent only has communication actions, specifying which of the bottom level agents is in control. In other words, Agent 1 and Agent 2 both have a state-dependent action-set that gives access to the environment actions A if they have been given control by Agent 0. That is, for Agent 1:
  • By allowing Agent 0 to only switch its action once the agent currently in control has reached a terminal state (e.g., by storing a set of terminal state conditions itself or by being informed via a communication action), a typical hierarchical task decomposition can be achieved.
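The hierarchical arrangement above can be sketched with a top-level agent whose only output is a communication action naming the bottom-level agent in control, and a state-dependent action set for the bottom-level agents. The environment action names, the empty action set for an agent that is not in control, and the class and method names are assumptions for illustration.

```python
from typing import Tuple

ENV_ACTIONS = ("left", "right", "stay")  # hypothetical environment actions A


def available_actions(agent_id: int, in_control: int) -> Tuple[str, ...]:
    """A bottom-level agent can use the environment actions only while in control."""
    return ENV_ACTIONS if agent_id == in_control else ()


class TopLevelAgent:
    """Agent 0: has only communication actions, naming the agent in control."""

    def __init__(self, initial_choice: int) -> None:
        self.in_control = initial_choice

    def communicate(self, preferred: int, current_agent_terminal: bool) -> int:
        # Switch control only once the agent in control reached a terminal state.
        if current_agent_terminal:
            self.in_control = preferred
        return self.in_control
```

For example, if Agent 0 starts with Agent 1 in control and prefers Agent 2 on the next step, control stays with Agent 1 until Agent 1 reports a terminal state, which reproduces the usual hierarchical hand-over.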
  • a SoC model can be a generalization of a hierarchical model.
  • an implicit MDP is defined for agent i with state space Y, reward function and (joint)
  • agent i can be independent of agent j if the value does not depend on the policy of agent j.
  • a simple example of a case where this independence relation holds is the hierarchical case, where the actions of the top agent remain fixed until the bottom agent reaches a terminal state.
  • a high-level controller specifies a goal for the low-level controller. Once the goal is accomplished, the top-level controller selects a new goal for the low-level controller.
  • the system can be trained in two phases: in the first phase the low-level controller is trained on a set of different goals; in the second phase the high-level and low-level controllers are trained in parallel.
  • FIG.6 illustrates an application of the SoC model on a navigation task within a tabular domain to show the scalability of the SoC model.
  • the goal is to navigate a vehicle 602 from a start position 604 to an end position 606 through a maze formed by walls 608 and navigable, open positions 610.
  • the action set of the vehicle 602 includes a move forward action that moves the vehicle 602 one position 610 forward, a turn clockwise action that rotates the vehicle 602 90-degrees clockwise and a turn counterclockwise action that rotates the vehicle 602 90-degrees counterclockwise.
  • a varying number of extra ‘no-op’ actions was added to control the complexity of the domain.
  • the agent controlling the vehicle 602 received a reward of −5 when the vehicle 602 bumped into a wall 608 and a reward of −1 for all other actions.
  • a flat agent controlling the vehicle 602 was compared with a SoC agent controlling the vehicle 602.
  • the SoC agent included a high-level agent and a low-level agent.
  • the high-level agent communicated a compass direction to the low-level agent
  • the reward function of the low-level agent was such that the agent receives a reward of −5 for hitting the wall and a reward of +1 if it made a move in the direction requested by the high-level agent. All agents were trained with Q-learning and used ε-greedy exploration with a fixed ε value of 0.01 and a step size of 0.1.
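The training setup described above (tabular Q-learning, ε-greedy exploration with ε = 0.01, step size 0.1, and the low-level reward of −5 or +1) can be sketched as follows. This is a minimal sketch, not the experimental code: the discount factor of 0.99, the reward of 0 when neither condition applies, the action labels, and the state encoding are assumptions.

```python
import random
from collections import defaultdict

EPSILON = 0.01   # fixed exploration rate given in the text
STEP_SIZE = 0.1  # step size given in the text
GAMMA = 0.99     # assumption: the text does not state the discount factor

ACTIONS = ("forward", "turn_clockwise", "turn_counterclockwise")


def low_level_reward(hit_wall: bool, moved_dir, requested_dir) -> float:
    """Reward of the low-level agent: -5 for a wall hit, +1 for obeying the request."""
    if hit_wall:
        return -5.0
    if moved_dir is not None and moved_dir == requested_dir:
        return 1.0
    return 0.0  # assumption: no reward in all other cases


def epsilon_greedy(q, state, rng, epsilon=EPSILON):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q[(state, a)])


def q_learning_update(q, state, action, reward, next_state,
                      step_size=STEP_SIZE, gamma=GAMMA):
    """One tabular Q-learning step toward reward + gamma * max_a q(next_state, a)."""
    best_next = max(q[(next_state, a)] for a in ACTIONS)
    td_error = reward + gamma * best_next - q[(state, action)]
    q[(state, action)] += step_size * td_error


q_table = defaultdict(float)  # unseen (state, action) pairs start at 0.0
```

A usage sketch: after `q_learning_update(q_table, "s0", "forward", low_level_reward(False, "north", "north"), "s1")`, the entry for `("s0", "forward")` moves from 0.0 toward the reward by one step of size 0.1.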
  • FIG. 7 shows the learning behavior within the experiment for tasks with different levels of complexity (e.g., no-op actions). Specifically, the average return of agents for tasks with 5, 10 and 20 no-op actions were compared. While the number of no-op actions had only a small effect on the performance of the SoC method, it affected the flat agent considerably by increasing the number of episodes it took for the flat agent to converge. This is further illustrated in FIG. 8.
  • FIG. 8 illustrates the average return for the SoC and flat agents over 4,000 episodes for a varying number of no-op actions.
  • the curve shows that the SoC agent is more robust than the flat agent as the number of no-op actions increased because the average return for the SoC agent decreased significantly less than the average return of the flat agent as the number of no-op actions increased.
  • the results shown in FIGS. 7 and 8 illustrate the ability of the SoC model to improve the scalability.
  • the high-level agent knows the available compass directions in each grid-cell to avoid giving the low-level agent a goal that it cannot fulfill. For example, the vehicle 602 cannot move "north" while the vehicle 602 is in the start position 604 because doing so would cause the vehicle 602 to hit a wall 608.
  • the high-level agent of the SoC system does not require this information and therefore has a smaller state space and has improved performance.
  • a flat agent was compared with the SoC model on the game Catch.
  • Catch is a simple pixel-based game involving a 24 × 24 screen of pixels in which the goal is to guide a basket moving along the bottom of the screen to catch a ball that is dropped at a random location at the top of the screen.
  • both the ball and the basket are a single pixel in size.
  • An agent can give the basket the following actions: left (which moves the basket one pixel to the left), right (which moves the basket one pixel to the right), and stay (which causes the basket to remain in place).
  • the agent received a reward of 1 for catching the ball, a reward of −1 if the ball reached the bottom of the screen without being caught, and a reward of 0 otherwise.
  • the SoC model for the Catch experiment includes a high-level and a low-level agent.
  • the high-level agent has no direct access to the environment actions, but the high-level agent communicates a desired action to the low-level agent:
  • the low-level agent has direct access to
  • the high-level agent has a discount factor of 0.99 and has access to the full screen, while the low-level agent has a discount factor of 0.65 and uses an optional bounding box of 10 × 10 pixels around the basket.
  • the low-level agent only observes the ball when it is inside the bounding box.
  • the high-level agent received a reward of 1 if the ball was caught and a reward of −1 otherwise.
  • the low-level agent received a reward of 1 if the ball was caught and a reward of −1 otherwise.
  • the low-level agent received a small positive reward for taking an action suggested by the high-level agent.
  • the high-level agent took an action every two time steps, while the low-level agent took an action every time step.
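The interaction between the two agents described above can be sketched as a simple control loop. This is a minimal sketch under stated assumptions: the function names, the callable policies, and the bonus value `comm_bonus=0.1` are hypothetical, since the description only says that the communication reward is small and positive.

```python
def catch_low_level_reward(env_reward, action, suggested_action, comm_bonus=0.1):
    """Environment reward plus a small bonus for following the suggestion.

    comm_bonus is a hypothetical value; the description does not give one.
    """
    bonus = comm_bonus if action == suggested_action else 0.0
    return env_reward + bonus


def run_episode(num_steps, high_policy, low_policy, interval=2):
    """Let the high-level agent act every `interval` steps, the low-level every step.

    high_policy(t) returns a suggested action; low_policy(t, suggestion)
    returns the environment action. Both are placeholder callables.
    """
    suggestion = None
    trace = []
    for t in range(num_steps):
        if t % interval == 0:
            suggestion = high_policy(t)  # a new suggestion only on these steps
        action = low_policy(t, suggestion)
        trace.append((t, suggestion, action))
    return trace
```

Between two high-level decisions the last suggestion stays in force, so the low-level agent can still collect the communication bonus on the intermediate step.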
  • Both the flat agent and the high-level and low-level agents were trained using a Deep Q-Network (DQN).
  • the flat agent used a convolutional neural network defined as follows: the 24 × 24 binary image was passed through two convolutional layers, followed by two dense layers. Both convolutional layers had 32 filters of size (5,5) and a stride of (2,2). The first dense layer had 128 units, followed by the output layer with 3 units.
  • the high-level agent in the SoC system used an identical architecture to that of the flat agent. However, due to the reduced state size for the low-level agent, it only used a small dense network instead of a full convolution network.
  • the network flattened the 10 × 10 input and passed it through two dense layers with 128 units each. The output was then concatenated with a 1-hot vector representing the communication action of the high-level agent. The merged output was then passed through a dense layer with 3 units.
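The forward pass of the low-level network described above can be sketched in plain Python. The layer sizes follow the description; the weight initialization, the helper names, and the number of communication actions (`num_comm_actions=3`) are assumptions made only for illustration, and the rectified activations on the hidden layers follow the later description of this network.

```python
import random


def dense(inputs, weights, biases, relu=True):
    """Fully connected layer; weights is a list of rows, one row per output unit."""
    outputs = []
    for row, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(row, inputs)) + bias
        outputs.append(max(0.0, z) if relu else z)
    return outputs


def init_layer(n_in, n_out, rng):
    """Random weights and zero biases (the initialization scheme is an assumption)."""
    scale = 1.0 / n_in ** 0.5
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]


def low_level_q_values(window, comm_action, params, num_comm_actions=3):
    """Map a 10 x 10 window and a communication action to 3 action-values."""
    flat = [float(pixel) for row in window for pixel in row]  # 100 inputs
    hidden = dense(flat, *params["dense1"])                   # 128 units
    hidden = dense(hidden, *params["dense2"])                 # 128 units
    merged = hidden + one_hot(comm_action, num_comm_actions)  # concatenation
    return dense(merged, *params["out"], relu=False)          # linear output


rng = random.Random(0)
params = {
    "dense1": init_layer(100, 128, rng),
    "dense2": init_layer(128, 128, rng),
    "out": init_layer(128 + 3, 3, rng),  # 3 hypothetical communication actions
}
```

Because the communication action enters only at the last layer, the features computed from the bounding box are shared across all suggestions of the high-level agent.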
  • FIG.9 illustrates a network used for the flat agent and the high level agent 902 versus a network used for the low-level agent 904. Because the low-level agent used a bounding box, it does not require a full convolutional network.
  • FIGS. 10A–10C show the results of the comparison of performance between a SoC model and a flat agent showing the average score of each agent over a number of epochs for three different grid sizes.
  • FIG. 10A illustrates a learning speed comparison between a SoC model and a flat agent for a 24 × 24 grid.
  • FIG. 10B illustrates a learning speed comparison between a SoC model and a flat agent for a 48 × 48 grid.
  • FIG. 10C illustrates a learning speed comparison between a SoC model and a flat agent for an 84 × 84 grid.
  • the SoC model learned significantly faster than the flat agent.
  • the flat agent failed to learn anything significant over a training period of 800 epochs.
  • the SoC model converged after only 200 epochs.
  • In general, for the SoC model, the low-level agent was able to learn quickly due to its small state space and the high-level agent experienced a less sparse reward due to the reduced action selection frequency. For at least these reasons, the SoC model was able to significantly outperform the flat model.
  • FIG. 11 illustrates the effect of varying communication reward on the final performance of the SoC model on a 24 × 24 game of Catch.
  • the results show that if the additional reward is 0, the low-level agent has no incentive to listen to the high-level agent and will act fully independently. Alternatively, if the additional reward is very high, the low-level agent will follow the suggestion of the high-level agent. Because both agents are limited (the high-level agent has a low action-selection frequency and the low-level agent has a limited view), both these situations are undesirable. As illustrated, the ideal low-level agent in the experiment was one that acted neither fully independently nor fully dependently with respect to the high-level agent.
  • FIG. 12 illustrates the effect on the average score over a number of training epochs caused by different action selection intervals (asi) for a high-level agent of the SoC system on an 84 × 84 game of Catch.
  • the intervals tested were every 1, 2, 4, 8, and 16 time steps.
  • an asi of 4 performed the best in the experiment, while an asi of 16 performed the worst over 200 epochs.
  • If the communication is too frequent, the learning speed goes down because, relative to the action selections, the reward appears more sparse, making learning harder.
  • If it is too infrequent, asymptotic performance is reduced because the high-level agent does not have enough control over the low-level agent to move it to approximately the right position.
  • FIG. 13 illustrates the effect of penalizing communication for the high-level agent on the final performance of the system on a 24 × 24 game of Catch.
  • the communication probability shows the fraction of time steps on which the high-level agent sends a communication action. It can be seen in FIG.13 that the system can learn to maintain near optimal performance without the need for constant communication.
  • the decomposition was made a priori, however, it is to be appreciated by a person of skill in the art with the benefit of this description that this is only a non-limiting example.
  • learning the decomposition can also prove to be useful.
  • the pellet distribution is randomized: at the start of each new episode, there is a 50% probability for each position to have a pellet. During an episode, pellets remain fixed until they are eaten by Pac-Boy.
  • the state of the game includes the positions of Pac-Boy, pellets, and ghosts. This results in a state space of
  • the SoC model was tested in this environment, and concerns were separated in the following manner: an agent was assigned to each possible pellet location. This pellet agent receives a reward of 1 only if a pellet at its assigned position is eaten.
  • the pellet agent’s state space includes Pac-Boy’s position, which results in 76 states. A pellet agent is only active when there is a pellet at its assigned position.
  • an agent was assigned to each ghost. This ghost agent receives a reward of −10 if Pac-Boy bumps into its assigned ghost.
  • the ghost agent’s state space includes Pac-Boy’s position and the ghost’s position, resulting in 76² states. Because there are on average 38 pellets, the average number of agents is 40.
  • the first non-SoC baseline was a flat agent that uses the exact same input features as the SoC model. Specifically, the state of each agent of the SoC model was encoded with a one-hot vector and the vectors were concatenated, resulting in a binary feature vector of size 17,252 with about 40 active features per time step. This vector was used for linear function approximation with Q-learning (referred to as Linear Q Learning).
  • Two non-SoC deep reinforcement learning baselines were also considered.
  • FIGS.14A and 14B show the learning speed of the SoC model compared to the DQN-clipped, DQN-scaled, and Linear Q Learning baselines described above.
  • FIG. 14A compares the average scores (higher is better) over a number of epochs for the models, and FIG. 14B compares the average number of steps (lower is better) taken over a number of epochs for the models.
  • One epoch corresponds to 20,000 environmental steps and each curve shows the average performance over 5 random seeds.
  • the upper-bound line in FIG. 14A shows the maximum average score that can be obtained.
  • the SoC model converged to a policy that was very close to the optimal upper bound, while the baselines fell considerably short of the upper bound even after converging.
  • the Linear Q Learning baseline handled the massive state space with no reductions and thus took considerably longer to converge.
  • Although DQN-clipped and DQN-scaled converged to similar final performance, their policies differed significantly, as can be seen in the differing average number of steps taken by each in FIG. 14B.
  • DQN-scaled appeared to be much warier of the high negative reward obtained from being eaten by the ghosts and thus took more steps to eat all of the pellets.
  • pre-training was performed using a random behavior policy. After pre-training, the agents were transferred to the full game and the remaining agents were trained.
  • FIGS. 15A and 15B show the average score and average steps over epochs, respectively, for SoC agents with and without pre-training on Pac-Boy. As can be seen, pre-training boosts performance with respect to average score and average number of steps compared to an agent without pre-training.
  • the learning rate was set to 0.00025 (which was found to be the best learning rate for DQN on Pac-Boy) and then a search was run for the adaptive-normalization rate by searching over the same parameters mentioned above.
  • the settings used for the Catch and Pac-Boy agents and experiments are shown in Table 1 (below).
  • the low-level agent in the Catch experiment used a dense network defined as follows. The input was passed through dense layers both containing 128 units and used rectified non-linear activations. The output was concatenated with the communication action sent by the high level agent represented by a 1-hot vector of size
  • the merged representation is passed through the output layer with a linear activation
  • Multi-advisor reinforcement learning can be a branch of SoC where a single-agent reinforcement learning problem is distributed to n learners called advisors. Each advisor tries to solve the problem from a different angle. Their advice is then communicated to an aggregator, which is in control of the system.
  • Disclosed examples include three off-policy bootstrapping methods: local-max bootstraps with the local greedy action, rand-policy bootstraps with respect to the random policy, and agg-policy bootstraps with respect to the aggregator’s policy.
  • a single-agent reinforcement learning task can be partitioned into a multi-agent problem (e.g., using a divide and conquer paradigm). All agents can be placed at a same level and be given advisory roles that include providing an aggregator with local Q-values for each available action.
  • a multi-advisory model can be a generalization of reinforcement learning with ensemble models, allowing for both the fusion of several weak reinforcement learners and the decomposition of a single-agent reinforcement learning problem into concurrent subtasks.
  • agents are trained independently and greedily to their local optimality, and are aggregated into a global policy by voting or averaging.
  • An attractor is a state where advisors are attracting in every direction equally and where the local-max aggregator’s optimal behavior is to remain static.
  • Disclosed examples include at least two attractor-free, off-policy bootstrapping methods.
  • there is rand-policy bootstrapping which allows for convergence to a fair short-sighted policy.
  • this example favors short-sightedness over long-term planning.
  • there is an agg-policy bootstrapping method that optimizes the system with respect to the global optimal Bellman equation.
  • this example does not guarantee convergence in a general case.
  • a multi-advisor reinforcement learning architecture can greatly speed up learning and converge to a better solution than certain reinforcement learning baselines.
  • a reinforcement learning framework can be formalized as a Markov Decision Process (MDP).
  • A is the action space
  • a trajectory ⁇ is the projection into the MDP
  • a goal is to generate trajectories with high discounted cumulative reward, also called the return: To do so, one needs to
  • FIG. 16 illustrates an example of such an overall multi-advisor architecture 1600, including advisors 1602, an aggregator 1604, and an environment 1606.
  • each advisor 1602 sends its local Q-values q to the aggregator 1604 for all actions in the current state x.
  • the aggregator 1604 is defined with function f that maps the received q values into an action
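One possible aggregator function f is the linear composition that the description analyzes later: the aggregator sums the advisors' local Q-values per action and acts greedily on the result. The sketch below assumes that form; the function names and the optional per-advisor `weights` are hypothetical and are not taken from the description.

```python
def aggregate_linear(advisor_q_values, weights=None):
    """Combine per-advisor Q-values into one value per action by a weighted sum.

    advisor_q_values: one list of action-values per advisor, all of equal length.
    weights: optional per-advisor weights (defaults to 1.0 for every advisor).
    """
    if weights is None:
        weights = [1.0] * len(advisor_q_values)
    totals = [0.0] * len(advisor_q_values[0])
    for weight, q_values in zip(weights, advisor_q_values):
        for action, value in enumerate(q_values):
            totals[action] += weight * value
    return totals


def aggregator_action(advisor_q_values, weights=None):
    """Greedy action with respect to the aggregated values (first index on ties)."""
    totals = aggregate_linear(advisor_q_values, weights)
    return max(range(len(totals)), key=totals.__getitem__)
```

For example, two advisors reporting `[1.0, 0.0, 0.5]` and `[0.0, 2.0, 0.5]` aggregate to `[1.0, 2.0, 1.0]`, so the aggregator selects the second action.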
  • Multi-criteria reinforcement learning can result in segmentation of rewards with a specific aggregating policy. See Gabor et al, Multi-criteria reinforcement learning, In Proceedings of the 15th International Conference on Machine Learning (ICML) (1998).
  • the multi-advisor models fall within SoC, and SoC distributes the responsibilities among several agents that may communicate and have complex relationships, such as master-servant or collaborators-as-equal relationships.
  • the following section transcribes under the multi-advisor reinforcement learning notations the main theoretical result: the stability theorem ensuring, under conditions, that the advisors’ training eventually converges.
  • the agents can play the role of advisors.
  • the role of function f can be to aggregate their recommendations into a policy.
  • These recommendations can be expressed as their value functions ⁇ ⁇ .
  • the local learners may not be able to be trained on-policy if the policy followed by the aggregator does not necessarily correspond to any of their respective locally optimal policies.
  • Multi-advisor reinforcement learning can be interpreted as ensemble learning for reinforcement learning.
  • a boosting algorithm is used in a RL framework, but the boosting is performed upon policies, not RL algorithms. This technique can be seen as a precursor to the policy reuse algorithm rather than Ensemble Learning.
  • several online RL algorithms are combined on several simple RL problems. The mixture models of the five experts generally perform better than any single one alone. The algorithms can include off-policy, on-policy, and actor-critic methods, among others, and this effort can be continued in a very specific setting where actions are explicit and transitions are deterministic.
  • advisors are trained on different reward functions. These are potential-based reward shaping variants of the same reward function and embed the same goals. As a consequence, it can be related to a bagging procedure.
  • the advisors' recommendations are then aggregated under the Horde architecture with local greedy off-policy bootstrapping.
  • This section presents three different local off-policy bootstrapping methods: local-max, rand-policy, and agg-policy. They are presented and analyzed under a linear composition aggregator, but most considerations are also valid with other aggregating functions, such as voting or policy mixtures.
  • Off-policy bootstrapping methods Local-max bootstrapping
  • FIG. 17 illustrates a central state (as illustrated, x) in which the system has three possible actions: stay put (as illustrated, action perform the goal of advisor 1 (as
  • An attractor x is a state where local-max would lead to the aggregator staying in that state, if it had the chance. It verifies the following equation:
  • An advisor j can be monotonous if the following condition is satisfied:
  • Monotony of advisors can be restrictive and most reinforcement learning problems do not fall into that category, even for low γ values.
  • Navigation tasks do not qualify by nature: when the system goes into a direction that is opposite to some goal, it gets into a state that is worse than by staying in position.
  • Monotony also does not apply to RL problems with states that terminate the trajectory although some goals are still incomplete.
  • There are RL problems where all advisors are monotonous, such as resource scheduling where each advisor is responsible for the progression of a given task. Note that a multi-advisor reinforcement learning problem without any attractors does not guarantee optimality. It simply means that the system will continue achieving goals as long as there are any.
  • Off-policy bootstrapping methods Rand-policy bootstrapping
  • the local rand-policy optimization is equivalent to the global rand-policy optimization. As such, it does not suffer from the local attractor issue previously described. However, optimizing the value function with respect to the random policy is in general far from the optimal solution to the global MDP problem.
  • Off-policy bootstrapping methods Agg-policy bootstrapping
  • aggregator policy as the reference.
  • the aggregator is in control, and the advisors are evaluating the current aggregator’s policy f.
  • the aggregator’s policy is dependent on the other advisors, which means that, even though the environment can still be modelled as a MDP, the training procedure is not. Assuming that all advisors jointly converge to their respective local optimal value, denoted by it satisfies the following Bellman
  • the multi-advisor model was evaluated using the Pac-Boy experiment as described above.
  • each advisor was responsible for a specific source of reward (or penalty). More precisely, concerns were separated as follows: an advisor was assigned to each possible pellet location. This advisor receives a reward of 1 only if a pellet at its assigned position gets eaten. Its state space includes Pac-Boy’s position, resulting in 76 states. A pellet advisor is only active when there is a pellet at its assigned position and it is set inactive when its pellet is eaten. In addition, an advisor was assigned to each ghost. This advisor receives a reward of −10 if Pac-Boy bumps into its assigned ghost. Its state space includes Pac-Boy’s position and the ghost’s position, resulting in 76² states. Because there are on average 37.5 pellets, the average number of advisors running at the beginning of each episode is 39.5.
  • time scale was divided into 50 epochs lasting 20,000 transitions each.
  • an evaluation phase was launched for 80 games.
  • Each experimental result is presented along two dimensional performance indicators: the averaged non discounted rewards and the average length of the games.
  • the average non discounted rewards can be seen as the number of points obtained in a game. Its theoretical maximum is 37.5 and the random policy average performance is around -80, which corresponds to being eaten around 10 times by the ghosts.
  • a first baseline was a system that used the exact same input features as the multi-advisor reinforcement learning model. Specifically, the state of each advisor of the multi-advisor reinforcement learning model was encoded with a one-hot vector and all these vectors were concatenated, resulting in a binary feature vector of size 17,252 with about 40 active features per time step. This vector was used for linear function approximation with Q-learning. This baseline is referred to as linear Q-learning.
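The feature-vector size of 17,252 can be checked with a short calculation. The 76 Pac-Boy positions, the 76² ghost-advisor states, the 50% pellet probability and the averages of 37.5 pellets and 39.5 advisors come from the description. The counts of 75 possible pellet locations and 2 ghosts are not stated in this passage; they are inferred from those averages and should be read as assumptions.

```python
GRID_POSITIONS = 76    # positions Pac-Boy can occupy (from the description)
PELLET_LOCATIONS = 75  # assumption: 37.5 average pellets at 50% probability
NUM_GHOSTS = 2         # assumption: 39.5 average advisors minus 37.5 pellets

# One one-hot block of 76 entries per pellet advisor.
pellet_features = PELLET_LOCATIONS * GRID_POSITIONS

# One one-hot block of 76 * 76 entries per ghost advisor.
ghost_features = NUM_GHOSTS * GRID_POSITIONS ** 2

total_features = pellet_features + ghost_features

# Each active advisor contributes exactly one active feature per time step.
expected_active = 0.5 * PELLET_LOCATIONS + NUM_GHOSTS

print(total_features, expected_active)
```

Under these assumptions the pellet advisors contribute 5,700 features and the ghost advisors 11,552, which sums to the 17,252 reported, with 39.5 features active on average at the start of an episode.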
  • The standard DQN algorithm (see, e.g., Mnih et al., above) with reward clipping (referred to as DQN-clipped).
  • The input to both DQN-clipped and DQN-scaled was a 4-channel binary image, where each channel is in the shape of the game grid and represents the positions of one of the following features: the walls, the ghosts, the pellets, or Pac-Boy.
  • Multi-Advisor Model Pac-Boy: Attractor Examples
  • FIG. 18 illustrates an example three-pellet attractor in Pac-Boy.
  • the example three-pellet attractor occurs when the game is in a state with equal distance between Pac-Boy 1802 and three pellets 1804, with Pac-Boy 1802 adjacent to a wall 1806, enabling Pac-Boy to perform a no-op action. Moving towards a pellet 1804 brings Pac-Boy 1802 closer to that pellet 1804 but further from the two other pellets 1804, since diagonal moves are not allowed. Expressing the real value of each action under local-max gives the following results:
  • the aggregator may opt to hit the wall 1806 indefinitely. Optimality is not guaranteed, and in this case, the system behavior would be sub-optimal.
  • FIG. 19 illustrates an example situation in Pac-Boy without a no-op action.
  • the attractors can be encountered in navigation tasks even in settings without any no-op action.
  • the Pac-Boy 1802 is placed in a 2 × 2 square with eight pellets 1804 surrounding Pac-Boy 1802.
  • the action-state values of the aggregator under local-max are:
  • After moving North or West, Pac-Boy 1802 arrives in a state that is symmetrically equivalent to the first one. More generally, in a deterministic navigation task like Pac-Boy where each action can be cancelled by a new action, it can be shown that the condition on γ is a function of the size of the action set A. A more general result on stochastic navigation tasks can be demonstrated.
  • the condition for not being stuck in an attractor set can be related to
  • FIG. 20A illustrates the average score of agg-policy against baselines over a number of epochs.
  • linear Q-learning performs the worst. It benefits from no state space reduction and does not generalize as well as the Deep RL methods.
  • the two other baselines, DQN-clipped and DQN-scaled (DQN-Pop-Art) perform better but do not progress after reaching a reward close to 20.
  • FIG.20B illustrates average episode length against baselines over a number of epochs.
  • the learned policies of DQN-clipped and DQN-scaled (DQN-Pop-Art) are in fact very different.
  • DQN-scaled appears to be much warier of the high negative reward obtained from being eaten by the ghosts and thus takes much more time to eat all the pellets.
  • the agg-policy outperforms the baselines by having a lower number of average steps across the epochs.
  • FIG. 20C illustrates average scores for different methods over a number of epochs.
  • the pellet collection problem is similar to the travelling salesman problem, which is known to be NP-complete.
  • the suboptimal policy of moving towards the closest pellet, which corresponds to a small γ, is in fact a decent one.
  • Regarding ghost avoidance, this is where local-max with a low γ gets its advantage over the other settings: the local optimization provides advantageous control of the system near the ghosts, while with rand-policy and agg-policy, the ghost advisor is uncertain of the aggregator’s next action. As a result, they become more conservative around the ghosts, especially rand-policy, which considers each future action as equally likely.
  • As for agg-policy, even though its performance remains near that of local-max, it still suffers from the fact that the local learners cannot fully make sense of the aggregator’s actions due to their limited state space representations.
  • Other γ values for agg-policy were tested and a value close to 0.4 appeared to work well in this example by providing a good trade-off between the long-term horizon and the noise in the Q-function propagated by high values of γ. More precisely, a smaller γ made the ghost advisors less fearful of the ghosts, which is profitable when collecting the nearby pellets.
  • FIG.21 illustrates average performance for this experiment with noisy rewards.
  • performance was compared for local-max with 4, local- max with ⁇ , agg-policy with ⁇ 0.9, and agg-policy with
  • agg-policy performed better than local-max even under noise with variance 100 times larger.
  • the pellet advisors were able to perceive the pellets that were within a radius dependent on γ, with a lower γ implying a smaller radius.
  • local-max was incompatible with high γ values and was unable to perceive distant pellets.
  • Optimizing with respect to an artificial γ value might converge to policies that are largely suboptimal regarding the true γ value in an objective function.
  • the multi-advisor framework allows for decomposing a single agent reinforcement learning problem into simpler problems tackled by learners called advisors.
  • the advisors can be trained according to different local bootstrapping techniques. Local-max bootstraps with a local greedy action. It can converge but a sum-max inversion causes its optimal policy to be endangered by attractors. Rand-policy bootstraps with respect to the random policy. It can converge and is robust to attractors, but its random bootstrapping can prevent the advisors from planning in an efficient way. Finally, agg-policy bootstraps with respect to the aggregator’s policy. It optimizes the system according to the global Bellman optimality equation, but does not necessarily guarantee convergence.
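The difference between the three bootstrapping techniques lies only in which next-state value each advisor uses in its update target. The sketch below shows one way to write the three targets for a single advisor. It assumes a one-step temporal-difference target, omits terminal-state handling, and uses hypothetical function and parameter names.

```python
def local_max_target(reward, gamma, next_q):
    """Local-max: bootstrap with the advisor's own greedy next action."""
    return reward + gamma * max(next_q)


def rand_policy_target(reward, gamma, next_q):
    """Rand-policy: bootstrap with the mean over next actions (uniform random policy)."""
    return reward + gamma * sum(next_q) / len(next_q)


def agg_policy_target(reward, gamma, next_q, aggregator_next_action):
    """Agg-policy: bootstrap with the action the aggregator would take next."""
    return reward + gamma * next_q[aggregator_next_action]
```

Here `next_q` holds the advisor's local Q-values in the next state. Only the agg-policy target depends on the other advisors, through `aggregator_next_action`, which is consistent with the remark that its convergence is not guaranteed in general.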
  • a challenge in reinforcement learning (RL) is generalization.
  • generalization is achieved by approximating the optimal value function with a low-dimensional representation using a deep network. While this approach works well in many domains, in domains where the optimal value function cannot easily be reduced to a low-dimensional representation, learning can be very slow and unstable.
  • HRA Hybrid Reward Architecture
  • One challenge of RL is to scale methods such that they can be applied to large, real-world problems. Because the state-space of such problems is typically massive, strong generalization is usually required to learn a good policy efficiently. RL techniques can be combined with deep neural networks.
  • DQN Deep Q-Networks
  • a value function predicts expected return, conditioned on a state or state-action pair.
  • an optimal policy can be derived.
  • the generalization behavior of DQN can be achieved by regularization on the model for the optimal value function.
  • the optimal value function is very complex, then learning an accurate low-dimensional representation can be challenging.
  • a new, complementary form of regularization can be applied on the target side.
  • the reward function can be replaced with an alternative reward function that has a smoother optimal value function that still yields a reasonable (though not necessarily optimal) policy, when acting greedily.
  • a key observation behind regularization on the target function is the difference between the performance objective, which specifies what type of behavior is desired, and the learning objective, which provides the feedback signal that modifies an agent’s behavior.
  • In RL, a single reward function often takes on both roles.
  • the reward function that encodes the performance objective might be bad as a learning objective, resulting in slow or unstable learning.
  • a learning objective can be different from the performance objective but still perform well with respect to it.
  • Intrinsic motivation uses the above observation to improve learning in sparse-reward domains. It can achieve this by adding a domain-specific intrinsic reward signal to the reward coming from the environment.
  • an intrinsic reward function is potential-based, which maintains optimality of the resulting policy.
  • a learning objective can be defined based on a different criterion: smoothness of the value function, such that it can easily be represented by a low-dimensional representation. Because of this different goal, adding a potential-based reward function to the original reward function may not be a good strategy, because this typically does not reduce the complexity of the optimal value function.
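For context, potential-based shaping in its standard form from the reward-shaping literature adds γΦ(s') − Φ(s) to the environment reward, where Φ is a potential function over states. The description does not spell out this formula, so the sketch below is an illustration of that standard form only; the function names and the example potential are hypothetical.

```python
def shaped_reward(reward, potential, state, next_state, gamma):
    """Environment reward plus the potential-based shaping term."""
    return reward + gamma * potential(next_state) - potential(state)


def total_shaping(potential, states, gamma=1.0):
    """Sum of shaping terms along a trajectory (no environment reward)."""
    return sum(
        shaped_reward(0.0, potential, s, s_next, gamma)
        for s, s_next in zip(states, states[1:])
    )


# Hypothetical potential: negative distance to a goal at position 10.
def example_potential(state):
    return -abs(state - 10)
```

With γ = 1 the shaping terms telescope to Φ(last state) − Φ(first state). The extra reward therefore changes how quickly feedback arrives rather than which policy is optimal, and it leaves the complexity of the optimal value function largely untouched, which is the limitation noted above.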
  • a strategy for constructing a learning objective can be to decompose the reward function of the environment into n different reward functions.
  • Each reward function can be assigned to a separate reinforcement learning agent. These agents can learn in parallel on the same sample sequence by using off-policy learning (e.g., using a Horde architecture).
  • An aggregator can generate or select an action to take with respect to the environment. This can be referred to as an environment action and can define a set of all possible actions that can be taken with respect to the environment.
  • Each agent can give its values for the actions of the current state to an aggregator. In an example, the aggregator can select one of the received actions as the environment action.
  • the aggregator can combine two or more received action values into a single action-value for each action (for example, by averaging over all agents). Based on these action-values the current action is selected (for example, by taking the greedy action). In another example, the aggregator combines two or more of the received actions to form the environment action (e.g., combining two actions with the highest action-values).
  • the actions or action values received from the agents may but need not necessarily correspond to actions that can be taken with respect to the environment. For example, an environment may define possible actions as: “Up” and “Down”, but there may be a “caution” agent that, rather than describing an action to take,
  • the behavior is defined by a policy
  • the goal of the agent is to find a policy that maximizes the expected return, which is the discounted sum of rewards where the discount factor controls the importance of
  • Each policy ⁇ has a corresponding action-value
  • Model-free methods improve their policy by iteratively improving an estimate of the optimal action-value function using sample-based
  • the target function of the deep network can be regularized by splitting the reward function into n reward functions, weighted by
  • the reward function may be decomposed such that the sub-reward functions depend on a subset of the entire set of state variables. These sub-reward functions may be smooth value functions that are easier to learn. Smooth functions can be simplified in comparison to other value functions and can be described by fewer parameters.
  • each agent has its own reward function, each agent i also has its own Q-value function associated with it: To derive a policy from these multiple
  • an aggregator receives the action-values (i.e., a single value for each action), using the same linear combination as used in the reward decomposition.
  • Disclosed embodiments can be relevant to achieving more efficient convergence to a close-to-optimal policy. In some embodiments, this can be achieved by acting greedily with respect to Q-values of a uniformly random policy. Evaluating a random policy can result in Q-values of individual agents being fully independent of each other, which can result in a smooth value function that can be efficiently learned.
  • This update can be referred to as a local-mean update.
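The update formula itself is not reproduced in this text. As a minimal sketch, assuming a tabular setting and assuming the local-mean target takes the usual form of an expected update under a uniformly random policy (reward plus the discounted mean of the next state's action-values), the update could look as follows. The function name, the step size `alpha`, and the discount `gamma` are illustrative choices, not taken from the source.

```python
import numpy as np


def local_mean_update(q_table, state, action, reward, next_state, terminal,
                      alpha=0.1, gamma=0.9):
    """Move Q(state, action) toward reward + gamma * mean_a' Q(next_state, a')."""
    # Averaging over the next actions evaluates a uniformly random policy
    # (assumed form of the target); terminal states contribute a value of 0.
    bootstrap = 0.0 if terminal else gamma * q_table[next_state].mean()
    target = reward + bootstrap
    q_table[state, action] += alpha * (target - q_table[state, action])
    return target


# Toy usage: two states, two actions.
q = np.zeros((2, 2))
q[1] = [1.0, 3.0]
target = local_mean_update(q, state=0, action=0, reward=1.0, next_state=1,
                           terminal=False, alpha=0.5, gamma=0.5)
```

Because the target averages over next actions instead of taking a maximum, the value learned by one head does not depend on what the other heads prefer.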
  • agents can share multiple lower-level layers of a deep Q-network, the collection of agents can be viewed alternatively as a single agent with multiple heads, with each head producing the action-values of the current state under a different A single
  • Each head can be associated with a different reward function.
  • FIG. 22 illustrates an example single-head architecture having a single reward function:
  • FIG. 23 illustrates an example HRA with multiple heads, each having its own reward function R.
  • the loss function for HRA is:
  • the aggregator's Q-values approximate $Q^*_{\text{HRA}}$.
  • $Q^*_{\text{HRA}}$ is not equal to the optimal value function corresponding to $R_{\text{env}}$.
  • a different aggregation scheme can be used, for example, instead of mean over heads, an aggregator action-value could be defined as the max over heads, or a voting based aggregation scheme could be used.
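The three aggregation schemes named above (mean over heads, max over heads, and voting) can be compared with a short sketch. The function name `aggregate`, the array layout, and the specific voting rule (each head votes for its own greedy action) are assumptions made for illustration only.

```python
import numpy as np


def aggregate(head_q_values, scheme="mean"):
    """Combine per-head action-values of shape (n_heads, n_actions)."""
    q = np.asarray(head_q_values, dtype=float)
    if scheme == "mean":
        return q.mean(axis=0)
    if scheme == "max":
        return q.max(axis=0)
    if scheme == "vote":
        # Assumed voting rule: each head casts one vote for its greedy action.
        votes = np.argmax(q, axis=1)
        return np.bincount(votes, minlength=q.shape[1]).astype(float)
    raise ValueError(f"unknown aggregation scheme: {scheme}")


head_q = [[1.0, 0.0, 0.0],
          [0.0, 2.0, 0.0],
          [0.0, 3.0, 0.0]]
mean_scores = aggregate(head_q, "mean")
max_scores = aggregate(head_q, "max")
vote_scores = aggregate(head_q, "vote")
greedy_action = int(np.argmax(mean_scores))
```

In each case the environment action is then chosen from the aggregated scores, for example greedily as in the last line.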
  • an update target based on the expected State-Action-Reward-State-Action update rule can be used:
  • HRA builds on the Horde architecture.
  • the Horde architecture includes a large number of “demons” that learn in parallel via off-policy learning. Each demon trains a separate general value function (GVF) based on its own policy and pseudo-reward function.
  • GVF general value function
  • a pseudo-reward can be any feature-based signal that encodes useful information.
  • the Horde architecture can focus on building general knowledge about a world encoded via a large number of GVFs.
  • HRA focuses on training separate components of the environment-reward function to achieve a smoother value function to efficiently learn a control policy.
  • HRA can apply multi-objective learning to smooth a value function of a single reward function.
  • Some approaches can be related to options and hierarchical learning.
  • Options are temporally-extended actions that, like HRA’s heads, can be trained in parallel based on their own (intrinsic) reward functions.
  • intrinsic reward function is over.
  • a higher-level agent that uses an option sees it as just another action and evaluates it using its own reward function. Options can yield great speed increases in learning and help substantially with better exploration, but they do not directly make the value function of the higher-level agent less complex.
  • the heads of HRA can represent values, trained with components of the environment reward. Even after training, these values can stay relevant because the aggregator uses the values of all heads to select its action.
  • terminal states are states from which no further reward can be received; they have by definition a value of 0.
  • HRA can refrain from approximating this value by the value network, such that the weights can be fully used to represent the non-terminal states.
  • 3) By using pseudo-reward functions. Instead of updating a head of HRA using a component of the environment reward, it can be updated using a pseudo-reward. In this scenario, each head of HRA represents a GVF. GVFs are more general than value functions based on reward components and they can often be used to learn more efficiently. However, deriving a policy from them requires a more specialized aggregator.
  • the first two types of domain knowledge are not limited to being used only by HRA; they can be used by many different methods. However, because HRA can apply this knowledge to each head individually, it can exploit domain knowledge to a much greater extent.
  • an agent controlling a robot for collecting a number of random pieces of fruit as quickly as possible in a 10x10 grid.
  • the agent starts at a random position.
  • An episode ends after all five pieces of fruit have been eaten, or after 300 steps, whichever comes first.
  • FIG. 24 illustrates an example DQN neural network 2410, HRA neural network 2420, and HRA with pseudo-rewards neural network 2430.
  • the DQN neural network 2410 can include an input layer 2412, one or more hidden layers 2414, and an output layer 2416 used to produce an output 2418.
  • Backpropagation can be used to train the neural network 2410 based on error measured at the output 2418.
  • the HRA neural network 2420 includes an input layer 2422, one or more hidden layers 2424, and a plurality of heads 2426, each with its own reward function.
  • the heads 2426 inform the output 2428 (e.g., using a linear combination).
  • Backpropagation can also be used to train the HRA neural network 2420.
  • Backpropagation can be used to train the neural network 2420 based on error measured at each of the reward function heads 2426. By measuring error at the heads 2426 (e.g., rather than at the output 2428 as in the DQN network 2410), faster learning can occur.
  • the DQN neural network 2410 and the HRA neural network 2420 can have the same network architecture but differ in how the network is updated.
  • a gradient based on can be computed and the gradient
  • the gradient is propagated through the network from the output 2418.
  • the gradient can be propagated from the layer prior to the last layer: the heads 2426.
  • the HRA with pseudo-rewards neural network 2430 can include an input layer 2432, one or more hidden layers 2434, a plurality of heads 2436 with general value functions (as illustrated , mappings 2437 from the results of the generalized
  • As an example of the mapping 2437, consider the fruit-collection example where there can be heads 2426 that provide a reward for reaching a particular location that can have a piece of fruit. The mapping 2437 may be based on whether there actually was a piece of fruit at a current location. If so, the mapping 2437 can provide the value of the general value function for the location. If not, the mapping 2437 can provide an output with a value of zero. In this manner, there can be learning even if there is no fruit at a particular location. For example, the weights of the network 2430 can be updated via backpropagation based on the error of the general value function regardless of whether there is fruit at the location.
  • mappings 2437 can be used to filter out results where the fruit is not there prior to providing the output of the heads 2438, so as to not affect the overall output of the network 2439 (and thus a decision taken by an agent based on the network 2430) while still allowing for training.
  • In the HRA with pseudo-rewards neural network 2430, the heads 2438 are not updated directly. Instead, general value functions learn based on a pseudo-reward. The output of the general value functions can then be used to compute the output of each head 2438.
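The filtering role of the mappings 2437 can be sketched as a simple mask: a head passes through the action-values of its general value function when fruit is present at the associated location and outputs zero otherwise. This is an illustrative sketch only; the function name, the array shapes, and the use of a sum to combine heads are assumptions, not details taken from the source.

```python
import numpy as np


def head_outputs(gvf_q_values, fruit_present):
    """Mask GVF action-values by fruit presence.

    gvf_q_values: shape (n_locations, n_actions), one GVF per location.
    fruit_present: shape (n_locations,), True where fruit is currently present.
    """
    q = np.asarray(gvf_q_values, dtype=float)
    mask = np.asarray(fruit_present, dtype=bool)
    # Locations without fruit contribute zero to the output, while the
    # underlying GVFs can still be trained on their own pseudo-rewards.
    return np.where(mask[:, None], q, 0.0)


gvf_q = [[1.0, 2.0],
         [3.0, 4.0]]
heads = head_outputs(gvf_q, fruit_present=[True, False])
network_output = heads.sum(axis=0)  # assumed combination of the heads
```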
  • FIG. 25A illustrates the results comparing DQN max, DQN max (removed features), HRA mean, and HRA mean (removed features).
  • HRA showed a clear performance boost over DQN by requiring fewer steps, even though the network was identical. Further, adding different forms of domain knowledge caused additional large improvements. Whereas using a network structure enhanced by domain knowledge caused large improvements for HRA, using that same network for DQN resulted in DQN not learning anything at all.
  • FIG. 25B illustrates results comparing tabular HRA GVF, tabular HRA, and HRA mean (removed features). As illustrated, tabular HRA GVF converged to a low number of average steps much more quickly than tabular HRA and HRA mean (removed features).
  • a second domain experiment was performed using the ATARI 2600 game MS. PAC-MAN.
  • In MS. PAC-MAN, the player scores points by reaching pellets in a map while avoiding ghosts.
  • FIGS. 26A–D illustrate the four different maps 2601 in the game.
  • Each of the four different maps 2601 includes a different maze formed by walls 2602. Within the maze are pellets 2604 and power pellets 2606. Ghosts 2608 and bonus fruit 2610 can also appear in the maze.
  • the player controls Ms. Pac-Man 2612 during the game. Points 2614 are scored when Ms. Pac-Man 2612 “eats” (reaches) the pellets 2604 and power pellets 2606. Contact with a ghost 2608 causes Ms. Pac-Man 2612 to lose a life 2616, but eating one of the power pellets 2606 turns ghosts 2608 blue for a short duration, allowing them to be eaten for extra points. Bonus fruit 2610 can be eaten for extra points twice per level. When all pellets 2604 and power pellets 2606 have been eaten, a new map 2601 is started. There are seven different types of fruit 2610, each with a different point value.
  • Table 4: Maps and fruit per level
  • the HRA architecture for this experiment used one head for each pellet, one head for each ghost, one head for each blue ghost, and one head for the fruit. Similar to the fruit collection task, HRA used GVFs that learned the Q-values for reaching a particular location on the map (separate GVFs can be learned for each of the maps in the game). The agent learns part of this representation during training. It started with zero GVFs and zero heads for the pellets. By wandering around the maze, it discovered new map locations it could reach, which resulted in new GVFs being created. Whenever the agent found a pellet at a new location, it created a new head corresponding to the pellet.
  • the Q-values of the head of an object were the Q-values of the GVF that corresponds with the object’s location (e.g., moving objects use a different GVF each time). If an object was not on the screen, its Q-values were zero.
  • Each head i was assigned a weight $w_i$, which could be positive or negative.
  • the weight corresponded to the reward received when the object is eaten.
  • the weights were set to -1,000 because contact with a ghost causes Ms. Pac-Man to lose a life.
  • GVF heads (eaters and avoiders): Ms. Pac-Man’s state was defined by her low-level features: her position on the map and her direction (North, South, East, or West). Depending on the map, there are about 400 positions and 950 states. A GVF was created online for each visited Ms. Pac-Man position. Each GVF was then in charge of determining the value of the random policy of Ms. Pac-Man’s state for getting the pseudo-reward placed on the GVF’s associated position. The GVFs were trained online with off-policy one-step bootstrapping with Thus, the full tabular
  • Aggregator: For each object of the game (e.g., pellets, ghosts, and fruits), the GVF corresponding to its position was activated with a multiplier depending on the object type. Edible objects’ multipliers were consistent with the number of points they grant (e.g., a pellet multiplier was 10, a power pellet multiplier was 50, a fruit multiplier was 200, and a blue-and-edible-ghost multiplier was 1000). A ghost multiplier of –1000 appeared to produce a fair balance between gaining points and not losing a life. Finally, the aggregator summed up all the activated and multiplied GVFs to compute a global score for each of the nine actions and chose the action that maximized it.
  • FIG. 27 illustrates training curves (scores over episodes) for incremental head additions to the HRA. These curves include curve 2701 showing results for an HRA without normalization, exploration, or diversification; curve 2702 showing results for an HRA without normalization or exploration but with diversification; curve 2703 showing results for an HRA with normalization and diversification but without exploration; and curve 2704 showing results for an HRA with normalization, exploration, and diversification.
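The additive aggregator described for the MS. PAC-MAN experiment can be sketched as a weighted sum over the GVFs activated by on-screen objects, followed by an argmax over the nine actions. The multipliers below are the ones stated in the text; the data layout (a dictionary from positions to per-action values), the object type names, and the function name are hypothetical.

```python
import numpy as np

# Multipliers as stated in the text for each object type.
MULTIPLIERS = {
    "pellet": 10.0,
    "power_pellet": 50.0,
    "fruit": 200.0,
    "blue_ghost": 1000.0,
    "ghost": -1000.0,
}


def select_action(objects, gvf_q, n_actions=9):
    """Sum the multiplied GVF action-values of all on-screen objects.

    objects: list of (object_type, position) pairs currently on screen.
    gvf_q: dict mapping a position to an array of n_actions values.
    """
    scores = np.zeros(n_actions)
    for object_type, position in objects:
        if position in gvf_q:  # only positions with a learned GVF contribute
            scores += MULTIPLIERS[object_type] * np.asarray(gvf_q[position], dtype=float)
    return int(np.argmax(scores)), scores


pellet_q = np.zeros(9)
pellet_q[3] = 0.5   # action 3 leads toward the pellet
ghost_q = np.zeros(9)
ghost_q[1] = 0.25   # action 1 leads toward the ghost
gvf_q = {(1, 1): pellet_q, (5, 5): ghost_q}
action, scores = select_action([("pellet", (1, 1)), ("ghost", (5, 5))], gvf_q)
```

In this toy case the pellet pulls the score of action 3 up while the ghost pushes the score of action 1 down, so the aggregator picks action 3.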
  • Curve 2701 on FIG. 27 reveals that an HRA with naïve settings (without normalization, exploration, or diversification) performs relatively poorly because it tends to deterministically repeat a bad trajectory, like a robot hitting a wall continuously.
  • the HRA of curve 2702 builds on the settings of the HRA of curve 2701 by adding a diversification head that addresses the determinism issue.
  • the architecture progressed quickly up to about 10,000 points, but then started regressing.
  • the analysis of the generated trajectories reveals that the system had difficulty finishing levels: when only a few pellets remained on the screen, the aggregator was overwhelmed by ghost avoider values.
  • the regression in score can be explained by the system becoming more averse to ghosts the more it learns, which makes it difficult to finish levels.
  • Score heads normalization: The issue shown in curve 2702 can be addressed by modifying the additive aggregator with a normalization over the score heads between 0 and 1. To fit this new value scale, the ghost multiplier was modified to -10.
  • N is the number of actions taken until now and n(s, a) is the number of times an action a has been performed in state s.
  • This formula replaces the stochastically motivated logarithmic function of an upper confidence bounds approach (see Auer et al.) with a less drastic one that is more compliant with bootstrapping propagation.
  • the targeted exploration head is not necessarily a replacement for a diversification head. Rather, they are complementary: diversification for making each trajectory unique and targeted exploration for prioritized exploration.
  • the HRA of curve 2704 builds on the HRA of curve 2703 by adding targeted exploration.
  • the HRA of curve 2704 reveals that the new targeted exploration head helps exploration and makes the learning faster. This setting constitutes the HRA architecture that will be used in further experiments.
  • Executive memory head: When a human game player maxes out cognitive and physical ability, the player may start to look for favorable situations or even glitches to memorize. This cognitive process can be referred to as executive memory.
  • the executive memory head records every sequence of actions that led to passing a level without any player deaths. Then, when facing the same level, the head gives a very high value to the recorded action, in order to force the aggregator’s selection. Since it does not allow generalization, this head was only employed for the level-passing experiment. An executive memory head can be added to HRA to further improve results.
  • Hybrid Reward Model Experiments: MS. PAC-MAN: Results
  • MS. PAC-MAN is considered one of the hardest games from the ALE benchmark set.
  • ALE is ultimately a fully deterministic environment (it implements pseudo-randomness using a random number generator that always starts with the same seed)
  • both evaluation metrics aim to create randomness in the evaluation in order to rate methods with more generalizing behavior higher.
  • the first metric introduces a mild form of randomness by taking a random number of no-op actions before control is handed over to the learning algorithm (called a “fixed start”).
  • In the case of Ms. Pac-Man, however, the game starts with a certain inactive period that exceeds the maximum number of no-op steps, resulting in the game having a fixed start after all.
  • the second metric selects random starting points along a human trajectory and results in much stronger randomness, and does result in the intended random start evaluation (called a “random start”).
  • Table 5 illustrates final high scores for various methods.
  • the best reported fixed start score comes from STRAW (Vezhnevets et al., 2016); the best reported random start score comes from the Dueling network architecture (Wang et al., 2016).
  • the human fixed start score comes from Mnih et al. (2015); the human random start score comes from Nair et al. (2015).
  • A3C was run in a way intended to reproduce the results of Mnih et al. (2016).
  • the pixel-based environment was a reproduction of the preprocessing and the network except a history of two was used because the steps were twice as long.
  • FIG. 28 compares training curves for HRA, pixel-based A3C baseline, and low-level A3C baseline. The curves reveal that HRA reaches an average score of 25,000 after only 3,000 episodes. This is ten times higher than the A3C baselines after 100,000 episodes, four times higher than the best result in the literature (6,673 for STRAW by Vezhnevets et al., 2016), and 60% higher than human performance.
  • FIG. 29 illustrates a training curve for HRA in the game MS. PAC-MAN smoothed over 100 episodes for the level passing experiment.
  • the curves include a curve showing scores for HRA, pixel-based A3C, and low-level A3C.
  • HRA was able to exploit the weakness of the fixed-start evaluation metric by using executive memory capabilities.
  • the training curve shows that HRA was able to achieve the maximum possible score of 999,990 points in less than 3,000 episodes.
  • the curve is slow in the first stages as the model is trained, but, even though the further levels become more difficult, the level passing speeds up because the HRA is able to take advantage of already knowing the maps.
  • FIG. 31 illustrates training curves for HRA in the game MS. PAC-MAN for
  • Disclosed embodiments relate to, among other things, separating concerns for a single-agent task both analytically, by determining conditions for stable learning, as well as empirically, through evaluation on two domains.
  • By giving an agent a reward function that depends on the communication actions of other agents, it can be made to listen to requests from other agents to different degrees. How well it listens can depend on the specific reward function.
  • An agent can be made to fully ignore other agents, to be fully controlled by other agents, or something in between, where it makes a trade-off between following the request of another agent and ignoring it.
  • An agent that retains some level of independence can in some cases yield strong overall performance.
  • an SoC model can convincingly beat (single-agent) state-of-the-art methods on a challenging domain.
  • SoC model can use domain-specific knowledge to improve performance.
  • RL can be scaled up such that it can be applied in specific real-world systems, for example complex dialogue systems or bot environments. In this context, using domain knowledge to achieve good performance on an otherwise intractable domain is acceptable.
  • SoC is illustrated in at least two specific settings, called action aggregation and ensemble RL. SoC’s expressive power is wider, and other SoC settings are possible.
  • the SoC configuration used in some embodiments included a high-level agent with only communication actions and a low-level agent that only performs environment actions.
  • alternative configurations that use more than two agents can be substituted.
  • the reward function in reinforcement learning often plays a double role: it acts as both the performance objective, specifying what type of behavior is desired, as well as the learning objective, that is, the feedback signal that modifies the agent’s behavior. That these two roles do not always combine well into a single function becomes clear from domains with sparse rewards, where learning can be prohibitively slow.
  • the SoC model addresses this by fully separating the performance objective, including the reward function of the environment, from the learning objectives of the agents, including their reward functions.
  • Disclosed embodiments further relate to a Hybrid Reward Architecture (HRA).
  • HRA Hybrid Reward Architecture
  • One of the strengths of HRA is that it can exploit domain knowledge to a much greater extent than single-head methods. This was shown clearly by the fruit collection task: while removing irrelevant features caused a large improvement in performance for HRA, for DQN no effective learning occurred when provided with the same network architecture. Furthermore, separating the pixel image into multiple binary channels only caused a small improvement in the performance of A3C over learning directly from pixels. This demonstrates that the reason that modern deep RL methods struggle with Ms. Pac-Man is not related to learning from pixels; the underlying issue is that the optimal value function for Ms. Pac-Man cannot easily be mapped to a low-dimensional representation.
  • HRA performs well in the MS. PAC-MAN experiment, in part, by learning close to 1800 general value functions. This results in an exponential breakdown of the problem size: whereas the input state-space corresponding with the binary channels is in the order of $10^{77}$, each GVF has a state-space in the order of $10^{3}$ states, small enough to be represented without function approximation. While a deep network for representing each GVF could have been used, using a deep network for such small problems can hurt more than it helps, as evidenced by the experiments on the fruit collection domain.
  • FIG.32 illustrates an example process 2200 for taking an action with respect to a task using separation of concerns.
  • the process 2200 can begin with the flow moving to operation 2202, which involves obtaining the task. Following operation 2202, the flow can move to operation 2204, which involves decomposing the task into a plurality of agents. Following operation 2204, the flow can move to operation 2206, which involves training the plurality of agents. Following operation 2206, the flow can move to operation 2208, which involves taking an action with respect to the task based on the agents.
  • FIG. 33 illustrates an example separation of concerns engine 2300 implementing a process 2301 for completing a task using separation of concerns.
  • the process can begin with the flow moving to operation 2302, which involves obtaining agents. Following operation 2302, the flow can move to operation 2304, which involves obtaining a task. Following operation 2304, the flow can move to operation 2306 and then operation 2308. Operation 2306 involves observing a portion of the state space of the task. Operation 2308 involves selecting an action. Operations 2306 and 2308 can be performed for each agent. Following operation 2306 and operation 2308, the flow can move to operation 2310, which involves selecting an action from the actions selected with each agent. Following operation 2310, the flow can move to operation 2312, which involves performing the selected action with respect to the task. If the task is complete following the action, the method can end. If the task is not complete, the flow can return to operation 2306 where a portion of an updated state space of the task is observed.
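The control flow of process 2301 (each agent observes a portion of the state, each agent selects an action, one action is selected from the proposals, and the loop repeats until the task is complete) can be sketched on a toy task. Everything task-specific here is hypothetical: the grid goal, the two hand-written agents, and the aggregation rule that takes the first proposal other than "stay".

```python
GOAL = (2, 3)  # hypothetical task: reach this (x, y) position


def observe(state, agent_index):
    """Each agent observes only one coordinate of the state (operation 2306)."""
    return state[agent_index]


def horizontal_agent(x):
    return "right" if x < GOAL[0] else "stay"


def vertical_agent(y):
    return "up" if y < GOAL[1] else "stay"


def aggregate(proposals):
    """Select one action from the agents' proposals (operation 2310)."""
    for proposal in proposals:
        if proposal != "stay":
            return proposal
    return "stay"


def step(state, action):
    """Perform the selected action with respect to the task (operation 2312)."""
    x, y = state
    if action == "right":
        x += 1
    elif action == "up":
        y += 1
    new_state = (x, y)
    return new_state, new_state == GOAL


def run_task(agents, state, max_steps=100):
    steps = 0
    done = state == GOAL
    while not done and steps < max_steps:
        # Operations 2306 and 2308 are performed for each agent.
        proposals = [agent(observe(state, i)) for i, agent in enumerate(agents)]
        state, done = step(state, aggregate(proposals))
        steps += 1
    return state, steps


final_state, n_steps = run_task([horizontal_agent, vertical_agent], (0, 0))
```

Neither agent sees the full state, yet the combined behavior completes the task, which is the point of separating concerns.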
  • FIG.34 illustrates an example hybrid reward engine 3100, including a process 3101 for selecting an action to take in an environment based on a hybrid reward.
  • the process 3101 can begin with operation 3102, which involves obtaining a reward function associated with an environment.
  • operation 3104 which involves splitting the reward function into n reward functions weighted by w.
  • operation 3106 which involves training separate reinforcement learning (RL) agents on each reward function.
  • operation 3108 which involves using trained agents to select an action to take in the environment.
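Operations 3102 through 3108 can be illustrated end to end on a very small problem. The sketch below is built on assumptions: the environment is a hypothetical five-state corridor, the environment reward is split by hand into two components with weights `WEIGHTS`, and tabular Q-learning with uniformly random exploration stands in for whichever RL method is actually used to train each agent.

```python
import random

N_STATES, LEFT, RIGHT = 5, 0, 1
TERMINALS = {0, N_STATES - 1}
WEIGHTS = [1.0, 0.2]  # assumed weights w of the two reward components


def step(state, action):
    return state - 1 if action == LEFT else state + 1


def reward_components(next_state):
    """Operation 3104: R_env = WEIGHTS[0] * R_0 + WEIGHTS[1] * R_1."""
    return [1.0 if next_state == N_STATES - 1 else 0.0,  # R_0: right end
            1.0 if next_state == 0 else 0.0]             # R_1: left end


def train_heads(n_updates=2000, alpha=0.5, gamma=0.9, seed=0):
    """Operation 3106: train one tabular agent per reward component."""
    rng = random.Random(seed)
    q = [[[0.0, 0.0] for _ in range(N_STATES)] for _ in WEIGHTS]
    for _ in range(n_updates):
        s = rng.randrange(1, N_STATES - 1)   # random non-terminal state
        a = rng.choice((LEFT, RIGHT))        # uniformly random behavior
        s_next = step(s, a)
        for k, r in enumerate(reward_components(s_next)):
            bootstrap = 0.0 if s_next in TERMINALS else gamma * max(q[k][s_next])
            q[k][s][a] += alpha * (r + bootstrap - q[k][s][a])
    return q


def select_action(q, state):
    """Operation 3108: act greedily on the weighted sum of the agents' values."""
    totals = [sum(w * q[k][state][a] for k, w in enumerate(WEIGHTS))
              for a in (LEFT, RIGHT)]
    return max((LEFT, RIGHT), key=lambda a: totals[a])


q = train_heads()
policy = [select_action(q, s) for s in range(1, N_STATES - 1)]
```

Each agent only ever sees its own reward component; the weights are applied once, at aggregation time, using the same linear combination as the decomposition.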
  • FIG. 35, FIG. 36, FIG. 37 and the associated descriptions provide a discussion of a variety of operating environments in which aspects of the disclosure may be practiced.
  • the devices and systems illustrated and discussed with respect to FIGS. 35–37 are for purposes of example and illustration and are not limiting of a vast number of computing device configurations that may be utilized for practicing aspects of the disclosure, as described herein.
  • FIG. 35 is a block diagram illustrating physical components (e.g., hardware) of a computing device 2400 with which aspects of the disclosure may be practiced.
  • the computing device components described below may have computer executable instructions for implementing the separation of concerns engine 2300 and the hybrid reward engine 3100, among other aspects disclosed herein.
  • the computing device 2400 may include at least one processing unit 2402 (e.g., a central processing unit) and system memory 2404.
  • the system memory 2404 can comprise, but is not limited to, volatile storage (e.g., random access memory), non-volatile storage (e.g., read-only memory), flash memory, or any combination of such memories.
  • the system memory 2404 may include one or more agents 2406 and training data 2407.
  • the training data 2407 may include data used to train the agents 2406.
  • the system memory 2404 may include an operating system 2405 suitable for running the separation of concerns engine 2300 or one or more aspects described herein.
  • the operating system 2405, for example, may be suitable for controlling the operation of the computing device 2400.
  • Embodiments of the disclosure may be practiced in conjunction with a graphics library, a machine learning library, other operating systems, or any other application program and are not limited to any particular application or system.
  • a basic configuration 2410 is illustrated in FIG. 35 by those components within a dashed line.
  • the computing device 2400 may have additional features or functionality.
  • the computing device 2400 may also include additional data storage devices (removable and/or non-removable) such as, for example, magnetic disks, optical disks, or tape.
  • additional storage is illustrated in FIG. 35 by a removable storage device 2409 and a non-removable storage device 2411.
  • program modules 2408 may perform processes including, but not limited to, the aspects, as described herein.
  • Other program modules may also be used in accordance with aspects of the present disclosure.
  • embodiments of the disclosure may be practiced in an electrical circuit comprising discrete electronic elements, packaged or integrated electronic chips containing logic gates, a circuit utilizing a microprocessor, or on a single chip containing electronic elements or microprocessors.
  • embodiments of the disclosure may be practiced via a system-on-a-chip where each or many of the components illustrated in FIG.35 may be integrated onto a single integrated circuit.
  • Such a system-on-a-chip device may include one or more processing units, graphics units, communications units, system virtualization units and various application functionality all of which are integrated (or “burned”) onto the chip substrate as a single integrated circuit.
  • the functionality, described herein, with respect to the capability of a client to switch protocols may be operated via application-specific logic integrated with other components of the computing device 2400 on the single integrated circuit (chip).
  • Embodiments of the disclosure may also be practiced using other technologies capable of performing logical operations such as, for example, AND, OR, and NOT, including but not limited to mechanical, optical, fluidic, and quantum technologies.
  • embodiments of the disclosure may be practiced within a general purpose computer or in any other circuits or systems.
  • the computing device 2400 may also have one or more input device(s) 2412 such as a keyboard, a mouse, a pen, a sound or voice input device, a touch or swipe input device, and other input devices.
  • Output device(s) 2414 such as a display, speakers, a printer, actuators, and other output devices may also be included.
  • the aforementioned devices are examples and others may be used.
  • the computing device 2400 may include one or more communication connections 2416 allowing communications with other computing devices 2450. Examples of suitable communication connections 2416 include, but are not limited to, radio frequency transmitter, receiver, and/or transceiver circuitry; universal serial bus (USB), parallel, and/or serial ports.
  • Computer readable media may include computer storage media.
  • Computer storage media may include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information, such as computer readable instructions, data structures, or program modules 2408.
  • the system memory 2404, the removable storage device 2409, and the non-removable storage device 2411 are all computer storage media examples (e.g., memory storage).
  • Computer storage media may include RAM, ROM, electrically erasable read-only memory (EEPROM), flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other article of manufacture which can be used to store information and which can be accessed by the computing device 2400. Any such computer storage media may be part of the computing device 2400. Computer storage media does not include a carrier wave or other propagated or modulated data signal.
  • Communication media may be embodied by computer readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave or other transport mechanism, and includes any information delivery media.
  • modulated data signal may describe a signal that has one or more characteristics set or changed in such a manner as to encode information in the signal.
  • communication media may include wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, radio frequency, infrared, and other wireless media.
  • FIGS. 36A and 36B illustrate a mobile computing device 500, for example, a mobile telephone, a smart phone, wearable computer (such as a smart watch), a tablet computer, a laptop computer, and the like, with which embodiments of the disclosure may be practiced.
  • the client may be a mobile computing device.
  • With reference to FIG. 36A, one aspect of a mobile computing device 500 for implementing the aspects is illustrated.
  • the mobile computing device 500 is a handheld computer having both input elements and output elements.
  • the mobile computing device 500 typically includes a display 505 and one or more input buttons 510 that allow the user to enter information into the mobile computing device 500.
  • the display 505 of the mobile computing device 500 may also function as an input device (e.g., a touch screen display). If included, an optional side input element 515 allows further user input.
  • the side input element 515 may be a rotary switch, a button, or any other type of manual input element.
  • mobile computing device 500 may incorporate more or fewer input elements.
  • the display 505 may not be a touch screen in some embodiments.
  • the mobile computing device 500 is a portable phone system, such as a cellular phone.
  • the mobile computing device 500 may also include an optional keypad 535.
  • Optional keypad 535 may be a physical keypad or a “soft” keypad generated on the touch screen display.
  • the output elements include the display 505 for showing a graphical user interface (GUI), a visual indicator 520 (e.g., a light emitting diode), and/or an audio transducer 525 (e.g., a speaker).
  • GUI graphical user interface
  • the mobile computing device 500 incorporates a vibration transducer for providing the user with tactile feedback.
  • the mobile computing device 500 incorporates input and/or output ports, such as an audio input (e.g., a microphone jack), an audio output (e.g., a headphone jack), and a video output (e.g., an HDMI port) for sending signals to or receiving signals from an external device.
  • FIG. 36B is a block diagram illustrating the architecture of one aspect of a mobile computing device. That is, the mobile computing device 500 can incorporate a system (e.g., an architecture) 502 to implement some aspects.
  • the system 502 is implemented as a “smart phone” capable of running one or more applications (e.g., browser, e-mail, calendaring, contact managers, messaging clients, games, and media clients/players).
  • the system 502 is integrated as a computing device, such as an integrated personal digital assistant (PDA) and wireless phone.
  • One or more application programs 566 may be loaded into the memory 562 and run on or in association with the operating system 564. Examples of the application programs include phone dialer programs, e-mail programs, personal information management (PIM) programs, word processing programs, spreadsheet programs, Internet browser programs, messaging programs, and so forth.
  • The system 502 also includes a non-volatile storage area 568 within the memory 562.
  • The non-volatile storage area 568 may be used to store persistent information that should not be lost if the system 502 is powered down.
  • The application programs 566 may use and store information in the non-volatile storage area 568, such as email or other messages used by an email application, and the like.
  • A synchronization application (not shown) also resides on the system 502 and is programmed to interact with a corresponding synchronization application resident on a host computer to keep the information stored in the non-volatile storage area 568 synchronized with corresponding information stored at the host computer.
  • Other applications may be loaded into the memory 562 and run on the mobile computing device 500, including the instructions for determining relationships between users, as described herein.
  • The system 502 has a power supply 570, which may be implemented as one or more batteries.
  • The power supply 570 may further include an external power source, such as an AC adapter or a powered docking cradle that supplements or recharges the batteries.
  • The system 502 may also include a radio interface layer 572 that performs the function of transmitting and receiving radio frequency communications.
  • The radio interface layer 572 facilitates wireless connectivity between the system 502 and the “outside world,” via a communications carrier or service provider. Transmissions to and from the radio interface layer 572 are conducted under control of the operating system 564. In other words, communications received by the radio interface layer 572 may be disseminated to the application programs 566 via the operating system 564, and vice versa.
  • The visual indicator 520 may be used to provide visual notifications, and/or an audio interface 574 may be used for producing audible notifications via an audio transducer 525 (e.g., audio transducer 525 illustrated in FIG. 5A).
  • The visual indicator 520 is a light emitting diode (LED) and the audio transducer 525 may be a speaker.
  • The LED may be programmed to remain on indefinitely until the user takes action to indicate the powered-on status of the device.
  • The audio interface 574 is used to provide audible signals to and receive audible signals from the user.
  • The audio interface 574 may also be coupled to a microphone to receive audible input, such as to facilitate a telephone conversation.
  • The microphone may also serve as an audio sensor to facilitate control of notifications, as will be described below.
  • The system 502 may further include a video interface 576 that enables operation of a peripheral device 530 (e.g., an on-board camera) to record still images, video streams, and the like. The audio interface 574, video interface 576, and keypad 535 may be operated to generate one or more messages as described herein.
  • A mobile computing device 500 implementing the system 502 may have additional features or functionality.
  • The mobile computing device 500 may also include additional data storage devices (removable and/or non-removable) such as magnetic disks, optical disks, or tape.
  • Such additional storage is illustrated in FIG. 5B by the non-volatile storage area 568.
  • Data/information generated or captured by the mobile computing device 500 and stored via the system 502 may be stored locally on the mobile computing device 500, as described above, or the data may be stored on any number of storage media that may be accessed by the device via the radio interface layer 572 or via a wired connection between the mobile computing device 500 and a separate computing device associated with the mobile computing device 500, for example, a server computer in a distributed computing network, such as the Internet.
  • Data/information may be accessed via the mobile computing device 500 via the radio interface layer 572 or via a distributed computing network.
  • Data/information may be readily transferred between computing devices for storage and use according to well-known data/information transfer and storage means, including electronic mail and collaborative data/information sharing systems.
  • FIGS. 33A and 33B are described for purposes of illustrating the present methods and systems and are not intended to limit the disclosure to a particular sequence of steps or a particular combination of hardware or software components.
  • FIG. 37 illustrates one aspect of the architecture of a system for processing data received at a computing system from a remote source, such as a general computing device 604 (e.g., personal computer), tablet computing device 606, or mobile computing device 608, as described above.
  • Content displayed at server device 602 may be stored in different communication channels or other storage types.
  • Various messages may be received and/or stored using a directory service 622, a web portal 624, a mailbox service 626, an instant messaging store 628, or a social networking service 630.
  • The program modules 2408 may be employed by a client that communicates with server device 602, and/or the program modules 2408 may be employed by server device 602.
  • The server device 602 may provide data to and from a client computing device such as a general computing device 604, a tablet computing device 606 and/or a mobile computing device 608 (e.g., a smart phone) through a network 615.
  • The aspects described herein may be embodied in a general computing device 604 (e.g., personal computer), a tablet computing device 606 and/or a
  • FIG. 37 is described for purposes of illustrating the present methods and systems and is not intended to limit the disclosure to a particular sequence of steps or a particular combination of hardware or software components.
  • The embodiments of the invention described herein are implemented as logical steps in one or more computer systems.
  • The logical operations of the present invention are implemented (1) as a sequence of processor-implemented steps executing in one or more computer systems and (2) as interconnected machine or circuit modules within one or more computer systems.
  • The implementation is a matter of choice, dependent on the performance requirements of the computer system implementing the invention. Accordingly, the logical operations making up the embodiments of the invention described herein are referred to variously as operations, steps, objects, or modules.
  • Logical operations may be performed in any order, unless explicitly claimed otherwise or a specific order is inherently necessitated by the claim language.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Algebra (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Feedback Control In General (AREA)

Abstract

Aspects of the present disclosure relate to machine learning techniques, including decomposing single-agent reinforcement learning problems into simpler problems that are handled by multiple agents. The actions proposed by the multiple agents are then aggregated using an aggregator, which selects an action to take with respect to an environment. Aspects of the present disclosure also relate to a hybrid reward model.
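
The decomposition summarized in the abstract can be illustrated with a minimal Python sketch. It is not the claimed implementation. Everything concrete in it is an assumption made for illustration: the five-cell corridor environment, the two reward components, the names `step`, `ComponentAgent`, `Aggregator`, and `train`, tabular Q-learning as the per-agent learner, and summation of action values as the aggregation rule.

```python
import random
from collections import defaultdict

N_ACTIONS = 2                      # 0 = move left, 1 = move right
START, LEFT_END, RIGHT_END = 2, 0, 4


def step(state, action):
    """Hypothetical corridor: returns (next_state, per-component rewards, done)."""
    nxt = state + (1 if action == 1 else -1)
    rewards = (
        1.0 if nxt == RIGHT_END else 0.0,    # component 0: reach the goal cell
        -1.0 if nxt == LEFT_END else 0.0,    # component 1: avoid the hazard cell
    )
    return nxt, rewards, nxt in (LEFT_END, RIGHT_END)


class ComponentAgent:
    """Tabular Q-learner that only ever sees one reward component."""

    def __init__(self, alpha=0.5, gamma=0.9):
        self.q = defaultdict(lambda: [0.0] * N_ACTIONS)
        self.alpha, self.gamma = alpha, gamma

    def update(self, s, a, r, s_next, done):
        # Standard Q-learning target, bootstrapping from this agent's own values.
        target = r if done else r + self.gamma * max(self.q[s_next])
        self.q[s][a] += self.alpha * (target - self.q[s][a])


class Aggregator:
    """Sums the agents' action values and selects the highest-scoring action."""

    def __init__(self, agents):
        self.agents = agents

    def combined_values(self, s):
        return [sum(agent.q[s][a] for agent in self.agents)
                for a in range(N_ACTIONS)]

    def select_action(self, s):
        values = self.combined_values(s)
        return max(range(N_ACTIONS), key=values.__getitem__)


def train(episodes=500, seed=0):
    """Train one agent per reward component, then wrap them in an aggregator."""
    rng = random.Random(seed)
    agents = [ComponentAgent(), ComponentAgent()]
    for _ in range(episodes):
        s, done = START, False
        while not done:
            a = rng.randrange(N_ACTIONS)     # uniform exploration (Q-learning is off-policy)
            s_next, rewards, done = step(s, a)
            for agent, r in zip(agents, rewards):
                agent.update(s, a, r, s_next, done)
            s = s_next
    return Aggregator(agents)
```

Calling `train()` returns an aggregator whose `select_action(state)` picks the single action taken in the environment. In this toy setup the summed values end up favoring a move toward the goal cell in every non-terminal state. Other aggregation rules, such as weighted sums or voting, would fit the same interface.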
EP18723249.1A 2017-05-18 2018-04-21 Hybrid reward architecture for reinforcement learning Withdrawn EP3625731A1 (fr)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US201762508340P 2017-05-18 2017-05-18
US201762524461P 2017-06-23 2017-06-23
US15/634,914 US10977551B2 (en) 2016-12-14 2017-06-27 Hybrid reward architecture for reinforcement learning
PCT/US2018/028743 WO2018212918A1 (fr) 2017-05-18 2018-04-21 Hybrid reward architecture for reinforcement learning

Publications (1)

Publication Number Publication Date
EP3625731A1 (fr) 2020-03-25

Family

ID=64274554

Family Applications (1)

Application Number Title Priority Date Filing Date
EP18723249.1A Withdrawn EP3625731A1 (fr) Hybrid reward architecture for reinforcement learning 2017-05-18 2018-04-21

Country Status (2)

Country Link
EP (1) EP3625731A1 (fr)
WO (1) WO2018212918A1 (fr)

Families Citing this family (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
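JP6982557B2 (ja) * 2018-08-31 2021-12-17 株式会社日立製作所 Reward function generation method and computer system
CN109635913A (zh) * 2018-12-16 2019-04-16 北京工业大学 Soccer system simulation method using an adaptive-greedy Q-learning algorithm
CN109741626B (zh) * 2019-02-24 2023-09-29 苏州科技大学 Parking lot occupancy prediction method, scheduling method, and system
CN110211572B (zh) * 2019-05-14 2021-12-10 北京来也网络科技有限公司 Dialogue control method and device based on reinforcement learning
CN110738356A (zh) * 2019-09-20 2020-01-31 西北工业大学 Intelligent electric vehicle charging scheduling method based on an SDN-enhanced network
FR3102277B1 (fr) * 2019-10-17 2021-09-17 Continental Automotive Dataset preparation for multi-agent machine learning
CN112820361B (zh) * 2019-11-15 2023-09-22 北京大学 Drug molecule generation method based on adversarial imitation learning
CN111062491A (zh) * 2019-12-13 2020-04-24 周世海 Reinforcement-learning-based method for an agent to explore an unknown environment
CN110928329B (zh) * 2019-12-24 2023-05-02 北京空间技术研制试验中心 Multi-aircraft trajectory planning method based on a deep Q-learning algorithm
CN111260072A (zh) * 2020-01-08 2020-06-09 上海交通大学 Reinforcement learning exploration method based on generative adversarial networks
US11663522B2 (en) * 2020-04-27 2023-05-30 Microsoft Technology Licensing, Llc Training reinforcement machine learning systems
CN111369181B (zh) * 2020-06-01 2020-09-29 北京全路通信信号研究设计院集团有限公司 Deep reinforcement learning method and device for autonomous train scheduling
CN113799949B (zh) * 2020-06-11 2022-07-26 中国科学院沈阳自动化研究所 Q-learning-based AUV buoyancy adjustment method
CN112084721A (zh) * 2020-09-23 2020-12-15 浙江大学 Reward function modeling method for cooperative multi-agent reinforcement learning tasks
CN112331277B (zh) * 2020-10-28 2022-06-21 星药科技(北京)有限公司 Reinforcement-learning-based drug molecule generation method with a controllable path
CN112884066A (zh) * 2021-03-15 2021-06-01 网易(杭州)网络有限公司 Data processing method and device
CN112991544A (zh) * 2021-04-20 2021-06-18 山东新一代信息产业技术研究院有限公司 Crowd evacuation behavior simulation method based on panoramic image modeling
CN113191484B (zh) * 2021-04-25 2022-10-14 清华大学 Intelligent federated learning client selection method and system based on deep reinforcement learning
CN113254200B (zh) * 2021-05-13 2023-06-09 中国联合网络通信集团有限公司 Resource orchestration method and agent
CN113239639B (zh) * 2021-06-29 2022-08-26 暨南大学 Policy information generation method, device, electronic device, and storage medium
CN113486949B (zh) * 2021-07-02 2023-03-24 江苏罗思韦尔电气有限公司 Occluded target detection method and device based on YOLO v4 progressive localization
CN113704425A (zh) * 2021-08-27 2021-11-26 广东电力信息科技有限公司 Dialogue policy optimization method combining knowledge enhancement and deep reinforcement learning
CN113723013A (zh) * 2021-09-10 2021-11-30 中国人民解放军国防科技大学 Multi-agent decision-making method for continuous-space wargaming
CN114066071A (zh) * 2021-11-19 2022-02-18 厦门大学 Energy-consumption-based power parameter optimization method, terminal device, and storage medium
CN114083539B (zh) * 2021-11-30 2022-06-14 哈尔滨工业大学 Anti-interference motion planning method for a robotic arm based on multi-agent reinforcement learning
CN114492845B (zh) * 2022-04-01 2022-07-15 中国科学技术大学 Method for improving reinforcement learning exploration efficiency under resource-constrained conditions
CN115190079B (zh) * 2022-07-05 2023-09-15 吉林大学 Hierarchical-reinforcement-learning-based integrated self-powered sensing and communication interaction method for high-speed rail
CN116384469B (zh) * 2023-06-05 2023-08-08 中国人民解放军国防科技大学 Agent policy generation method, device, computer equipment, and storage medium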

Also Published As

Publication number Publication date
WO2018212918A1 (fr) 2018-11-22

Similar Documents

Publication Publication Date Title
US10977551B2 (en) Hybrid reward architecture for reinforcement learning
EP3625731A1 (fr) Hybrid reward architecture for reinforcement learning
Gronauer et al. Multi-agent deep reinforcement learning: a survey
Ladosz et al. Exploration in deep reinforcement learning: A survey
Aubret et al. A survey on intrinsic motivation in reinforcement learning
Wang et al. Active model learning and diverse action sampling for task and motion planning
Santos et al. Dyna-H: A heuristic planning reinforcement learning algorithm applied to role-playing game strategy decision systems
Lang et al. Planning with noisy probabilistic relational rules
US8112369B2 (en) Methods and systems of adaptive coalition of cognitive agents
Fang et al. Dynamics learning with cascaded variational inference for multi-step manipulation
Berkenkamp Safe exploration in reinforcement learning: Theory and applications in robotics
Andersen et al. Towards safe reinforcement-learning in industrial grid-warehousing
Huang et al. A novel policy based on action confidence limit to improve exploration efficiency in reinforcement learning
US11850752B2 (en) Robot movement apparatus and related methods
Liu et al. Forward-looking imaginative planning framework combined with prioritized-replay double DQN
Arulkumaran Sample Efficiency, Transfer Learning and Interpretability for Deep Reinforcement Learning
Hutsebaut-Buysse Learning to navigate through abstraction and adaptation
Sasikumar Exploration in feature space for reinforcement learning
Veeriah Discovery in Reinforcement Learning
Girard et al. A robust approach to robot team learning
Luz et al. Multi-Agent Deep Reinforcement Learning for Resource Management in Earth Observation Satellite Constellations
Tang Towards Informed Exploration for Deep Reinforcement Learning
Aubret Learning increasingly complex skills through deep reinforcement learning using intrinsic motivation
Furuyama et al. Extrinsicaly Rewarded Soft Q Imitation Learning with Discriminator
Elli Galata Evolving cooperation in multi-agent systems

Legal Events

Date Code Title Description
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: UNKNOWN

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE

PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE

17P Request for examination filed

Effective date: 20191024

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

AX Request for extension of the european patent

Extension state: BA ME

DAV Request for validation of the european patent (deleted)
DAX Request for extension of the european patent (deleted)
RAP3 Party data changed (applicant data changed or rights of an application transferred)

Owner name: MICROSOFT TECHNOLOGY LICENSING, LLC

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: EXAMINATION IS IN PROGRESS

17Q First examination report despatched

Effective date: 20211209

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION HAS BEEN WITHDRAWN

18W Application withdrawn

Effective date: 20220330