WO2018212918A1 - Hybrid reward architecture for reinforcement learning - Google Patents

Hybrid reward architecture for reinforcement learning

Info

Publication number
WO2018212918A1
Authority
WO
WIPO (PCT)
Prior art keywords
agent
reward
action
agents
learning
Prior art date
Application number
PCT/US2018/028743
Other languages
English (en)
Inventor
Harm Hendrik Van Seijen
Seyed Mehdi FATEMI BOOSHEHRI
Romain Michel Henri Laroche
Joshua Samuel Romoff
Original Assignee
Microsoft Technology Licensing, Llc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US15/634,914 external-priority patent/US10977551B2/en
Application filed by Microsoft Technology Licensing, Llc filed Critical Microsoft Technology Licensing, Llc
Priority to EP18723249.1A priority Critical patent/EP3625731A1/fr
Publication of WO2018212918A1 publication Critical patent/WO2018212918A1/fr

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/004 Artificial life, i.e. computing arrangements simulating life
    • G06N3/006 Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00 Computing arrangements based on specific mathematical models
    • G06N7/01 Probabilistic graphical models, e.g. probabilistic networks

Definitions

  • a challenge in RL is generalization. In traditional deep RL methods this is achieved by approximating the optimal value function with a low-dimensional representation using a deep network. While this approach works well in some domains, in domains where the optimal value function cannot easily be reduced to a low-dimensional representation, learning can be very slow and unstable.
  • a framework for solving a single-agent task by using multiple agents, each focusing on different aspects of the task, is provided.
  • This approach has at least the following advantages: 1) it allows for specialized agents for different parts of the task, and 2) it provides a new way to transfer knowledge, by transferring trained agents.
  • the framework generalizes the traditional hierarchical decomposition, in which, at any moment in time, a single agent has control until it has solved its particular subtask.
  • a framework is provided for communicating agents that aims to generalize the traditional hierarchical decomposition and allow for more flexible task decompositions.
  • This includes decompositions where multiple subtasks have to be solved in parallel, or cases where a subtask does not have a well-defined end but rather is a continuing process that needs constant adjustment (e.g., walking through a crowded street).
  • This framework can be referred to as a separation-of-concerns framework.
  • a reward function for a specific agent is provided that not only has a component depending on the environment state, but also a component depending on the communication actions of the other agents.
  • Depending on the specific mixture of these components, agents have different degrees of independence.
  • Because the reward in general is state-specific, an agent can show different levels of dependence in different parts of the state-space.
  • In areas with high environment-reward, an agent will act independently of the communication actions of other agents; while in areas with low environment-reward, an agent’s policy will depend strongly on the communication actions of other agents.
  • the framework can be seen as a sequential multi-agent decision making system with non-cooperative agents. This is a challenging setting, because from the perspective of one agent, the environment is non-stationary due to the learning of other agents. This challenge is addressed by defining trainer agents with a fixed policy. Learning with these trainer agents can occur, for example, by pre-training agents and then freezing their policy, or by learning in parallel using off-policy learning.
  • Disclosed embodiments further relate to improvements to machine learning and, in particular, reinforcement learning.
  • a hybrid reward architecture that takes as input a decomposed reward function and learns a separate value function for each component reward function. Because each component typically depends only on a subset of all features, the overall value function is much smoother and can be more easily approximated by a low-dimensional representation, enabling more effective learning.
  • FIG.1 illustrates an example scenario involving a robot reaching pieces of fruit scattered across a grid.
  • FIG.2 illustrates an example separation of concern model for two agents.
  • FIG.3 illustrates an example generalized decomposition of a single-agent task using n agents.
  • FIG. 4 illustrates subclasses of agents, including fully independent agents, agents with an acyclic relationship, agents with a cyclic relationship, and an acyclic relationship with trainer agents to break cycles in cyclic dependency graphs.
  • FIG.5 illustrates a falling fruit example scenario.
  • FIG.6 illustrates an example application of a separation of concerns model on a tabular domain.
  • FIG. 7 illustrates learning behavior for tasks with different levels of complexity.
  • FIG.8 illustrates an average return over 4,000 episodes for a different number of no-op actions.
  • FIG. 9 illustrates a network used for the flat agent and the high level agent versus a network used for a low-level agent.
  • FIG. 10A illustrates a learning speed comparison between a separation of concerns model and a flat agent for a 24 × 24 grid.
  • FIG. 10B illustrates a learning speed comparison between a separation of concerns model and a flat agent for a 48 × 48 grid.
  • FIG. 10C illustrates a learning speed comparison between a separation of concerns model and a flat agent on an 84 × 84 grid.
  • FIG. 11 illustrates the effect of varying communication reward on the final performance of a separation of concerns system on a 24 × 24 game of Catch.
  • FIG. 14B shows the learning speed of a separation of concerns model compared to baselines for average number of steps over a number of epochs.
  • FIG. 20B illustrates average episode length of a multi-advisor model in Pac- Boy against baselines.
  • FIG.20C illustrates average scores for different methods in Pac-Boy.
  • FIG.21 illustrates average performance for this experiment with noisy rewards.
  • FIG.22 illustrates an example single-head architecture.
  • FIG.23 illustrates an example Hybrid Reward Architecture (HRA).
  • HRA Hybrid Reward Architecture
  • FIG. 24 illustrates example DQN, HRA, and HRA with pseudo-rewards architectures.
  • FIG.25A illustrates example average steps over episodes of the fruit collection task.
  • FIG. 33 illustrates an example separation of concerns engine implementing a process for completing a task using separation of concerns.
  • Hierarchical learning decomposes a value function in a hierarchical way.
  • Options are temporally extended actions consisting of an initialization set, an option policy and a termination condition. Effectively, applying options to a Markov decision process (MDP) changes it into a semi-MDP, which may provide a mechanism for skill discovery.
  • MDP Markov decision process
  • An example multi-agent RL configuration includes multiple agents which are simultaneously acting on an environment and which receive rewards individually based on the joint actions. Such an example can be modelled as a stochastic game.
  • multi- agent systems can be divided into fully cooperative, fully competitive or mixed tasks (neither cooperative nor competitive). For a fully cooperative task, all agents share the same reward function.
  • ILS (Integrated Learning System): a system coordinating heterogeneous learning agents, such as search-based and knowledge-based agents.
  • LCW: Learning by Watching.
  • SoC (Separation of Concerns) improves multi-agent frameworks. For instance, SoC splits a single-agent problem into multiple parallel, communicating agents with simpler and more focused, but different, objectives (e.g., skills). An introductory example is detailed below with reference to FIG. 1.
  • FIG. 1 illustrates an example layout 100 for this introductory example, including three pieces of fruit 102 and the robot 104 with arrows 106 indicating potential directions of movement within a grid of possible positions 108.
  • the goal of the robot 104 is to reach each piece of fruit 102 scattered across the possible positions 108 as quickly as possible (e.g., in the fewest possible actions).
  • an agent controlling the robot 104 aims to maximize the return, G_t, which is the expected discounted sum of rewards: G_t = Σ_{k=0}^{∞} γ^k R_{t+k+1}.
  • the optimal policy uses a minimal number of actions to reach all of the fruit 102.
  • deep reinforcement learning a task can often be mapped to some low-dimensional representation that can accurately represent the optimal policy.
  • NP-complete: nondeterministic polynomial time complete.
  • NP-hard: i.e., at least as hard as the hardest problems in NP.
  • each piece of fruit 102 may be assigned to a specific agent whose only learning objective is to estimate the optimal action-value function for reaching that piece of fruit 102.
  • This agent sees a reward of +1 only if its assigned fruit 102 is reached and otherwise sees no reward.
  • the state-space for this agent can ignore all other fruit 102 because they are irrelevant for its value function.
  • An aggregator can then make the final action selection from among the agents of each piece of fruit 102. Therefore, a single large state-space is replaced by n much smaller state-spaces, one per agent.
  • the aggregator can, for example, use a voting scheme, select its action based on the summed action-values, or select its action according to the agent with the highest action-value. This last form of action selection could result in greedy behavior, with the agent always taking an action toward the closest piece of fruit 102, which correlates well with the performance metric. Other domains, however, might require a different aggregator; a rough sketch of these aggregation schemes follows below.
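  • As a rough illustration of these aggregation schemes, the following Python sketch shows majority voting, summed action-values, and following the single agent with the highest action-value. The per-agent data structure (a dictionary of Q-values per environment action) is an assumption made for illustration, not a structure defined by this disclosure.

```python
# Minimal sketch of the three aggregation schemes described above: voting,
# summed action-values, and "follow the single agent with the highest value".
# The agent representation (a dict of per-action Q-values) is an assumption.
from collections import Counter
from typing import Dict, List

Action = str
QValues = Dict[Action, float]  # one entry per environment action

def aggregate_vote(agent_qs: List[QValues]) -> Action:
    votes = Counter(max(q, key=q.get) for q in agent_qs)  # each agent votes for its greedy action
    return votes.most_common(1)[0][0]

def aggregate_sum(agent_qs: List[QValues]) -> Action:
    actions = agent_qs[0].keys()
    return max(actions, key=lambda a: sum(q[a] for q in agent_qs))

def aggregate_max_agent(agent_qs: List[QValues]) -> Action:
    # Greedy: follow whichever agent reports the single highest action-value
    # (e.g., head toward the closest piece of fruit).
    best = max(agent_qs, key=lambda q: max(q.values()))
    return max(best, key=best.get)

# Example: two fruit agents with Q-values over {"up", "down"}.
print(aggregate_sum([{"up": 0.9, "down": 0.2}, {"up": 0.1, "down": 0.6}]))  # prints "up"
```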
  • Disclosed embodiments include agent configurations that decompose tasks in different ways. These agent configurations can reduce an overall state space and allow for improved machine learning performance by increasing a convergence speed, reducing the amount of processing and memory resources consumed, among other improvements to computer technology.
  • a flat agent can be defined by an MDP given by the tuple ⟨X, A, P, r, γ⟩: a state space X, an action space A, a transition function P, a reward function r, and a discount factor γ. A performance objective of the flat agent is defined with respect to this MDP.
  • Each policy π has a corresponding action-value function Q^π(x, a), which gives the expected value of the return G_t conditioned on the state x and action a.
  • a goal is to maximize the discounted sum of rewards, also referred to as the return: G_t = Σ_{k=0}^{∞} γ^k R_{t+k+1}.
  • FIG. 2 illustrates an example SoC model for taking actions with respect to an environment (illustrated as Environment).
  • the SoC model can act no differently from a flat agent: the model takes an action A (as illustrated, A) with respect to the environment and can receive a state X (as illustrated, X) of the environment.
  • the illustrated SoC model includes two agents illustrated as Agent 1 and Agent 2.
  • An example task can be expanded into a system of communicating agents as follows. For each agent i (as illustrated, Agent 1 and Agent 2), an environment action-set B_i is defined (as illustrated, B_1 and B_2), as well as a communication action-set C_i (as illustrated, C_1 and C_2), and a learning objective.
  • the learning objective can be defined by a reward function, r_i, plus a discount factor, γ_i.
  • the agents share a common state-space Y (as illustrated, the dashed ellipse marked with Y) including the state-space of the flat agent plus the joint communication actions: Y = X × C_1 × ⋯ × C_n.
  • At each time step, each agent i observes its state and selects an action; the aggregated environment action is applied to the environment, which responds with an updated state.
  • the environment also produces a reward; this reward is only used to measure the overall performance of the SoC model.
  • each agent i instead uses its own reward function, r_i, to compute its own reward.
  • at each time step, an agent i chooses an action from its environment action-set B_i and/or its communication action-set C_i.
  • the environment actions chosen by the agents can be fed into an aggregator function f (as illustrated, f).
  • the aggregator function f maps the environment actions chosen by the agents to a single action A (as illustrated, A) that is applied to the environment.
  • the input space of an agent (illustrated as set Y) can be based on the communication actions of the other agents (illustrated as set C) from previous time steps and an updated flat state X.
  • communication signals can be regarded as the environment of a meta-MDP.
  • a single time step delay of communication actions can be used for a general setting where all agents communicate in parallel.
  • an agent may be partially observable or have limited visibility such that the agent does not see a full flat state-space or all communication actions.
  • each agent can receive a subset of the input space (as illustrated, Y_i). Formally, the state space of agent i is a projection of Y onto a subspace of Y; a minimal code sketch of such an agent is given below.
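  • The following is a minimal sketch, under stated assumptions, of an agent in the SoC framework described above: it observes only a projection of the shared state, learns from its own reward function, and selects actions from its own action set. The class name, attribute names, and the use of tabular Q-learning are illustrative choices, not the disclosed implementation.

```python
# A minimal sketch of an SoC agent: each agent sees only a projection of the
# shared state (flat state plus last communication actions) and learns from
# its own reward function. Tabular Q-learning stands in for any learner.
import random
from collections import defaultdict

class SoCAgent:
    def __init__(self, actions, project, reward_fn, gamma=0.99, alpha=0.1, eps=0.1):
        self.actions = actions          # environment actions B_i plus communication actions C_i
        self.project = project          # maps the full state Y to this agent's local state Y_i
        self.reward_fn = reward_fn      # r_i: this agent's own reward function
        self.gamma, self.alpha, self.eps = gamma, alpha, eps
        self.q = defaultdict(float)     # Q-values indexed by (local_state, action)

    def act(self, full_state):
        s = self.project(full_state)
        if random.random() < self.eps:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(s, a)])

    def update(self, full_state, action, next_full_state):
        s, s2 = self.project(full_state), self.project(next_full_state)
        r = self.reward_fn(full_state, action, next_full_state)
        target = r + self.gamma * max(self.q[(s2, a)] for a in self.actions)
        self.q[(s, action)] += self.alpha * (target - self.q[(s, action)])
```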
  • the sequence of random variables Y_t is a Markov chain. This can be formalized by letting π = {π_1, …, π_n} define a set of stationary policies for all agents, and Π be the space of all such sets.
  • agent i can be defined as independent of agent j if the policy of agent j does not affect the transition dynamics of agent i in any way.
  • This can be formalized by defining π_{-ij} to be a set of stationary policies for each agent except agents i and j, and Π_{-ij} to be the space of all such sets. Then, agent i is independent of agent j if its transition dynamics are the same for every policy of agent j, given any fixed set in Π_{-ij}.
  • a vertical agent can be defined similarly for the vertical direction. With these agents being fully independent, it follows that the agents converge independently of each other. Hence, stable parallel learning occurs.
  • Agents with Acyclic Dependencies
  • FIG. 5 illustrates a falling fruit example scenario exhibiting an acyclic dependency graph.
  • a robot 102 catches falling fruit 104 with a basket 106 to receive a reward.
  • the basket 106 is attached to the robot's body 108 with an arm 110 that can be moved relative to the body 108.
  • the robot 102 can move horizontally. Independent of that motion, the robot 102 can move the basket 106 a limited distance to the left or right.
  • FIG. 4 also illustrates a relationship 406 exhibiting a cyclic dependency.
  • the behavior of agent 1 depends on the behavior of agents 2 and 3
  • the behavior of agent 2 depends on the behavior of agents 1 and 3
  • the behavior of agent 3 depends on the behavior of agents 1 and 2.
  • both agents see the full state-space and the agents receive a reward when the fruit 104 is caught. Now both agents depend on each other, forming a cyclic dependency.
  • the approach of pre-training a low-level agent with some fixed policy, then freezing its weights and training a high-level policy using the pre-trained agent, may be a more general update strategy.
  • Relationship 408 in FIG. 4 illustrates an acyclic relationship formed by transforming a cyclic graph into an acyclic graph using trainer agents.
  • a trainer agent for an agent i defines fixed behavior for the agents that agent i depends on to ensure stable learning. It is to be appreciated with the benefit of this description that if the dependency graph is an acyclic graph, using single-agent Q-learning to train the different agents is straightforward.
  • the trainer agent assigned to a particular agent i can be a fixed-policy agent that generates behavior for the agents on which agent i depends, such that their effect on agent i is replaced by the effect of the trainer agent.
  • the fixed behavior of the trainer agent implicitly defines a stationary MDP for agent i with a corresponding optimal policy that can be learned.
  • in this configuration, agent i only depends on the trainer agent.
  • the trainer agent itself is an independent agent.
  • trainer agents can be used to break cycles in dependency graphs.
  • a cyclic graph can be transformed into an acyclic one in different ways. In practice, which agents are assigned trainer agents is a design choice that depends on how easy it is to define effective trainer behavior. In the simplest case, a trainer agent can just be a random or semi-random policy.
  • agent 1 depended on the behavior of agents 2 and 3.
  • Learning with trainer agents can occur in two ways.
  • a first way is to pre-train agents with their respective trainer agents and then freeze their weights and train the rest of the agents.
  • a second way is to train all agents in parallel with the agents that are connected to a trainer agent using off-policy learning to learn values that correspond to the policy of the trainer agent, while the behavior policy is generated by the regular agents.
  • Off-policy learning can be achieved by importance sampling, which corrects for the frequency at which a particular sample is observed under the behavior policy versus the frequency at which it is observed under the target policy. For example, consider an agent i that depends on an agent j, each with its own action set.
  • Further, agent i has a trainer agent i′ attached to it mimicking behavior for agent j.
  • this trainer agent has the same action set as agent j, but follows a fixed policy; a sketch of such a correction is given below.
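  • The following sketch illustrates a one-step importance-sampling correction of the kind described above: a temporal-difference update is re-weighted by the ratio between the probability of the observed action under the target (trainer) policy and its probability under the behavior policy. The TD(0) form and all function names are assumptions for illustration, not the exact scheme of the disclosure.

```python
# A rough sketch of an off-policy TD(0) update with a one-step importance-
# sampling correction: the update is re-weighted by the ratio between the
# probability of the observed action under the target policy and its
# probability under the behavior policy.
from collections import defaultdict

def is_corrected_td0_update(v, state, action, reward, next_state,
                            target_policy, behavior_policy,
                            alpha=0.1, gamma=0.99):
    """target_policy / behavior_policy map a state to {action: probability}."""
    rho = target_policy(state)[action] / behavior_policy(state)[action]
    td_error = reward + gamma * v[next_state] - v[state]
    v[state] += alpha * rho * td_error   # samples likelier under the target policy count more

values = defaultdict(float)  # state-value table, updated in place
```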
  • recursive optimality can be defined as a type of local optimality, in which the policy for each subtask is optimal given the policies of its child-subtasks.
  • a recursive optimal policy is an overall policy that includes the combination of all locally-optimal policies. The recursive optimal policy is generally less desirable than the optimal policy for a flat agent, but can be easier to determine.
  • a similar form of optimality can be defined for a SoC model. If the dependency graph of a SoC model is acyclic (with or without added trainer agents), then a recursive optimal SoC policy can be defined as the policy including all locally optimal policies. In other words, each local policy is optimal for agent i, given the policies of the agents on which agent i depends.
  • Ensemble learning includes the use of a number of weak learners to build a strong learner. Weak learners can be difficult to use in RL due to difficulties in framing RL problems as smaller problems. In some examples, there can be a combination of strong RL algorithms with policy voting or value function averaging to build an even stronger algorithm.
  • SoC allows for ensemble learning in RL with weak learners through local state space and local reward definitions.
  • SoC agents can train their policies on the flat action space on the basis of a local state space and a local reward function.
  • rather than acting on the environment directly, the agents may instead inform an aggregator, which selects the action for the overall system.
  • the SoC agents can be trained off-policy based on the actions taken by the aggregator, because the aggregator is the controller of the SoC system.
  • stable (off-policy) learning occurs if the state-space of each agent is Markov. That is, stable (off-policy) learning occurs if, for all agents i, the distribution of the agent's next local state and reward depends only on its current local state Y_i and the selected action, and not on any earlier history.
  • agents can be organized in a way that decomposes a task hierarchically. For instance, there can be three agents where Agent 0 is a top-level agent, and Agent 1 and Agent 2 are each bottom-level agents. The top-level agent only has communication actions, specifying which of the bottom-level agents is in control. In other words, Agent 1 and Agent 2 both have a state-dependent action-set that gives access to the environment actions A only if they have been given control by Agent 0.
  • By allowing Agent 0 to only switch its action once the agent currently in control has reached a terminal state (e.g., by storing a set of terminal state conditions itself or by being informed via a communication action), a typical hierarchical task decomposition can be achieved.
  • a SoC model can be a generalization of a hierarchical model.
  • an implicit MDP is defined for agent i with state space Y, reward function r_i, and the (joint) action set of agent i.
  • agent i can be independent of agent j if the value does not depend on the policy of agent j.
  • a simple example of a case where this independence relation holds is the hierarchical case, where the actions of the top agent remain fixed until the bottom agent reaches a terminal state.
  • the high-level agent has a discount factor of 0.99 and has access to the full screen, while the low-level agent has a discount factor of 0.65 and uses an optional bounding box of 10 × 10 pixels around the basket.
  • the low-level agent only observes the ball when it is inside the bounding box.
  • the high-level agent received a reward of 1 if the ball was caught and a reward of −1 otherwise.
  • the low-level agent received a reward of 1 if the ball was caught and a reward of −1 otherwise.
  • the low-level agent also received a small positive reward for taking an action suggested by the high-level agent.
  • the high-level agent took an action every two time steps, while the low-level agent took an action every time step.
  • Both the flat agent and the high-level and low-level agents were trained using a Deep Q-Network (DQN).
  • the flat agent used a convolutional neural network defined as follows: the 24 × 24 binary image was passed through two convolutional layers, followed by two dense layers. Both convolutional layers had 32 filters of size (5,5) and a stride of (2,2). The first dense layer had 128 units, followed by the output layer with 3 units.
  • FIG. 9 illustrates a network used for the flat agent and the high-level agent 902 versus a network used for the low-level agent 904. Because the low-level agent used a bounding box, it did not require a full convolutional network. A sketch of the flat-agent network described above is given below.
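  • The following PyTorch sketch reproduces the flat-agent/high-level network described above (24 × 24 binary input, two convolutional layers with 32 filters of size 5 × 5 and stride 2, a 128-unit dense layer, and a 3-unit output). The use of ReLU activations and the class name are assumptions; the disclosure specifies only the layer sizes and strides.

```python
# Minimal PyTorch sketch of the flat-agent / high-level DQN network described
# above. Layer sizes and strides follow the text; activations are assumed.
import torch
import torch.nn as nn

class FlatCatchDQN(nn.Module):
    def __init__(self, num_actions: int = 3):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=5, stride=2), nn.ReLU(),
        )
        # 24x24 input -> 10x10 after the first conv -> 3x3 after the second conv
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 3 * 3, 128), nn.ReLU(),
            nn.Linear(128, num_actions),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.conv(x))

q_values = FlatCatchDQN()(torch.zeros(1, 1, 24, 24))  # shape: (1, 3)
```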
  • FIGS. 10A–10C show the results of the comparison of performance between a SoC model and a flat agent showing the average score of each agent over a number of epochs for three different grid sizes.
  • FIG. 10A illustrates a learning speed comparison between a SoC model and a flat agent for a 24 × 24 grid.
  • FIG. 10B illustrates a learning speed comparison between a SoC model and a flat agent for a 48 × 48 grid.
  • FIG. 10C illustrates a learning speed comparison between a SoC model and a flat agent on an 84 × 84 grid.
  • the SoC model learned significantly faster than the flat agent.
  • the flat agent failed to learn anything significant over a training period of 800 epochs.
  • the SoC model converged after only 200 epochs.
  • In general, for the SoC model, the low-level agent was able to learn quickly due to its small state space, and the high-level agent experienced a less sparse reward due to its reduced action-selection frequency. For at least these reasons, the SoC model was able to significantly outperform the flat model.
  • FIG. 11 illustrates the effect of varying communication reward on the final performance of the SoC model on a 24 × 24 game of Catch.
  • the results show that if the additional reward is 0, the low-level agent has no incentive to listen to the high-level agent and will act fully independently. Alternatively, if the additional reward is very high, the low-level agent will always follow the suggestion of the high-level agent. Because both agents are limited (the high-level agent has a low action-selection frequency and the low-level agent has a limited view), both of these situations are undesirable. As illustrated, the ideal low-level agent in the experiment was one that acted neither fully independently nor fully dependently with respect to the high-level agent.
  • FIG. 12 illustrates the effect on the average score over a number of training epochs caused by different action selection intervals (asi) for a high-level agent of the SoC system on an 84 × 84 game of Catch.
  • the intervals included every 1, 2, 4, 8, and 16 time intervals.
  • an asi of 4 performed the best in the experiment, while an asi of 16 performed the worst over 200 epochs.
  • If the communication is too frequent, the learning speed goes down, because relative to the action selections the reward appears more sparse, making learning harder.
  • If it is too infrequent, asymptotic performance is reduced because the high-level agent does not have enough control over the low-level agent to move it to approximately the right position.
  • FIG. 13 illustrates the effect of penalizing communication for the high-level agent on the final performance of the system on a 24x24 catch game.
  • the communication probability shows the fraction of time steps on which the high-level agent sends a communication action. It can be seen in FIG.13 that the system can learn to maintain near optimal performance without the need for constant communication.
  • the decomposition was made a priori; however, it is to be appreciated by a person of skill in the art with the benefit of this description that this is only a non-limiting example.
  • learning the decomposition can also prove to be useful.
  • the pellet distribution is randomized: at the start of each new episode, there is a 50% probability for each position to have a pellet. During an episode, pellets remain fixed until they are eaten by Pac-Boy.
  • the state of the game includes the positions of Pac-Boy, the pellets, and the ghosts. This results in a combined state space that is far too large to represent exactly (Pac-Boy's position times all pellet configurations times all ghost positions).
  • the SoC model was tested in this environment, and concerns were separated in the following manner: an agent was assigned to each possible pellet location. This pellet agent receives a reward of 1 only if a pellet at its assigned position is eaten.
  • the pellet agent’s state space includes Pac-Boy’s position, which results in 76 states. A pellet agent is only active when there is a pellet at its assigned position.
  • an agent was assigned to each ghost. This ghost agent receives a reward of −10 if Pac-Boy bumps into its assigned ghost.
  • the ghost agent’s state space includes Pac-Boy’s position and the ghost’s position, resulting in 76² states. Because there are on average 38 pellets, the average number of agents is 40. A sketch of this decomposition is given below.
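  • A sketch of this decomposition, under assumed data structures, might look as follows: one agent per pellet position with Pac-Boy's position as its local state, and one agent per ghost with Pac-Boy's and the ghost's positions as its local state. The dictionary-based environment state and helper names are hypothetical.

```python
# A sketch of the Pac-Boy decomposition: a pellet agent sees +1 when its pellet
# is eaten and observes only Pac-Boy's position (76 local states); a ghost
# agent sees -10 on contact and observes Pac-Boy's and its ghost's positions
# (76^2 local states). The state dict layout is an assumption.
def make_pellet_agent(pellet_pos):
    def project(state):
        return state["pacboy"]
    def reward(state, action, next_state):
        return 1.0 if next_state["pacboy"] == pellet_pos and pellet_pos in state["pellets"] else 0.0
    return {"project": project, "reward": reward, "active_if": lambda s: pellet_pos in s["pellets"]}

def make_ghost_agent(ghost_id):
    def project(state):
        return (state["pacboy"], state["ghosts"][ghost_id])
    def reward(state, action, next_state):
        return -10.0 if next_state["pacboy"] == next_state["ghosts"][ghost_id] else 0.0
    return {"project": project, "reward": reward, "active_if": lambda s: True}

agents = [make_pellet_agent(p) for p in range(76)] + [make_ghost_agent(g) for g in range(2)]
```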
  • Two non-SoC deep reinforcement learning baselines were also considered: DQN-clipped and DQN-scaled.
  • FIGS.14A and 14B show the learning speed of the SoC model compared to the DQN-clipped, DQN-scaled, and Linear Q Learning baselines described above.
  • FIG. 14A compares the average scores (higher is better) over a number of epochs for the models and
  • FIG. 14B compares the average number of steps (lower is better) taken over a number of epochs for the models.
  • One epoch corresponds to 20,000 environmental steps and each curve shows the average performance over 5 random seeds.
  • the upper-bound line in FIG. 14A shows the maximum average score that can be obtained.
  • the SoC model converged to a policy that was very close to the optimal upper bound, and the baselines fell considerably short of that upper bound even after converging.
  • the Linear Q Learning baseline handled the massive state space with no reductions and thus took considerably longer to converge.
  • Although DQN-clipped and DQN-scaled converged to similar final performances, their policies differed significantly, as can be seen in the differing average number of steps taken by each in FIG. 14B.
  • DQN-scaled appeared to be much warier of the high negative reward obtained from being eaten by the ghosts and thus took more steps to eat all of the pellets.
  • the learning rate was set to 0.00025 (which was found to be the best learning rate for DQN on Pac-Boy) and then a search was run for the adaptive- normalization rate by searching over the same parameters mentioned above.
  • the settings used for the Catch and Pac-Boy agents and experiments are shown in Table 1 (below).
  • Multi-advisor reinforcement learning can be a branch of SoC where a single-agent reinforcement learning problem is distributed to n learners called advisors. Each advisor tries to solve the problem from a different angle. Their advice is then communicated to an aggregator, which is in control of the system.
  • Disclosed examples include three off-policy bootstrapping methods: local-max bootstraps with the local greedy action, rand-policy bootstraps with respect to the random policy, and agg-policy bootstraps with respect to the aggregator’s policy.
  • a single-agent reinforcement learning task can be partitioned into a multi-agent problem (e.g., using a divide and conquer paradigm). All agents can be placed at a same level and be given advisory roles that include providing an aggregator with local Q-values for each available action.
  • a multi-advisor model can be a generalization of reinforcement learning with ensemble models, allowing for both the fusion of several weak reinforcement learners and the decomposition of a single-agent reinforcement learning problem into concurrent subtasks.
  • agents are trained independently and greedily to their local optimality, and are aggregated into a global policy by voting or averaging.
  • An attractor is a state where advisors are attracting in every direction equally and where the local-max aggregator’s optimal behavior is to remain static.
  • Disclosed examples include at least two attractor-free, off-policy bootstrapping methods.
  • there is rand-policy bootstrapping which allows for convergence to a fair short-sighted policy.
  • this example favors short-sightedness over long-term planning.
  • there is an agg-policy bootstrapping method that optimizes the system with respect to the global optimal Bellman equation.
  • this example does not guarantee convergence in the general case. The three bootstrapping targets are sketched below.
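  • The three bootstrapping targets can be sketched as follows for an advisor j's tabular update; the update form and names are illustrative assumptions, not the disclosed algorithm.

```python
# Minimal sketch of the three off-policy bootstrapping targets described above:
# local-max bootstraps on the advisor's own greedy action, rand-policy on a
# uniformly random policy (the mean of the local Q-values), and agg-policy on
# the action actually chosen by the aggregator.
def advisor_update(q_j, s, a, r_j, s2, actions, method, aggregator_action=None,
                   alpha=0.1, gamma=0.99):
    if method == "local-max":
        bootstrap = max(q_j[(s2, b)] for b in actions)
    elif method == "rand-policy":
        bootstrap = sum(q_j[(s2, b)] for b in actions) / len(actions)
    elif method == "agg-policy":
        bootstrap = q_j[(s2, aggregator_action)]   # evaluate the aggregator's own policy
    else:
        raise ValueError(method)
    q_j[(s, a)] += alpha * (r_j + gamma * bootstrap - q_j[(s, a)])
```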
  • a multi-advisor reinforcement learning architecture can greatly speed up learning and converges to a better solution than certain reinforcement learning baselines.
  • a reinforcement learning framework can be formalized as a Markov Decision Process (MDP), defined by a tuple ⟨X, A, P, r, γ⟩, where X is the state space, A is the action space, P is the transition function, r is the reward function, and γ is the discount factor.
  • a trajectory is the projection into the MDP of the agent's history: the sequence of visited states, taken actions, and received rewards.
  • a goal is to generate trajectories with high discounted cumulative reward, also called the return: G_t = Σ_{k=0}^{∞} γ^k R_{t+k+1}. To do so, one needs to find a policy that maximizes the expected return.
  • the multi-advisor models fall within SoC, and SoC distributes the responsibilities among several agents that may communicate and have complex relationships, such as master-servant or collaborators-as-equals relationships.
  • the following section restates, in the multi-advisor reinforcement learning notation, the main theoretical result: the stability theorem ensuring, under certain conditions, that the advisors’ training eventually converges.
  • Multi-advisor reinforcement learning can be interpreted as ensemble learning for reinforcement learning.
  • a boosting algorithm is used in an RL framework, but the boosting is performed upon policies, not RL algorithms. This technique can be seen as a precursor to the policy reuse algorithm rather than ensemble learning.
  • several online RL algorithms are combined on several simple RL problems. The mixture models of the five experts generally perform better than any single one alone. The algorithms can include off-policy, on-policy, and actor-critic methods, among others, and this effort can be continued in a very specific setting where actions are explicit and transitions are deterministic.
  • FIG. 17 illustrates a central state (as illustrated, x) in which the system has three possible actions: stay put, perform the goal of advisor 1, or perform the goal of advisor 2.
  • An attractor x is a state where local-max would lead to the aggregator staying in that state, if it had the chance: no action that leaves x has a higher summed local action-value than remaining at x.
  • An advisor j can be monotonous if the following condition is satisfied:
  • the local rand-policy optimization is equivalent to the global rand-policy optimization. As such, it does not suffer from the local attractor issue previously described. However, optimizing the value function with respect to the random policy is in general far from the optimal solution to the global MDP problem.
  • Another option, agg-policy bootstrapping, uses the aggregator policy as the reference.
  • the aggregator is in control, and the advisors are evaluating the current aggregator’s policy f.
  • the aggregator’s policy is dependent on the other advisors, which means that, even though the environment can still be modelled as an MDP, the training procedure is not. Assuming that all advisors jointly converge to their respective local optimal values, each such value satisfies a Bellman equation defined with respect to the aggregator’s policy f.
  • the multi-advisor model was evaluated using the Pac-Boy experiment as described above.
  • each advisor was responsible for a specific source of reward (or penalty). More precisely, concerns were separated as follows: an advisor was assigned to each possible pellet location. This advisor sees a reward of 1 only if a pellet at its assigned position gets eaten. Its state space includes Pac-Boy’s position, resulting in 76 states. A pellet advisor is only active when there is a pellet at its assigned position, and it is set inactive when its pellet is eaten. In addition, an advisor was assigned to each ghost. This advisor receives a reward of −10 if Pac-Boy bumps into its assigned ghost. Its state space includes Pac-Boy’s position and the ghost’s position, resulting in 76² states. Because there are on average 37.5 pellets, the average number of advisors running at the beginning of each episode is 39.5.
  • time scale was divided into 50 epochs lasting 20,000 transitions each.
  • an evaluation phase was launched for 80 games.
  • Each experimental result is presented along two performance indicators: the averaged non-discounted rewards and the average length of the games.
  • the averaged non-discounted rewards can be seen as the number of points obtained in a game. The theoretical maximum is 37.5 and the random policy's average performance is around −80, which corresponds to being eaten around 10 times by the ghosts.
  • FIG. 18 illustrates an example three-pellet attractor in Pac-Boy.
  • the example three-pellet attractor occurs when the game is in a state with equal distance between Pac-Boy 1802 and three pellets 1804, with Pac-Boy 1802 adjacent to a wall 1806, enabling Pac-Boy to perform a no-op action. Moving towards a pellet 1804 makes Pac-Boy closer to that pellet, but further from the two other pellets 1804, since diagonal moves are not allowed. Expressing the real value of each action under local-max shows that the no-op action obtains the highest aggregated value.
  • As a result, the aggregator may opt to hit the wall 1806 indefinitely. Optimality is not guaranteed, and in this case, the system behavior would be sub-optimal.
  • After moving North or West, Pac-Boy 1802 arrives in a state that is symmetrically equivalent to the first one. More generally, in a deterministic navigation task like Pac-Boy, where each action can be cancelled by another action, it can be shown that the condition on γ is a function of the size of the action set A. A more general result on stochastic navigation tasks can be demonstrated.
  • FIG.20B illustrates average episode length against baselines over a number of epochs.
  • the baselines included DQN-clipped, DQN-scaled, and DQN-Pop-Art.
  • Although the baselines converge to similar final scores, their learned policies are in fact very different.
  • DQN-scaled appears to be much warier of the high negative reward obtained from being eaten by the ghosts and thus takes much more time to eat all the pellets.
  • the agg-policy method outperforms the baselines by having a lower average number of steps across the epochs.
  • the reward function may be decomposed such that the sub-reward functions depend on a subset of the entire set of state variables. These sub-reward functions may be smooth value functions that are easier to learn. Smooth functions can be simplified in comparison to other value functions and can be described by fewer parameters.
  • Disclosed embodiments can be relevant to achieving more efficient convergence to a close-to-optimal policy. In some embodiments, this can be achieved by acting greedily with respect to the Q-values of a uniformly random policy. Evaluating a random policy can result in the Q-values of individual agents being fully independent of each other, which can result in a smooth value function that can be efficiently learned.
  • This update can be referred to as a local-mean update.
  • Each head can be associated with a different reward function.
  • HRA builds on the Horde architecture.
  • the Horde architecture includes a large number of “demons” that learn in parallel via off-policy learning. Each demon trains a separate general value function (GVF) based on its own policy and pseudo-reward function.
  • GVF general value function
  • a pseudo-reward can be any feature-based signal that encodes useful information.
  • the Horde architecture can focus on building general knowledge about a world encoded via a large number of GVFs.
  • HRA focuses on training separate components of the environment-reward function to achieve a smoother value function to efficiently learn a control policy.
  • HRA can apply multi-objective learning to smooth a value function of a single reward function.
  • the DQN neural network 2410 can include an input layer 2412, one or more hidden layers 2414, and an output layer 2416 used to produce an output 2418.
  • Backpropagation can be used to train the neural network 2410 based on error measured at the output 2418.
  • the HRA neural network 2420 includes an input layer 2422, one or more hidden layers 2424, and a plurality of heads 2426, each with its own reward function (as illustrated, R_1, R_2, …, R_n).
  • the heads 2426 inform the output 2428 (e.g., using a linear combination).
  • Backpropagation can also be used to train the HRA neural network 2420.
  • Backpropagation can be used to train the neural network 2420 based on error measured at each of the reward function heads 2426. By measuring error at the heads 2426 (e.g., rather than at the output 2428 as in the DQN network 2410), faster learning can occur.
  • the DQN neural network 2410 and the HRA neural network 2420 can have the same network architecture but differ in how the network is updated.
  • a gradient based on the error at each head can be computed, and the gradients can be combined to update the shared layers of the network.
  • As an example of the mapping 2437, consider the fruit-collection example, where there can be heads 2426 that provide a reward for reaching a particular location that can have a piece of fruit. The mapping 2437 may be based on whether there actually was a piece of fruit at a current location. If so, the mapping 2437 can provide the value of the general value function for the location. If not, the mapping 2437 can provide an output with a value of zero. In this manner, there can be learning even if there is no fruit at a particular location. For example, the weights of the network 2430 can be updated via backpropagation based on the error of the general value function regardless of whether there is fruit at the location.
  • the mappings 2437 can be used to filter out results where the fruit is not present prior to providing the output of the heads 2438, so as not to affect the overall output of the network 2439 (and thus a decision taken by an agent based on the network 2430) while still allowing for training. A sketch of such a multi-head network follows below.
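  • A minimal PyTorch sketch of such a multi-head network is given below: shared lower layers, one Q-value head per component reward, a per-head TD loss (error measured at the heads), and an aggregate Q obtained by combining the heads (a mean is used here, in line with the HRA mean variant). Layer sizes, the per-head greedy bootstrap, and all names are illustrative assumptions.

```python
# Sketch of a hybrid reward architecture: shared layers, one Q head per
# component reward, per-head TD errors, and a mean-aggregated Q for control.
import torch
import torch.nn as nn

class HRANetwork(nn.Module):
    def __init__(self, state_dim: int, num_actions: int, num_heads: int, hidden: int = 128):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.heads = nn.ModuleList([nn.Linear(hidden, num_actions) for _ in range(num_heads)])

    def forward(self, x):                     # -> (batch, num_heads, num_actions)
        h = self.shared(x)
        return torch.stack([head(h) for head in self.heads], dim=1)

    def aggregate_q(self, x):                 # combined Q-values used for action selection
        return self.forward(x).mean(dim=1)

def hra_loss(net, states, actions, rewards, next_states, gamma=0.99):
    """rewards: (batch, num_heads) component rewards; actions: (batch,) long tensor."""
    q = net(states)                                                        # (B, H, A)
    q_taken = q.gather(2, actions[:, None, None].expand(-1, q.shape[1], 1)).squeeze(2)
    with torch.no_grad():
        next_q = net(next_states).max(dim=2).values                        # per-head greedy bootstrap
    target = rewards + gamma * next_q
    return ((q_taken - target) ** 2).mean()                                # mean of per-head TD errors
```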
  • FIG. 25A illustrates the results comparing DQN max, DQN max (removed features), HRA mean, and HRA mean (removed features).
  • HRA showed a clear performance boost over DQN by requiring fewer steps, even though the network was identical. Further, adding different forms of domain knowledge caused additional large improvements. Whereas using a network structure enhanced by domain knowledge caused large improvements for HRA, using that same network for DQN resulted in DQN not learning anything at all.
  • FIG. 25B illustrates results comparing tabular HRA GVF, Tabular HRA, and HRA mean (removed features). As illustrated, the Tabular HRA GVF converged to a low number of average steps much more quickly than tabular HRA and HRA mean (removed features).
  • a second domain experiment was performed using the ATARI 2600 game MS. PAC-MAN.
  • In MS. PAC-MAN, the player scores points by reaching pellets in a map while avoiding ghosts.
  • FIGS. 26A–D illustrate the four different maps 2601 in the game.
  • Each of the four different maps 2601 includes a different maze formed by walls 2602. Within the maze are pellets 2604 and power pellets 2606. Ghosts 2608 and bonus fruit 2610 can also appear in the maze.
  • the player controls Ms. Pac-Man 2612 during the game. Points 2614 are scored when Ms. Pac-Man 2612 “eats” (reaches) the pellets 2604 and power pellets 2606. Contact with a ghost 2608 causes Ms. Pac-Man 2612 to lose a life 2616, but eating one of the power pellets 2606 turns the ghosts 2608 blue for a short duration, allowing them to be eaten for extra points. Bonus fruit 2610 can be eaten for extra points twice per level. When all pellets 2604 and power pellets 2606 have been eaten, a new map 2601 is started. There are seven different types of fruit 2610, each with a different point value.
  • the HRA architecture for this experiment used one head for each pellet, one head for each ghost, one head for each blue ghost, and one head for the fruit. Similar to the fruit collection task, HRA used GVFs that learned the Q-values for reaching a particular location on the map (separate GVFs can be learned for each of the maps in the game). The agent learns part of this representation during training. It started with zero GVFs and zero heads for the pellets. By wandering around the maze, it discovered new map locations it could reach, which resulted in new GVFs being created. Whenever the agent found a pellet at a new location, it created a new head corresponding to the pellet.
  • the Q-values of the head of an object were the Q-values of the GVF that correspond with the object’s location (e.g., moving objects use a different GVF each time). If an object was not on the screen, its Q-values were zero.
  • Each head i was assigned a weight w i , which could be positive or negative.
  • For edible objects, the weight corresponded to the reward received when the object is eaten.
  • For ghosts, the weights were set to −1,000 because contact with a ghost causes Ms. Pac-Man to lose a life.
  • GVF heads (eaters and avoiders): Ms. Pac-Man’s state was defined by her low-level features: her position on the map and her direction (North, South, East, or West). Depending on the map, there are about 400 positions and 950 states. A GVF was created online for each visited Ms. Pac-Man position. Each GVF was then in charge of determining the value of the random policy of Ms. Pac-Man’s state for getting the pseudo-reward placed on the GVF’s associated position. The GVFs were trained online with off-policy one-step bootstrapping; thus, the full tabular representation of each GVF could be maintained without function approximation.
  • Aggregator: for each object of the game (e.g., pellets, ghosts, and fruits), the GVF corresponding to its position was activated with a multiplier depending on the object type. Edible objects’ multipliers were consistent with the number of points they grant (e.g., a pellet multiplier was 10, a power pellet multiplier was 50, a fruit multiplier was 200, and a blue-and-edible-ghost multiplier was 1000). A ghost multiplier of −1000 appeared to produce a fair balance between gaining points and not losing a life. Finally, the aggregator summed up all the activated and multiplied GVFs to compute a global score for each of the nine actions and chose the action that maximized it; a sketch is given below.
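  • A sketch of this aggregation step, with hypothetical data structures, is given below: each on-screen object activates the GVF for its position, the GVF's Q-values are scaled by the object-type multiplier, and the summed scores are maximized over the actions.

```python
# Sketch of the Ms. Pac-Man aggregator: sum the multiplier-weighted GVF
# Q-values of all active objects and pick the maximizing action. The data
# layout (nested dicts) and function names are assumptions.
MULTIPLIERS = {"pellet": 10, "power_pellet": 50, "fruit": 200, "blue_ghost": 1000, "ghost": -1000}

def aggregate_action(objects, gvf_q, pacboy_state, actions):
    """objects: list of (object_type, position); gvf_q[position][pacboy_state][action] -> Q-value."""
    def score(action):
        return sum(MULTIPLIERS[obj_type] * gvf_q[pos][pacboy_state][action]
                   for obj_type, pos in objects)
    return max(actions, key=score)
```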
  • FIG. 27 illustrates training curves (scores over episodes) for incremental head additions to the HRA. These curves include: curve 2701, showing results for an HRA without normalization, exploration, or diversification; curve 2702, showing results for an HRA without normalization or exploration but with diversification; curve 2703, showing results for an HRA with normalization and diversification but without exploration; and curve 2704, showing results for an HRA with normalization, exploration, and diversification.
  • Curve 2701 in FIG. 27 reveals that an HRA with naïve settings (without normalization, exploration, or diversification) performs relatively poorly because it tends to deterministically repeat a bad trajectory, like a robot hitting a wall continuously.
  • the HRA of curve 2702 builds on the settings of the HRA of curve 2701 by adding a diversification head that addresses the determinism issue.
  • the architecture progressed quickly up to about 10,000 points, but then started regressing.
  • the analysis of the generated trajectories reveals that the system had difficulty finishing levels: when only a few pellets remained on the screen, the aggregator was overwhelmed by ghost avoider values.
  • the regression in score can be explained by the system becoming more averse to ghosts the more it learns, which makes it difficult to finish levels.
  • Score heads normalization: the issue shown in curve 2702 can be addressed by modifying the additive aggregator with a normalization of the score heads between 0 and 1. To fit this new value scale, the ghost multiplier was modified to −10.
  • Targeted exploration head: an exploration bonus is computed from visit counts, where N is the number of actions taken until now and n(s, a) is the number of times an action a has been performed in state s.
  • This formula replaces the stochastically motivated logarithmic function of an upper confidence bounds approach (see Auer et al.) with a less drastic one that is more compliant with bootstrapping propagation.
  • the targeted exploration head is not necessarily a replacement for a diversification head. Rather, they are complementary: diversification for making each trajectory unique and targeted exploration for prioritized exploration.
  • the HRA of curve 2704 builds on the HRA of curve 2703 by adding targeted exploration.
  • the HRA of curve 2704 reveals that the new targeted exploration head helps exploration and makes the learning faster. This setting constitutes the HRA architecture that will be used in further experiments; a sketch of the exploration-count bookkeeping follows below.
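  • The count bookkeeping behind such a targeted exploration head can be sketched as follows. The classical UCB bonus of Auer et al. is shown only for reference; the disclosure replaces its logarithm with a less drastic function whose exact form is not reproduced here, and all names are illustrative assumptions.

```python
# Bookkeeping for a count-based targeted-exploration head: N is the total
# number of actions taken so far and n(s, a) counts how often action a was
# taken in state s. Only the classical UCB bonus is shown for reference; the
# disclosure's milder replacement for the logarithm is not reproduced here.
import math
from collections import defaultdict

counts = defaultdict(int)   # n(s, a)
total_actions = 0           # N

def record(state, action):
    global total_actions
    counts[(state, action)] += 1
    total_actions += 1

def ucb_reference_bonus(state, action):
    # sqrt(log N / n(s, a)), the stochastically motivated UCB form mentioned above.
    return math.sqrt(math.log(max(total_actions, 2)) / (counts[(state, action)] + 1))
```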
  • Executive memory head When a human game player maxes out cognitive and physical ability, the player may start to look for favorable situations or even glitches to memorize. This cognitive process can be referred to as executive memory.
  • the executive memory head records every sequence of actions that led to passing a level without any player deaths. Then, when facing the same level, the head gives a very high value to the recorded actions, in order to force the aggregator’s selection. Since it does not allow generalization, this head was only employed for the level-passing experiment. An executive memory head can be added to HRA to further improve results.
  • Hybrid Reward Model Experiments: MS. PAC-MAN: Results
  • MS. PAC-MAN is considered as one of the hardest games from the ALE benchmark set.
  • ALE is ultimately a fully deterministic environment (it implements pseudo-randomness using a random number generator that always starts with the same seed).
  • Therefore, both evaluation metrics aim to create randomness in the evaluation in order to rate methods with more generalizing behavior higher.
  • the first metric introduces a mild form of randomness by taking a random number of no-op actions before control is handed over to the learning algorithm (called a “fixed start”).
  • In the case of Ms. Pac-Man, however, the game starts with a certain inactive period that exceeds the maximum number of no-op steps, resulting in the game having a fixed start after all.
  • the second metric selects random starting points along a human trajectory; it results in much stronger randomness and does produce the intended random start evaluation (called a “random start”).
  • Table 5 illustrates final, high-scores for various methods.
  • the best-reported fixed start score comes from STRAW (Vezhnevets et al., 2016); the best-reported random start score comes from the Dueling network architecture (Wang et al., 2016).
  • the human fixed start score comes from Mnih et al (2015); the human random start score comes from Nair et al. (2015).
  • A3C was run in a way intended to reproduce the results of Mnih et al. (2016).
  • the pixel-based baseline reproduced the preprocessing and the network, except that a history of two frames was used because the steps were twice as long.
  • FIG. 28 compares training curves for HRA, pixel-based A3C baseline, and low-level A3C baseline. The curves reveal that HRA reaches an average score of 25,000 after only 3,000 episodes. This is ten times higher than the A3C baselines after 100,000 episodes, and four times higher than the best result in the literature (6,673 for STRAW by Vezhnevets et al 2016) and 60% higher than human performance.
  • FIG. 29 illustrates a training curve for HRA in the game MS. PAC-MAN, smoothed over 100 episodes, for the level-passing experiment.
  • the curves include a curve showing scores for HRA, pixel-based A3C, and low-level A3C.
  • HRA was able to exploit the weakness of the fixed-start evaluation metric by using executive memory capabilities.
  • the training curve shows that HRA was able to achieve the maximum possible score of 999,990 points in less than 3,000 episodes.
  • the curve rises slowly in the first stages while the model is trained, but, even though the later levels become more difficult, the level passing speeds up because the HRA is able to take advantage of already knowing the maps.
  • FIG. 31 illustrates training curves for HRA in the game MS. PAC-MAN.
  • Disclosed embodiments relate to, among other things, separating concerns for a single-agent task both analytically, by determining conditions for stable learning, and empirically, through evaluation on two domains.
  • By giving an agent a reward function that depends on the communication actions of other agents, the agent can be made to listen to requests from other agents to different degrees. How well it listens can depend on the specific reward function.
  • an agent can be made to fully ignore other agents, to be fully controlled by other agents, or something in between, where it makes a trade-off between following the request of another agent and ignoring it.
  • An agent that retains some level of independence can in some cases yield strong overall performance.
  • an SoC model can convincingly beat (single-agent) state-of-the-art methods on a challenging domain.
  • an SoC model can use domain-specific knowledge to improve performance.
  • RL can be scaled up such that it can be applied in specific real-world systems, for example complex dialogue systems or bot environments. In this context, using domain knowledge to achieve good performance on an otherwise intractable domain is acceptable.
  • SoC is illustrated in at least two specific settings, called action aggregation and ensemble RL. SoC’s expressive power is wider, and other SoC settings are possible.
  • the SoC configuration used in some embodiments included a high-level agent with only communication actions and a low-level agent that only performs environment actions.
  • alternative configurations that use more than two agents can be substituted.
  • the reward function in reinforcement learning often plays a double role: it acts as both the performance objective, specifying what type of behavior is desired, as well as the learning objective, that is, the feedback signal that modifies the agent’s behavior. That these two roles do not always combine well into a single function becomes clear from domains with sparse rewards, where learning can be prohibitively slow.
  • the SoC model addresses this by fully separating the performance objective, including the reward function of the environment, from the learning objectives of the agents, including their reward functions.
  • Disclosed embodiments further relate to a Hybrid Reward Architecture (HRA).
  • HRA Hybrid Reward Architecture
  • One of the strengths of HRA is that it can exploit domain knowledge to a much greater extent than single-head methods. This was shown clearly by the fruit collection task: while removing irrelevant features caused a large improvement in performance for HRA, no effective learning occurred for DQN when it was provided with the same network architecture. Furthermore, separating the pixel image into multiple binary channels only caused a small improvement in the performance of A3C over learning directly from pixels. This demonstrates that the reason that modern deep RL methods struggle with Ms. Pac-Man is not related to learning from pixels; the underlying issue is that the optimal value function for Ms. Pac-Man cannot easily be mapped to a low-dimensional representation.
  • HRA performs well in the MS. PAC-MAN experiment, in part, by learning close to 1800 general value functions. This results in an exponential breakdown of the problem size: whereas the input state-space corresponding with the binary channels is astronomically large, each GVF has a state-space on the order of 10³ states, small enough to be represented without function approximation. While a deep network for representing each GVF could have been used, using a deep network for such small problems can hurt more than it helps, as evidenced by the experiments on the fruit collection domain.
  • FIG.32 illustrates an example process 2200 for taking an action with respect to a task using separation of concerns.
  • the process 2200 can begin with the flow moving to operation 2202, which involves obtaining the task. Following operation 2202, the flow can move to operation 2204, which involves decomposing the task into a plurality of agents. Following operation 2204, the flow can move to operation 2206, which involves training the plurality of agents. Following operation 2206, the flow can move to operation 2208, which involves taking an action with respect to the task based on the agents.
  • FIG. 33 illustrates an example separation of concerns engine 2300 implementing a process 2301 for completing a task using separation of concerns.
  • the process can begin with the flow moving to operation 2302, which involves obtaining agents. Following operation 2302, the flow can move to operation 2304, which involves obtaining a task. Following operation 2304, the flow can move to operation 2306 and then operation 2308. Operation 2306 involves observing a portion of the state space of the task. Operation 2308 involves selecting an action. Operations 2306 and 2308 can be performed for each agent. Following operation 2306 and operation 2308, the flow can move to operation 2310, which involves selecting an action from the actions selected with each agent. Following operation 2310, the flow can move to operation 2312, which involves performing the selected action with respect to the task. If the task is complete following the action, the method can end. If the task is not complete, the flow can return to operation 2306 where a portion of an updated state space of the task is observed.
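  • A minimal sketch of this loop, under an assumed environment/agent/aggregator interface, is given below.

```python
# Sketch of the separation-of-concerns loop of process 2301: each agent
# observes its portion of the state space and proposes an action, an
# aggregator selects one action from the proposals, and the selected action is
# applied to the task until it is complete. The env/agent/aggregator interface
# is an assumption for illustration.
def run_soc_task(env, agents, aggregator):
    state = env.reset()                                  # obtain the task's initial state
    while not env.task_complete():
        proposals = [agent.act(agent.observe(state)) for agent in agents]  # per-agent action selection
        action = aggregator(proposals)                   # select one action from the proposals
        state = env.step(action)                         # perform it; observe the updated state
    return state
```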
  • FIG.34 illustrates an example hybrid reward engine 3100, including a process 3101 for selecting an action to take in an environment based on a hybrid reward.
  • the process 3101 can begin with operation 3102, which involves obtaining a reward function associated with an environment.
  • operation 3104 which involves splitting the reward function into n reward functions weighted by w.
  • operation 3106 which involves training separate reinforcement learning (RL) agents on each reward function.
  • operation 3108 which involves using trained agents to select an action to take in the environment.
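  • Process 3101 can be sketched as follows, under an assumed learner and environment interface: the reward is split into weighted component reward functions, one learner is trained per component, and the trained learners jointly select actions via their weighted Q-values.

```python
# Sketch of process 3101: split the reward into n weighted component reward
# functions, train one RL learner per component, and select actions by
# maximizing the weighted sum of component Q-values. The learner/env interface
# (reset, step, done, update, q_value, actions) is hypothetical.
def hybrid_reward_control(env, reward_components, weights, make_learner, episodes=100):
    learners = [make_learner(r) for r in reward_components]   # one RL agent per sub-reward
    for _ in range(episodes):                                  # train each agent on its own reward
        state = env.reset()
        while not env.done():
            action = select_action(learners, weights, state)
            next_state = env.step(action)
            for learner, r in zip(learners, reward_components):
                learner.update(state, action, r(state, action, next_state), next_state)
            state = next_state
    return learners

def select_action(learners, weights, state):
    # Weighted sum of the component Q-values, maximized over the action set.
    actions = learners[0].actions
    return max(actions, key=lambda a: sum(w * l.q_value(state, a) for w, l in zip(weights, learners)))
```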
  • FIG.35, FIG. 36, FIG.37 and the associated descriptions provide a discussion of a variety of operating environments in which aspects of the disclosure may be practiced.
  • the devices and systems illustrated and discussed with respect to FIGS. 35–37 are for purposes of example and illustration and are not limiting of a vast number of computing device configurations that may be utilized for practicing aspects of the disclosure, as described herein.
  • FIG.35 is a block diagram illustrating physical components (e.g., hardware) of a computing device 2400 with which aspects of the disclosure may be practiced.
  • the computing device components described below may have computer executable instructions for implementing the separation of concerns engine 2300 and the hybrid reward engine 3100, among other aspects disclosed herein.
  • the computing device 2400 may include at least one processing unit 2402 (e.g., a central processing unit) and system memory 2404.
  • the system memory 2404 can comprise, but is not limited to, volatile storage (e.g., random access memory), non-volatile storage (e.g., read-only memory), flash memory, or any combination of such memories.
  • the system memory 2404 may include one or more agents 2406 and training data 2407.
  • the training data 2407 may include data used to train the agents 2406.
  • the system memory 2404 may include an operating system 2405 suitable for running the separation of concerns engine 2300 or one or more aspects described herein.
  • the operating system 2405 for example, may be suitable for controlling the operation of the computing device 2400.
  • Embodiments of the disclosure may be practiced in conjunction with a graphics library, a machine learning library, other operating systems, or any other application program and are not limited to any particular application or system.
  • a basic configuration 2410 is illustrated in FIG. 35 by those components within a dashed line.
  • the computing device 2400 may have additional features or functionality.
  • the computing device 2400 may also include additional data storage devices (removable and/or non-removable) such as, for example, magnetic disks, optical disks, or tape.
  • additional storage is illustrated in FIG. 35 by a removable storage device 2409 and a non-removable storage device 2411.
  • program modules 2408 may perform processes including, but not limited to, the aspects, as described herein.
  • Other program modules may also be used in accordance with aspects of the present disclosure.
  • embodiments of the disclosure may be practiced in an electrical circuit comprising discrete electronic elements, packaged or integrated electronic chips containing logic gates, a circuit utilizing a microprocessor, or on a single chip containing electronic elements or microprocessors.
  • embodiments of the disclosure may be practiced via a system-on-a-chip where each or many of the components illustrated in FIG.35 may be integrated onto a single integrated circuit.
  • Such a system-on-a-chip device may include one or more processing units, graphics units, communications units, system virtualization units and various application functionality all of which are integrated (or “burned”) onto the chip substrate as a single integrated circuit.
  • the functionality described herein with respect to the capability of a client to switch protocols may be operated via application-specific logic integrated with other components of the computing device 2400 on the single integrated circuit (chip).
  • Embodiments of the disclosure may also be practiced using other technologies capable of performing logical operations such as, for example, AND, OR, and NOT, including but not limited to mechanical, optical, fluidic, and quantum technologies.
  • embodiments of the disclosure may be practiced within a general purpose computer or in any other circuits or systems.
  • the computing device 2400 may also have one or more input device(s) 2412 such as a keyboard, a mouse, a pen, a sound or voice input device, a touch or swipe input device, and other input devices.
  • the output device(s) 2414 such as a display, speakers, a printer, actuators, and other output devices may also be included.
  • the aforementioned devices are examples and others may be used.
  • the computing device 2400 may include one or more communication connections 2416 allowing communications with other computing devices 2450. Examples of suitable communication connections 2416 include, but are not limited to, radio frequency transmitter, receiver, and/or transceiver circuitry; universal serial bus (USB), parallel, and/or serial ports.
  • Computer readable media may include computer storage media.
  • Computer storage media may include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information, such as computer readable instructions, data structures, or program modules 2408.
  • the system memory 2404, the removable storage device 2409, and the non-removable storage device 2411 are all computer storage media examples (e.g., memory storage).
  • Computer storage media may include RAM, ROM, electrically erasable read-only memory (EEPROM), flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other article of manufacture which can be used to store information and which can be accessed by the computing device 2400. Any such computer storage media may be part of the computing device 2400. Computer storage media does not include a carrier wave or other propagated or modulated data signal.
  • Communication media may be embodied by computer readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave or other transport mechanism, and includes any information delivery media.
  • the term "modulated data signal" may describe a signal that has one or more characteristics set or changed in such a manner as to encode information in the signal.
  • communication media may include wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, radio frequency, infrared, and other wireless media.
  • FIGS. 36A and 36B illustrate a mobile computing device 500, for example, a mobile telephone, a smart phone, a wearable computer (such as a smart watch), a tablet computer, a laptop computer, and the like, with which embodiments of the disclosure may be practiced.
  • the client may be a mobile computing device.
  • Referring to FIG. 36A, one aspect of a mobile computing device 500 for implementing the aspects is illustrated.
  • the mobile computing device 500 is a handheld computer having both input elements and output elements.
  • the mobile computing device 500 typically includes a display 505 and one or more input buttons 510 that allow the user to enter information into the mobile computing device 500.
  • the display 505 of the mobile computing device 500 may also function as an input device (e.g., a touch screen display). If included, an optional side input element 515 allows further user input.
  • the side input element 515 may be a rotary switch, a button, or any other type of manual input element.
  • mobile computing device 500 may incorporate more or fewer input elements.
  • the display 505 may not be a touch screen in some embodiments.
  • the mobile computing device 500 is a portable phone system, such as a cellular phone.
  • the mobile computing device 500 may also include an optional keypad 535.
  • Optional keypad 535 may be a physical keypad or a “soft” keypad generated on the touch screen display.
  • the output elements include the display 505 for showing a graphical user interface (GUI), a visual indicator 520 (e.g., a light emitting diode), and/or an audio transducer 525 (e.g., a speaker).
  • the mobile computing device 500 incorporates a vibration transducer for providing the user with tactile feedback.
  • the mobile computing device 500 incorporates input and/or output ports, such as an audio input (e.g., a microphone jack), an audio output (e.g., a headphone jack), and a video output (e.g., a HDMI port) for sending signals to or receiving signals from an external device.
  • FIG. 36B is a block diagram illustrating the architecture of one aspect of a mobile computing device. That is, the mobile computing device 500 can incorporate a system (e.g., an architecture) 502 to implement some aspects.
  • the system 502 is implemented as a “smart phone” capable of running one or more applications (e.g., browser, e-mail, calendaring, contact managers, messaging clients, games, and media clients/players).
  • the system 502 is integrated as a computing device, such as an integrated personal digital assistant (PDA) and wireless phone.
  • One or more application programs 566 may be loaded into the memory 562 and run on or in association with the operating system 564. Examples of the application programs include phone dialer programs, e-mail programs, personal information management (PIM) programs, word processing programs, spreadsheet programs, Internet browser programs, messaging programs, and so forth.
  • the system 502 also includes a non-volatile storage area 568 within the memory 562.
  • the non-volatile storage area 568 may be used to store persistent information that should not be lost if the system 502 is powered down.
  • the application programs 566 may use and store information in the non-volatile storage area 568, such as email or other messages used by an email application, and the like.
  • a synchronization application (not shown) also resides on the system 502 and is programmed to interact with a corresponding synchronization application resident on a host computer to keep the information stored in the non-volatile storage area 568 synchronized with corresponding information stored at the host computer.
  • other applications may be loaded into the memory 562 and run on the mobile computing device 500, including the instructions for determining relationships between users, as described herein.
  • the system 502 has a power supply 570, which may be implemented as one or more batteries.
  • the power supply 570 may further include an external power source, such as an AC adapter or a powered docking cradle that supplements or recharges the batteries.
  • the system 502 may also include a radio interface layer 572 that performs the function of transmitting and receiving radio frequency communications.
  • the radio interface layer 572 facilitates wireless connectivity between the system 502 and the “outside world,” via a communications carrier or service provider. Transmissions to and from the radio interface layer 572 are conducted under control of the operating system 564. In other words, communications received by the radio interface layer 572 may be disseminated to the application programs 566 via the operating system 564, and vice versa.
  • the visual indicator 520 may be used to provide visual notifications, and/or an audio interface 574 may be used for producing audible notifications via an audio transducer 525 (e.g., the audio transducer 525 illustrated in FIG. 36A).
  • the visual indicator 520 is a light emitting diode (LED) and the audio transducer 525 may be a speaker.
  • the LED may be programmed to remain on indefinitely until the user takes action to indicate the powered-on status of the device.
  • the audio interface 574 is used to provide audible signals to and receive audible signals from the user.
  • the audio interface 574 may also be coupled to a microphone to receive audible input, such as to facilitate a telephone conversation.
  • the microphone may also serve as an audio sensor to facilitate control of notifications, as will be described below.
  • the system 502 may further include a video interface 576 that enables operation of a peripheral device 530 (e.g., an on-board camera) to record still images, video streams, and the like. The audio interface 574, the video interface 576, and the keypad 535 may be operated to generate one or more messages as described herein.
  • a mobile computing device 500 implementing the system 502 may have additional features or functionality.
  • the mobile computing device 500 may also include additional data storage devices (removable and/or non-removable) such as, magnetic disks, optical disks, or tape.
  • additional storage is illustrated in FIG. 36B by the non-volatile storage area 568.
  • Data/information generated or captured by the mobile computing device 500 and stored via the system 502 may be stored locally on the mobile computing device 500, as described above, or the data may be stored on any number of storage media that may be accessed by the device via the radio interface layer 572 or via a wired connection between the mobile computing device 500 and a separate computing device associated with the mobile computing device 500, for example, a server computer in a distributed computing network, such as the Internet.
  • data/information may be accessed via the mobile computing device 500 via the radio interface layer 572 or via a distributed computing network.
  • data/information may be readily transferred between computing devices for storage and use according to well-known data/information transfer and storage means, including electronic mail and collaborative data/information sharing systems.
  • FIGS. 36A and 36B are described for purposes of illustrating the present methods and systems and are not intended to limit the disclosure to a particular sequence of steps or a particular combination of hardware or software components.
  • FIG.37 illustrates one aspect of the architecture of a system for processing data received at a computing system from a remote source, such as a general computing device 604 (e.g., personal computer), tablet computing device 606, or mobile computing device 608, as described above.
  • Content displayed at server device 602 may be stored in different communication channels or other storage types.
  • various messages may be received and/or stored using a directory service 622, a web portal 624, a mailbox service 626, an instant messaging store 628, or a social networking service 630.
  • the program modules 2408 may be employed by a client that communicates with server device 602, and/or the program modules 2408 may be employed by server device 602.
  • the server device 602 may provide data to and from a client computing device such as a general computing device 604, a tablet computing device 606 and/or a mobile computing device 608 (e.g., a smart phone) through a network 615.
  • the aspects described herein may be embodied in a general computing device 604 (e.g., personal computer), a tablet computing device 606, and/or a mobile computing device 608 (e.g., a smart phone).
  • FIG. 37 is described for purposes of illustrating the present methods and systems and is not intended to limit the disclosure to a particular sequence of steps or a particular combination of hardware or software components.
  • the embodiments of the invention described herein are implemented as logical steps in one or more computer systems.
  • the logical operations of the present invention are implemented (1) as a sequence of processor-implemented steps executing in one or more computer systems and (2) as interconnected machine or circuit modules within one or more computer systems.
  • the implementation is a matter of choice, dependent on the performance requirements of the computer system implementing the invention. Accordingly, the logical operations making up the embodiments of the invention described herein are referred to variously as operations, steps, objects, or modules.
  • logical operations may be performed in any order, unless explicitly claimed otherwise or a specific order is inherently necessitated by the claim language.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Algebra (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Feedback Control In General (AREA)

Abstract

Aspects of the present disclosure relate to machine learning techniques, including the decomposition of single-agent reinforcement learning problems into simpler problems handled by multiple agents. The actions proposed by the multiple agents are then aggregated using an aggregator, which selects an action to take with respect to an environment. Aspects of the present disclosure also relate to a hybrid reward model.
PCT/US2018/028743 2017-05-18 2018-04-21 Architecture de récompense hybride pour apprentissage par renforcement WO2018212918A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP18723249.1A EP3625731A1 (fr) 2017-05-18 2018-04-21 Architecture de récompense hybride pour apprentissage par renforcement

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
US201762508340P 2017-05-18 2017-05-18
US62/508,340 2017-05-18
US201762524461P 2017-06-23 2017-06-23
US62/524,461 2017-06-23
US15/634,914 US10977551B2 (en) 2016-12-14 2017-06-27 Hybrid reward architecture for reinforcement learning
US15/634,914 2017-06-27

Publications (1)

Publication Number Publication Date
WO2018212918A1 true WO2018212918A1 (fr) 2018-11-22

Family

ID=64274554

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2018/028743 WO2018212918A1 (fr) 2017-05-18 2018-04-21 Architecture de récompense hybride pour apprentissage par renforcement

Country Status (2)

Country Link
EP (1) EP3625731A1 (fr)
WO (1) WO2018212918A1 (fr)

Cited By (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109635913A (zh) * 2018-12-16 2019-04-16 北京工业大学 基于自适应贪婪的q学习算法足球系统仿真方法
CN109741626A (zh) * 2019-02-24 2019-05-10 苏州科技大学 停车场停车情况预测方法、调度方法和系统
CN110211572A (zh) * 2019-05-14 2019-09-06 北京来也网络科技有限公司 基于强化学习的对话控制方法及装置
CN110738356A (zh) * 2019-09-20 2020-01-31 西北工业大学 一种基于sdn增强网络的电动汽车充电智能调度方法
CN110928329A (zh) * 2019-12-24 2020-03-27 北京空间技术研制试验中心 一种基于深度q学习算法的多飞行器航迹规划方法
CN111062491A (zh) * 2019-12-13 2020-04-24 周世海 一种基于强化学习的智能体探索未知环境方法
CN111260072A (zh) * 2020-01-08 2020-06-09 上海交通大学 一种基于生成对抗网络的强化学习探索方法
CN111369181A (zh) * 2020-06-01 2020-07-03 北京全路通信信号研究设计院集团有限公司 一种列车自主调度深度强化学习方法和模块
CN112084721A (zh) * 2020-09-23 2020-12-15 浙江大学 一种多代理强化学习合作任务下的奖励函数建模方法
CN112331277A (zh) * 2020-10-28 2021-02-05 星药科技(北京)有限公司 一种基于强化学习的路径可控的药物分子生成方法
CN112633491A (zh) * 2019-10-08 2021-04-09 华为技术有限公司 训练神经网络的方法与装置
FR3102277A1 (fr) * 2019-10-17 2021-04-23 Continental Automotive Préparation de jeu de données pour un apprentissage automatique multi-agents
CN112820361A (zh) * 2019-11-15 2021-05-18 北京大学 一种基于对抗模仿学习的药物分子生成方法
CN112884066A (zh) * 2021-03-15 2021-06-01 网易(杭州)网络有限公司 数据处理方法及装置
CN112991544A (zh) * 2021-04-20 2021-06-18 山东新一代信息产业技术研究院有限公司 一种基于全景影像建模的群体疏散行为仿真方法
CN113191484A (zh) * 2021-04-25 2021-07-30 清华大学 基于深度强化学习的联邦学习客户端智能选取方法及系统
CN113239639A (zh) * 2021-06-29 2021-08-10 暨南大学 策略信息生成方法、装置、电子装置和存储介质
CN113254200A (zh) * 2021-05-13 2021-08-13 中国联合网络通信集团有限公司 资源编排方法及智能体
CN113486949A (zh) * 2021-07-02 2021-10-08 江苏罗思韦尔电气有限公司 基于YOLO v4渐进定位的遮挡目标检测方法及装置
US20210334696A1 (en) * 2020-04-27 2021-10-28 Microsoft Technology Licensing, Llc Training reinforcement machine learning systems
CN113704425A (zh) * 2021-08-27 2021-11-26 广东电力信息科技有限公司 一种结合知识增强和深度强化学习的对话策略优化方法
CN113723013A (zh) * 2021-09-10 2021-11-30 中国人民解放军国防科技大学 一种用于连续空间兵棋推演的多智能体决策方法
CN113799949A (zh) * 2020-06-11 2021-12-17 中国科学院沈阳自动化研究所 一种基于q学习的auv浮力调节方法
CN114066071A (zh) * 2021-11-19 2022-02-18 厦门大学 一种基于能耗的电力参数优化方法、终端设备及存储介质
CN114083539A (zh) * 2021-11-30 2022-02-25 哈尔滨工业大学 一种基于多智能体强化学习的机械臂抗干扰运动规划方法
CN114492845A (zh) * 2022-04-01 2022-05-13 中国科学技术大学 资源受限条件下提高强化学习探索效率的方法
CN115190079A (zh) * 2022-07-05 2022-10-14 吉林大学 基于分层强化学习的高铁自供电感知通信一体化交互方法
US11487972B2 (en) * 2018-08-31 2022-11-01 Hitachi, Ltd. Reward function generation method and computer system
CN116384469A (zh) * 2023-06-05 2023-07-04 中国人民解放军国防科技大学 一种智能体策略生成方法、装置、计算机设备和存储介质

Non-Patent Citations (22)

* Cited by examiner, † Cited by third party
Title
BOHMER ET AL.: "Autonomous learning of state representations for control: An emerging field aims to autonomously learn state representations for reinforcement learning agents from their real-world sensor observations", KI-KUNSTLICHE INTELLIGENZ, 2015
BREIMAN: "Random forests", MACHINE LEARNING, 2001
DAYAN; HINTON: "Feudal reinforcement learning", PROCEEDINGS OF THE 7TH ANNUAL CONFERENCE ON NEURAL INFORMATION PROCESSING SYSTEMS (NIPS), 1993
GABOR ET AL.: "Multi-criteria reinforcement learning", PROCEEDINGS OF THE 15TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING (ICML), 1998
GLOROT ET AL.: "Understanding the difficulty of training deep feedforward neural networks", PROCEEDINGS OF THE 13TH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND STATISTICS, 2010
HARM VAN SEIJEN ET AL: "Hybrid Reward Architecture for Reinforcement Learning", ARXIV.ORG, 13 June 2017 (2017-06-13), CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY, XP080769588 *
HARUTYUNYAN ET AL.: "Off-policy reward shaping with ensembles", ARXIV: 1502.03248, 2015
LAROCHE ET AL.: "Algorithm selection of off-policy reinforcement learning algorithm", ARXIV:1701.08810, 2017
LAROCHE ET AL.: "Hybridization of expertise and reinforcement learning in dialogue systems", PROCEEDINGS OF THE 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH), 2009
LAROCHE ET AL., MULTI-ADVISOR REINFORCEMENT LEARNING, 3 April 2017 (2017-04-03)
LAROCHE ET AL.: "MULTI-ADVISOR REINFORCEMENT LEARNING", ARXIV.ORG, 3 April 2017 (2017-04-03), XP002783157 *
RICHARD S SUTTON ET AL: "Horde: A Scalable Real-time Architecture for Learning Knowledge from Unsupervised Sensorimotor Interaction", PROC. OF 10TH INT. CONF. ON AUTONOMOUS AGENTS AND MULTIAGENT SYSTEMS, 6 May 2011 (2011-05-06), pages 761 - 768, XP055343919 *
RUMMERY ET AL.: "Online Q-learning using connectionist systems", 1994, UNIVERSITY OF CAMBRIDGE, DEPARTMENT OF ENGINEERING
MNIH ET AL.: "Human-level control through deep reinforcement learning", NATURE, vol. 518, 2015, pages 529 - 533, XP055283401, DOI: doi:10.1038/nature14236
STUART RUSSELL ET AL: "Q-Decomposition for Reinforcement Learning Agents", 2003, XP055492516, Retrieved from the Internet <URL:https://people.eecs.berkeley.edu/~russell/papers/ml03-qdecomp.pdf> [retrieved on 20180713] *
SUTTON ET AL.: "Between mdps and semi-mdps: a framework for temporal abstraction in reinforcement learning", ARTIFICIAL INTELLIGENCE, 1999
TOM SCHAUL ET AL: "Universal Value Function Approximators", VOLUME 37: PROCEEDINGS OF THE 32ND INTERNATIONAL CONFERENCE ON MACHINE LEARNING, 11 July 2015 (2015-07-11), XP055344798 *
VAN HASSELT ET AL.: "Learning values across many orders of magnitude", PROCEEDINGS OF ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 29 (NIPS), 2016
VAN SEIJEN ET AL., SEPARATION OF CONCERNS IN REINFORCEMENT LEARNING, 15 December 2016 (2016-12-15)
VEZHNEVETS ET AL.: "Feudal networks for hierarchical reinforcement learning", ARXIV: 1703.01161, 2017
WATKINS: "Learning from Delayed Rewards, PhD thesis", 1989, CAMBRIDGE UNIVERSITY
WIERING ET AL.: "Ensemble algorithms in reinforcement learning", IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS, 2008

Cited By (42)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11487972B2 (en) * 2018-08-31 2022-11-01 Hitachi, Ltd. Reward function generation method and computer system
CN109635913A (zh) * 2018-12-16 2019-04-16 北京工业大学 基于自适应贪婪的q学习算法足球系统仿真方法
CN109741626A (zh) * 2019-02-24 2019-05-10 苏州科技大学 停车场停车情况预测方法、调度方法和系统
CN109741626B (zh) * 2019-02-24 2023-09-29 苏州科技大学 停车场停车情况预测方法、调度方法和系统
CN110211572A (zh) * 2019-05-14 2019-09-06 北京来也网络科技有限公司 基于强化学习的对话控制方法及装置
CN110738356A (zh) * 2019-09-20 2020-01-31 西北工业大学 一种基于sdn增强网络的电动汽车充电智能调度方法
CN112633491A (zh) * 2019-10-08 2021-04-09 华为技术有限公司 训练神经网络的方法与装置
FR3102277A1 (fr) * 2019-10-17 2021-04-23 Continental Automotive Préparation de jeu de données pour un apprentissage automatique multi-agents
CN112820361B (zh) * 2019-11-15 2023-09-22 北京大学 一种基于对抗模仿学习的药物分子生成方法
CN112820361A (zh) * 2019-11-15 2021-05-18 北京大学 一种基于对抗模仿学习的药物分子生成方法
CN111062491A (zh) * 2019-12-13 2020-04-24 周世海 一种基于强化学习的智能体探索未知环境方法
CN110928329B (zh) * 2019-12-24 2023-05-02 北京空间技术研制试验中心 一种基于深度q学习算法的多飞行器航迹规划方法
CN110928329A (zh) * 2019-12-24 2020-03-27 北京空间技术研制试验中心 一种基于深度q学习算法的多飞行器航迹规划方法
CN111260072A (zh) * 2020-01-08 2020-06-09 上海交通大学 一种基于生成对抗网络的强化学习探索方法
US11663522B2 (en) * 2020-04-27 2023-05-30 Microsoft Technology Licensing, Llc Training reinforcement machine learning systems
US20210334696A1 (en) * 2020-04-27 2021-10-28 Microsoft Technology Licensing, Llc Training reinforcement machine learning systems
CN111369181B (zh) * 2020-06-01 2020-09-29 北京全路通信信号研究设计院集团有限公司 一种列车自主调度深度强化学习方法和装置
CN111369181A (zh) * 2020-06-01 2020-07-03 北京全路通信信号研究设计院集团有限公司 一种列车自主调度深度强化学习方法和模块
CN113799949A (zh) * 2020-06-11 2021-12-17 中国科学院沈阳自动化研究所 一种基于q学习的auv浮力调节方法
CN113799949B (zh) * 2020-06-11 2022-07-26 中国科学院沈阳自动化研究所 一种基于q学习的auv浮力调节方法
CN112084721A (zh) * 2020-09-23 2020-12-15 浙江大学 一种多代理强化学习合作任务下的奖励函数建模方法
CN112331277A (zh) * 2020-10-28 2021-02-05 星药科技(北京)有限公司 一种基于强化学习的路径可控的药物分子生成方法
CN112884066A (zh) * 2021-03-15 2021-06-01 网易(杭州)网络有限公司 数据处理方法及装置
CN112991544A (zh) * 2021-04-20 2021-06-18 山东新一代信息产业技术研究院有限公司 一种基于全景影像建模的群体疏散行为仿真方法
CN113191484A (zh) * 2021-04-25 2021-07-30 清华大学 基于深度强化学习的联邦学习客户端智能选取方法及系统
CN113191484B (zh) * 2021-04-25 2022-10-14 清华大学 基于深度强化学习的联邦学习客户端智能选取方法及系统
CN113254200A (zh) * 2021-05-13 2021-08-13 中国联合网络通信集团有限公司 资源编排方法及智能体
CN113254200B (zh) * 2021-05-13 2023-06-09 中国联合网络通信集团有限公司 资源编排方法及智能体
CN113239639A (zh) * 2021-06-29 2021-08-10 暨南大学 策略信息生成方法、装置、电子装置和存储介质
CN113239639B (zh) * 2021-06-29 2022-08-26 暨南大学 策略信息生成方法、装置、电子装置和存储介质
CN113486949A (zh) * 2021-07-02 2021-10-08 江苏罗思韦尔电气有限公司 基于YOLO v4渐进定位的遮挡目标检测方法及装置
CN113486949B (zh) * 2021-07-02 2023-03-24 江苏罗思韦尔电气有限公司 基于YOLO v4渐进定位的遮挡目标检测方法及装置
CN113704425A (zh) * 2021-08-27 2021-11-26 广东电力信息科技有限公司 一种结合知识增强和深度强化学习的对话策略优化方法
CN113723013A (zh) * 2021-09-10 2021-11-30 中国人民解放军国防科技大学 一种用于连续空间兵棋推演的多智能体决策方法
CN114066071A (zh) * 2021-11-19 2022-02-18 厦门大学 一种基于能耗的电力参数优化方法、终端设备及存储介质
CN114083539A (zh) * 2021-11-30 2022-02-25 哈尔滨工业大学 一种基于多智能体强化学习的机械臂抗干扰运动规划方法
CN114492845B (zh) * 2022-04-01 2022-07-15 中国科学技术大学 资源受限条件下提高强化学习探索效率的方法
CN114492845A (zh) * 2022-04-01 2022-05-13 中国科学技术大学 资源受限条件下提高强化学习探索效率的方法
CN115190079B (zh) * 2022-07-05 2023-09-15 吉林大学 基于分层强化学习的高铁自供电感知通信一体化交互方法
CN115190079A (zh) * 2022-07-05 2022-10-14 吉林大学 基于分层强化学习的高铁自供电感知通信一体化交互方法
CN116384469A (zh) * 2023-06-05 2023-07-04 中国人民解放军国防科技大学 一种智能体策略生成方法、装置、计算机设备和存储介质
CN116384469B (zh) * 2023-06-05 2023-08-08 中国人民解放军国防科技大学 一种智能体策略生成方法、装置、计算机设备和存储介质

Also Published As

Publication number Publication date
EP3625731A1 (fr) 2020-03-25

Similar Documents

Publication Publication Date Title
US10977551B2 (en) Hybrid reward architecture for reinforcement learning
WO2018212918A1 (fr) Architecture de récompense hybride pour apprentissage par renforcement
Ladosz et al. Exploration in deep reinforcement learning: A survey
Gronauer et al. Multi-agent deep reinforcement learning: a survey
Wang et al. Active model learning and diverse action sampling for task and motion planning
Santos et al. Dyna-H: A heuristic planning reinforcement learning algorithm applied to role-playing game strategy decision systems
US8112369B2 (en) Methods and systems of adaptive coalition of cognitive agents
Fang et al. Dynamics learning with cascaded variational inference for multi-step manipulation
Berkenkamp Safe exploration in reinforcement learning: Theory and applications in robotics
Andersen et al. Towards safe reinforcement-learning in industrial grid-warehousing
Bakker et al. Quasi-online reinforcement learning for robots
Huang et al. A novel policy based on action confidence limit to improve exploration efficiency in reinforcement learning
Barrett Making friends on the fly: advances in ad hoc teamwork
US11850752B2 (en) Robot movement apparatus and related methods
Liu et al. Forward-looking imaginative planning framework combined with prioritized-replay double DQN
Arulkumaran Sample Efficiency, Transfer Learning and Interpretability for Deep Reinforcement Learning
Hutsebaut-Buysse Learning to navigate through abstraction and adaptation
Sasikumar Exploration in feature space for reinforcement learning
Veeriah Discovery in Reinforcement Learning
Girard et al. A robust approach to robot team learning
Tang Towards Informed Exploration for Deep Reinforcement Learning
Aubret Learning increasingly complex skills through deep reinforcement learning using intrinsic motivation
Furuyama et al. Extrinsicaly Rewarded Soft Q Imitation Learning with Discriminator
Kurek Deep reinforcement learning in keepaway soccer
Abreu E 2 RL–Efficient Exploration in Reinforcement Learning

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18723249

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2018723249

Country of ref document: EP

Effective date: 20191218