CN115066686A - Generating implicit plans that achieve a goal in an environment using attention operations over plan embeddings - Google Patents

Generating implicit plans that achieve a goal in an environment using attention operations over plan embeddings

Info

Publication number
CN115066686A
CN115066686A (application CN202180013484.1A)
Authority
CN
China
Prior art keywords
embedding
plan
action
agent
environment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202180013484.1A
Other languages
Chinese (zh)
Inventor
S. Ritter
R. Faulkner
D. N. Raposo
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
DeepMind Technologies Ltd
Original Assignee
DeepMind Technologies Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by DeepMind Technologies Ltd filed Critical DeepMind Technologies Ltd
Publication of CN115066686A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods
    • G06N 3/092: Reinforcement learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A method, system, and apparatus, including computer programs encoded on a computer storage medium, for selecting an action to be performed by an agent interacting with an environment to achieve a goal. In one aspect, a method comprises: generating a respective plan embedding corresponding to each of a plurality of experience tuples in an external memory, wherein each experience tuple characterizes an interaction of the agent with the environment at a previous time step; processing the plan embeddings using a planning neural network to generate an implicit plan for achieving the goal; and selecting, using the implicit plan, an action to be performed by the agent at the time step.

Description

Generating implicit plans that achieve a goal in an environment using attention operations over plan embeddings
Background
This specification relates to processing data using machine learning models.
A machine learning model receives an input and generates an output, e.g., a predicted output, based on the received input. Some machine learning models are parametric models that generate the output based on the received input and on the values of the model's parameters.
Some machine learning models are deep models that employ multiple layers of models to generate an output for a received input. For example, a deep neural network is a deep machine learning model that includes an output layer and one or more hidden layers, each of which applies a non-linear transformation to a received input to generate an output.
Disclosure of Invention
This specification describes an action selection system implemented as a computer program on one or more computers at one or more locations for controlling agents interacting with the environment to achieve a goal.
Throughout this specification, an "embedding" of an entity (e.g., a view of an environment) may refer to a representation of the entity as an ordered collection of numerical values, e.g., a vector or matrix of numerical values. An embedding of an entity may be generated, for example, as the output of a neural network that processes data characterizing the entity.
According to a first aspect, there is provided a method performed by one or more data processing apparatus for selecting an action to be performed by an agent interacting with an environment to achieve a goal.
The method includes generating a respective plan embedding corresponding to, e.g., including a representation of, each of a plurality of experience tuples in an external memory, wherein each experience tuple characterizes an interaction of the agent with the environment at a respective previous time step. Optionally, each plan embedding may also include a representation of the goal, e.g., a goal embedding.
Thus, in an implementation, the plan embedding characterizes prior interactions of the agent with the environment, and optionally, characterizes the target. In an implementation, the plan embedding does not include a representation of the current observation that characterizes the current environmental state.
The method may include processing the plan embeddings using a planning neural network to generate an implicit plan for achieving the goal. The implicit plan may thus comprise an embedding that encodes information about the prior interactions of the agent with the environment and, optionally, the goal. It may also depend on a representation of the current observation, as described later. It may implicitly characterize the actions that can be performed by the agent to accomplish the goal. The planning neural network may be any neural network configured to process the plan embeddings, optionally a goal embedding, and, in an implementation, a representation of the current observation. However, in implementations, the planning neural network may include one or more self-attention layers, as described below.
The method may also include using the implicit plan to select an action to be performed by the agent at the time step.
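To make the overall flow of the method concrete, the following Python sketch strings the three steps together in a toy NumPy setting. It is illustrative only and not taken from this specification: every function shown (e.g., generate_plan_embeddings, planning_network, select_action) is a hypothetical placeholder, and the simple operations inside them stand in for the neural networks described above and detailed later.

```python
# Illustrative end-to-end sketch (not from the patent); all names are
# hypothetical stand-ins for the components described in this specification.
import numpy as np

def generate_plan_embeddings(memory, goal_embedding):
    """One plan embedding per stored experience tuple, with the goal appended."""
    return np.stack([np.concatenate([tup, goal_embedding]) for tup in memory])

def planning_network(plan_embeddings, observation_embedding):
    """Placeholder for the planning neural network: here, a mean over the
    plan embeddings concatenated with the current observation embedding."""
    return np.concatenate([plan_embeddings.mean(axis=0), observation_embedding])

def select_action(implicit_plan, num_actions=4):
    """Placeholder action selection: scores from a fixed random projection."""
    rng = np.random.default_rng(0)
    scores = rng.normal(size=(num_actions, implicit_plan.size)) @ implicit_plan
    return int(np.argmax(scores))

# Toy data: 5 stored experience tuples of size 8, a goal embedding, an observation.
memory = [np.random.default_rng(i).normal(size=8) for i in range(5)]
plans = generate_plan_embeddings(memory, goal_embedding=np.zeros(3))
implicit_plan = planning_network(plans, observation_embedding=np.ones(6))
print(select_action(implicit_plan))
```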
In an implementation, the method iteratively updates the plan embeddings using attention over the plan embeddings, e.g., using an attention subnetwork. Multiple iterations of the same attention function, e.g., a self-attention function, may be applied to the plan embeddings. The implicit plan can then be generated based on the plan embeddings and the current observation. Generating the implicit plan may include appending a representation of the current observation to each plan embedding and processing the combined embeddings using one or more neural network layers (e.g., self-attention layers, e.g., of the attention subnetwork). In an implementation, these neural network layers do not themselves process the representation of the current observation.
Broadly, using attention involves applying an attention mechanism, e.g., a self-attention mechanism, that relates the plan embeddings to one another to determine the implicit plan. The details of the attention mechanism vary, but in general the attention mechanism may map a learned query vector and a learned set of key-value vector pairs to an output. The output may be computed as a weighted sum of the values, with the weights depending on the similarity between the query and the keys. In this type of self-attention mechanism, the input to the attention mechanism may be the set of plan embeddings and the output may be a transformed version of the same set of plan embeddings. As just one example, a dot-product attention mechanism (including a multi-head attention variant) is described in arXiv:1706.03762. In implementations, the use of (self-)attention helps to determine relationships between past states.
In some implementations, using attention over the plan embeddings involves processing the plan embeddings using a residual neural network block (i.e., one that includes a residual or skip connection). The residual neural network block may be configured to apply a series of operations to the plan embeddings, including a layer normalization operation (see, e.g., arXiv:1607.06450), an attention operation, and a linear projection operation.
The method may involve jointly training the planning neural network and an action selection neural network using reinforcement learning techniques, e.g., by back-propagating the gradient of a reinforcement learning objective function. The reinforcement learning objective function may be any suitable objective function, e.g., a temporal-difference objective function or a policy-gradient objective function, e.g., an actor-critic objective function, that depends on the rewards the agent receives from the environment in response to its actions.
Particular embodiments of the subject matter described in this specification can be implemented to realize one or more of the following advantages.
The system described in this specification can enable an agent to use information learned about an environment to generate "implicit plans" for solving tasks (i.e., accomplishing goals) in the environment. An implicit plan refers to data (e.g., numerical data represented as an ordered collection of numerical values, such as a vector or matrix of numerical values) that implicitly characterizes actions that may be performed by an agent to complete a task. By selecting the actions to be performed by the agent using implicit plans, the system described herein may enable the agent to complete tasks and explore the environment more efficiently (e.g., in fewer time steps). That is, the described techniques allow a mix of exploratory and goal-directed behavior, while enabling agents to learn to plan over long time scales, so that once trained, the agents can generalize outside of their training experience. Thus, in particular, the system described in this specification can enable an agent to leverage its previously acquired knowledge of tasks and environments to efficiently perform new (i.e., previously unseen) tasks in new environments. In one example, the agent may be a consumer robot that performs household tasks (e.g., a cleaning task), and the system described in this specification may enable the agent to efficiently perform a new task when the agent is placed in a new environment (e.g., a room in a different house).
The system described in this specification can generate an implicit plan for solving a task by generating plan embeddings based on the agent's past interactions with the environment and iteratively updating the plan embeddings using attention operations. Iteratively updating the plan embeddings using attention operations allows information to be shared between the plan embeddings, facilitating more effective planning, which may enable the agent to complete tasks and explore the environment more efficiently, e.g., in fewer time steps.
The details of one or more embodiments of the subject matter of this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.
Drawings
FIG. 1 is a block diagram of an example action selection system.
Fig. 2 illustrates an example architecture of a planning neural network included in an action selection system.
FIG. 3 is a flow diagram of an example process for selecting actions to be performed by an agent interacting with an environment to achieve a goal.
Fig. 4 is a schematic diagram of an example of the system of fig. 1.
Like reference numbers and designations in the various drawings indicate like elements.
Detailed Description
FIG. 1 is a block diagram of an example action selection system 100. Action selection system 100 is an example of a system implemented as a computer program on one or more computers at one or more locations, implementing the systems, components, and techniques described below.
The system 100 selects an action 102 to be performed by an agent 104 interacting with the environment 106 at each of a plurality of time steps to achieve the goal. At each time step, the system 100 receives data characterizing the current state of the environment 106, such as an image of the environment 106, and selects an action 102 to be performed by the agent 104 in response to the received data. The data characterizing the state of the environment 106 will be referred to in this specification as observations 110. At each time step, the state of the environment 106 at the time step (as characterized by the observations 110) depends on the state of the environment 106 at the previous time step and the action 102 performed by the agent 104 at the previous time step.
At each time step, the system 100 may receive a reward 108 based on the current state of the environment 106 and the agent's 104 actions 102 at that time step. Generally, the reward 108 may be expressed as a numerical value. The reward 108 may be based on any event or aspect in the environment 106. For example, the reward 108 may indicate whether the agent 104 has completed the goal (e.g., navigated to a goal location in the environment 106) or the agent's 104 progress toward achieving the goal.
In some implementations, the environment is a real-world environment and the agent is a mechanical agent that interacts with the real-world environment. For example, the agent may be a robot that interacts with the environment to achieve a goal, e.g., to locate an object of interest in the environment, to move an object of interest to a specified location in the environment, to physically manipulate an object of interest in the environment in a specified manner, or to navigate to a specified destination in the environment; alternatively, the agent may be an autonomous or semi-autonomous land, air, or marine vehicle that navigates within the environment to a specified destination within the environment. The action may then be an action taken by the mechanical agent to achieve the goal in the real-world environment, and may include control signals that control the mechanical agent.
In these implementations, the observations may include, for example, one or more of images, object position data, and sensor data to capture observations as the agent interacts with the environment, such as sensor data from image, distance or position sensors, or from actuators.
For example, in the case of a robot, the observations may include data characterizing the current state of the robot, such as one or more of: joint position, joint velocity, joint force, torque or acceleration, such as gravity compensated torque feedback, and a global or relative pose of an item held by the robot.
In the case of a robot or other mechanical agent or vehicle, the observations may similarly include one or more of the position, linear or angular velocity, force, torque or acceleration, and global or relative pose of one or more parts of the agent. The observations may be defined in 1, 2 or 3 dimensions, and may be absolute and/or relative observations.
The observation may also include, for example, data obtained by one of a plurality of sensor devices sensing the real-world environment; for example, sensed electronic signals, such as motor current or temperature signals; and/or image or video data, e.g., from a camera or LIDAR sensor, such as data from a sensor of an agent or data from a sensor located separately from an agent in the environment.
In the case of an electronic agent, the observations may include data from one or more sensors monitoring part of a plant or service facility, such as current, voltage, power, temperature and other sensors, and/or electronic signals representing the functioning of electronic and/or mechanical items of equipment.
The action may be a control input to control the robot, e.g., a torque for a joint of the robot or a higher-level control command; or to control an autonomous or semi-autonomous land, air, or sea vehicle, e.g., a torque for a control surface or other control element of the vehicle, or a higher-level control command.
In other words, the action may include, for example, position, velocity, or force/torque/acceleration data of one or more joints of the robot or a portion of another mechanical agent. The actions may additionally or alternatively comprise electronic control data, such as motor control data, or more generally data for controlling one or more electronic devices within the environment, the control of which has an effect on the observed state of the environment. For example, in the case of autonomous or semi-autonomous land, air or marine vehicles, the actions may include actions that control navigation (e.g., steering) and movement (e.g., braking and/or acceleration of the vehicle).
In some implementations, the environment is a simulated environment, such as a simulation of the real-world environment described above, and the agent is implemented as one or more computers that interact with the simulated environment. For example, the simulated environment may be a simulation of a robot or vehicle, and the reinforcement learning system may be trained on the simulation and then, once trained, may be used in the real world.
For example, the simulated environment may be a motion simulation environment, such as a driving simulation or a flight simulation, and the agent may be a simulated vehicle that navigates in the motion simulation. In these implementations, the action may be a control input to control a simulated user or a simulated vehicle.
In another example, the simulated environment may be a video game and the agent may be a simulated user playing the video game.
In another example, the simulated environment may be a protein folding environment such that each state is a respective state of a protein chain and the agent is a computer system for determining how to fold the protein chain. In this example, the action is a possible folding action for folding a protein chain, and the goal to be achieved may include, for example, folding the protein to stabilize the protein and enable it to perform a particular biological function.
Typically in the case of a simulated environment, an observation may include a simulated version of one or more previously described observations or observation types, and an action may include a simulated version of one or more previously described actions or action types.
In some cases, the action-selection system 100 may be used to control the interaction of agents with a simulation environment, and the training engine may train parameters of the action-selection system (e.g., using reinforcement learning techniques) based on the interaction of the agents with the simulation environment. After training the action selection system based on the interaction of the agent with the simulated environment, the agent may be deployed in the real-world environment, and the trained action selection system may be used to control the interaction of the agent with the real-world environment. Training the action selection system based on the agent's interaction with the simulated environment (i.e., not the real-world environment) may avoid wear on the agent and may reduce the likelihood that the agent may damage itself or various aspects of its environment by performing inappropriate selection actions.
In some other applications, an agent may control actions in a real-world environment that includes items of equipment, for example in a data center, in a grid mains power or water distribution system, or in a manufacturing plant or service facility. The observations may relate to the operation of the plant or facility. For example, the observations may include observations of power or water usage by equipment, or observations of power generation or distribution control, or observations of the usage of a resource or of waste production. The agent may control actions in the environment to achieve the goal of increasing efficiency, for example by reducing resource usage, and/or of reducing the environmental impact of operations in the environment, for example by reducing waste. The actions may include actions that control or impose operating conditions on items of equipment of the plant/facility, and/or actions that result in changes to settings in the operation of the plant/facility, e.g., to adjust or turn on/off components of the plant/facility.
In some further applications, the environment is a real-world environment, and the agent manages task allocation across computing resources, e.g., on a mobile device and/or in a data center. In these implementations, the actions may include assigning tasks to particular computing resources, and the goals to be achieved may include minimizing the time required to complete a set of tasks using the specified computing resources.
As another example, the action may include presenting an advertisement, the observation may include an advertisement impression or click count or rate, and the reward may characterize previous selections of items or content taken by one or more users. In this example, the goal to be achieved may include maximizing the selection of items or content by one or more users.
Optionally, in any of the above implementations, the observations at any given time step may include data from a previous time step that may be beneficial in characterizing the environment, e.g., actions performed at the previous time step, rewards received at the previous time step, etc.
The system 100 uses the external memory 114, the planning neural network 200, and the action-selecting neural network 120 to select actions to be performed by the agent 104 at each time step, as will be described in more detail below.
Memory 114 stores a respective "experience tuple" corresponding to each of a plurality of previous time steps (e.g., memory 114 may store a respective experience tuple for each time step prior to the current time step). The memory 114 may be implemented, for example, as a logical data storage area or a physical data storage device.
An experience tuple for a previous time step refers to data that characterizes the interaction of the agent 104 with the environment 106 at that time step. For example, the experience tuple for the previous time step may include the following respective embeddings (representations): (i) the observation at the previous time step; (ii) the action performed by the agent at the previous time step; and (iii) the subsequent observation resulting from the action performed by the agent at the previous time step.
The system 100 can generate embeddings of observations (e.g., for inclusion in experience tuples) by providing the observations to an embedding neural network configured to process an observation to generate a corresponding embedding. The system 100 may generate an embedding of an action (e.g., for inclusion in an experience tuple) as a one-hot embedding that uniquely identifies the action within the set of possible actions.
In some implementations, the system 100 clears the memory 114 (i.e., by deleting or overwriting the contents of the memory 114) each time a clearing criterion is satisfied. For example, the clearing criterion may be met if the agent has completed the goal in the environment, if the agent is placed in a new environment, or if the memory is full (e.g., because an experience tuple is stored in each available slot of the memory).
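As a minimal sketch of how such an external memory might be organized, the following Python class stores experience tuples and clears itself when a criterion is met. The class, its capacity, and the specific clearing criterion are assumptions made for illustration rather than the implementation described in this specification.

```python
# Minimal illustrative sketch of an external memory of experience tuples.
# The class, capacity, and clearing criteria below are assumptions for
# illustration, not the implementation described in this specification.
from collections import namedtuple

ExperienceTuple = namedtuple(
    "ExperienceTuple", ["obs_embedding", "action_embedding", "next_obs_embedding"])

class ExternalMemory:
    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.tuples = []

    def add(self, obs_embedding, action_embedding, next_obs_embedding):
        # Clear when full, as one possible clearing criterion.
        if len(self.tuples) >= self.capacity:
            self.clear()
        self.tuples.append(
            ExperienceTuple(obs_embedding, action_embedding, next_obs_embedding))

    def clear(self):
        """Called when a clearing criterion is met, e.g., goal completed,
        agent placed in a new environment, or memory full."""
        self.tuples = []

    def most_recent(self, L):
        """Return the experience tuples for the L most recent time steps."""
        return self.tuples[-L:]
```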
To select an action to be performed at a time step, the system 100 generates a respective "plan" embedding 116 corresponding to each of a plurality of experience tuples stored in the memory 114. In some implementations, the system 100 can generate the plan embedding 116 for an experience tuple, for example, by concatenating an embedding of a "goal" observation to the experience tuple, where the goal observation represents the state of the environment when the agent's goal has been completed. For example, if the agent's goal is to navigate to a specified location in the environment, the goal observation may be an observation that represents the state of the environment when the agent is located at the specified location. In some other implementations, the system 100 may define the plan embedding 116 associated with an experience tuple as a copy of the experience tuple (e.g., such that the plan embedding 116 and the experience tuple stored in the memory 114 are the same).
The system 100 can generate a respective plan embedding 116 corresponding to each experience tuple stored in the memory 114. Alternatively, the system 100 may generate plan embeddings 116 for only a proper subset of the experience tuples stored in the memory 114, e.g., only for the experience tuples corresponding to a predetermined number L of the most recent time steps (L may be any suitable positive integer value, e.g., L = 5).
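The following NumPy sketch illustrates one way plan embeddings could be constructed along these lines: each experience-tuple embedding (observation embedding, one-hot action, next-observation embedding) is concatenated with a goal-observation embedding, and only the L most recent tuples are used. All dimensions and the specific representations are assumptions for the example.

```python
# Illustrative sketch (assumed dimensions and representations) of generating
# plan embeddings from the L most recent experience tuples plus a goal embedding.
import numpy as np

OBS_DIM, NUM_ACTIONS, L = 16, 4, 5

def one_hot(action_id, num_actions=NUM_ACTIONS):
    v = np.zeros(num_actions)
    v[action_id] = 1.0
    return v

def experience_tuple_embedding(obs_emb, action_id, next_obs_emb):
    """Concatenate (observation, one-hot action, next-observation) embeddings."""
    return np.concatenate([obs_emb, one_hot(action_id), next_obs_emb])

def plan_embeddings(experience_tuples, goal_obs_emb, L=L):
    """Concatenate the goal-observation embedding onto each of the L most
    recent experience-tuple embeddings."""
    recent = experience_tuples[-L:]
    return np.stack([np.concatenate([t, goal_obs_emb]) for t in recent])

rng = np.random.default_rng(0)
tuples = [experience_tuple_embedding(rng.normal(size=OBS_DIM),
                                     rng.integers(NUM_ACTIONS),
                                     rng.normal(size=OBS_DIM)) for _ in range(12)]
plans = plan_embeddings(tuples, goal_obs_emb=rng.normal(size=OBS_DIM))
print(plans.shape)  # (5, 16 + 4 + 16 + 16) = (5, 52)
```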
The planning neural network 200 is configured to process: (i) the plan embeddings 116 representing prior interactions of the agent with the environment; and (ii) the current observation 110 representing the current state of the environment, to generate an "implicit plan" 118 for accomplishing the agent's goal. The implicit plan 118 is an embedding that can encode information about the current state of the environment (from the current observation 110), the history of the agent's interactions with the environment (from the plan embeddings 116), and, optionally, the goal to be completed by the agent.
The planning neural network 200 may have any suitable neural network architecture that enables it to perform its described functions. As part of generating the implicit plan 118 from the plan embeddings 116, the planning neural network 200 may enrich the plan embeddings by updating them using self-attention operations. An example architecture of the planning neural network 200 is described in more detail with reference to FIG. 2.
The action selecting neural network 120 is configured to process inputs including the implicit plan 118 generated by the planning neural network 200 to generate an action selection output 122. Optionally, the action selecting neural network 120 may process other data than the implicit plan 118, e.g., the action selecting neural network 120 may also process respective embeddings of one or more of: a current observation, an action performed at a previous time step, or a reward received at a previous time step. The action selection output 122 may include a respective score for each action in the set of possible actions that the agent may perform.
The system 100 uses the action selection output 122 generated by the action selection neural network 120 at the time step to select the action 102 to be performed by the agent 104 at the time step. For example, the system 100 may select the action with the highest score according to the action selection output 122 as the action to be performed by the agent at the time step. In some implementations, the system 100 selects the action to be performed by the agent in accordance with an exploration policy. For example, the system 100 may use an ε-greedy exploration strategy. In this example, the system 100 can select the highest-scoring action (from the action selection output 122) with probability 1 − ε and select an action at random with probability ε, where ε is a number between 0 and 1.
The action selecting neural network 120 may have any suitable neural network architecture that enables it to perform its described functions. For example, the action selecting neural network may include any suitable neural network layers (e.g., convolutional layers, fully-connected layers, attention layers, etc.) connected in any suitable configuration (e.g., as a linear sequence of layers). In one example, the action-selecting neural network 120 may include: an input layer configured to receive the implicit plan 118, a linear sequence of a plurality of fully-connected layers, and an output layer including a respective neuron corresponding to each action in a set of possible actions that the agent may perform.
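A minimal sketch of this action selection path, assuming a two-layer fully-connected network over the implicit plan and an ε-greedy choice over the resulting scores, is shown below; the layer sizes and architecture are illustrative assumptions.

```python
# Illustrative sketch (assumed sizes, two fully-connected layers) of an
# action selection network mapping the implicit plan to per-action scores,
# followed by epsilon-greedy selection as described above.
import numpy as np

rng = np.random.default_rng(0)
PLAN_DIM, HIDDEN_DIM, NUM_ACTIONS = 32, 64, 4

W1 = rng.normal(scale=0.1, size=(HIDDEN_DIM, PLAN_DIM))
W2 = rng.normal(scale=0.1, size=(NUM_ACTIONS, HIDDEN_DIM))

def action_scores(implicit_plan):
    """Two fully-connected layers with a ReLU nonlinearity in between."""
    hidden = np.maximum(0.0, W1 @ implicit_plan)
    return W2 @ hidden

def epsilon_greedy(scores, epsilon=0.1):
    """With probability epsilon pick a random action, otherwise the best-scoring one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(scores)))
    return int(np.argmax(scores))

implicit_plan = rng.normal(size=PLAN_DIM)
action = epsilon_greedy(action_scores(implicit_plan))
print(action)
```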
After the system 100 selects the action 102 to be performed by the agent 104 at that time step, the agent 104 interacts with the environment 106 by performing the action 102, and the system 100 may receive the reward 108 based on the interaction. The system 100 can generate an experience tuple characterizing the interaction of the agent with the environment at that time step and store the experience tuple in the memory 114.
The training engine 112 may use the observations 110 and corresponding rewards 108 resulting from the interaction of the agent 104 with the environment 106 to train the action selection system 100 using reinforcement learning techniques. The training engine 112 trains the action selection system 100 by iteratively adjusting the parameters of the action selection neural network 120 and of the planning neural network 200. The training engine 112 may adjust the parameters of the action selection system 100 by iteratively back-propagating the gradient of a reinforcement learning objective function through the action selection system 100. By training the action selection system 100, the training engine 112 may cause the action selection system 100 to select actions that increase a cumulative measure of the rewards received (e.g., a long-term time-discounted cumulative reward) and that cause the agent to complete its goal more effectively (e.g., in fewer time steps).
Fig. 2 shows an example architecture of the planning neural network 200 included in the action selection system 100 described with reference to FIG. 1.
The planning neural network 200 is configured to process: (i) a set of plan embeddings 116 representing previous interactions of the agent with the environment, and (ii) a current observation 110 representing a current state of the environment to generate an implicit plan 118 for completing the objectives of the agent.
Planning neural network 200 includes an attention subnetwork 202 and a fusion subnetwork 206, which will be described in more detail below.
The attention subnetwork 202 is configured to iteratively (i.e., at each of one or more iterations) update the plan embeddings 116 to generate updated plan embeddings 204. More specifically, the attention subnetwork 202 iteratively updates the plan embeddings 116 by processing the plan embeddings 116 using a sequence of one or more "attention blocks". Each attention block is a set of one or more neural network layers configured to receive a set of current plan embeddings, update the current plan embeddings by applying an attention operation to them, and output the updated plan embeddings. The first attention block may receive the initial plan embeddings 116, each subsequent attention block may receive the plan embeddings output by the previous attention block, and the last attention block may output the updated plan embeddings 204 (i.e., defining the output of the attention subnetwork 202).
Each attention block updates the plan embeddings by applying attention operations to the plan embeddings, and in particular, updates each plan embedding using self-attention over the plan embeddings. To update a given plan embedding using self-attention over the plan embeddings, the attention block may determine a respective "attention weight" between the given plan embedding and each plan embedding in the set of plan embeddings. The attention block may then update the given plan embedding using (i) the attention weights and (ii) the plan embeddings.
For example, if the plan embeddings 116 are denoted by $\{p_i\}_{i=1}^{N}$, where $N$ is the number of plan embeddings, then to update the plan embedding $p_i$ the attention block may determine attention weights $\{a_{i,j}\}_{j=1}^{N}$, where $a_{i,j}$ represents the attention weight between $p_i$ and $p_j$, for example as:

$$e_{i,j} = \frac{(W_q p_i)^{\top} (W_k p_j)}{c} \qquad (1)$$

$$a_{i,j} = \operatorname{softmax}_j\left(e_{i,j}\right) \qquad (2)$$

where $W_q$ and $W_k$ are learned parameter matrices, $\operatorname{softmax}(\cdot)$ denotes the softmax normalization operation (here over the index $j$), and $c$ is a constant. Using the attention weights, the attention block can update the plan embedding $p_i$ as:

$$p_i \leftarrow \sum_{j=1}^{N} a_{i,j}\, W_v\, p_j \qquad (3)$$

where $W_v$ is a learned parameter matrix. ($W_q p_i$ may be referred to as the "query embedding" of plan embedding $p_i$, $W_k p_j$ may be referred to as the "key embedding" of plan embedding $p_j$, and $W_v p_j$ may be referred to as the "value embedding" of plan embedding $p_j$.) The parameter matrices $W_q$ (the "query embedding matrix"), $W_k$ (the "key embedding matrix"), and $W_v$ (the "value embedding matrix") are trainable parameters of the attention block. In general, each attention block in the attention subnetwork 202 can update the plan embeddings using query, key, and value embedding matrices with different parameter values.

Optionally, each attention block may have multiple "heads", each of which generates a respective updated plan embedding corresponding to each input plan embedding, i.e., such that each input plan embedding is associated with multiple updated plan embeddings. For example, each head may generate an updated plan embedding based on a different set of the parameter matrices $W_q$, $W_k$, and $W_v$ described with reference to equations (1)-(3). An attention block with multiple heads may implement a "gating" operation to combine the updated plan embeddings generated by the heads for an input plan embedding, i.e., to generate a single updated plan embedding corresponding to each input plan embedding. For example, the attention block may process the input plan embeddings using one or more neural network layers (e.g., fully-connected neural network layers) to generate a respective gating value for each head. The attention block may then combine the updated plan embeddings corresponding to an input plan embedding in accordance with the gating values. For example, the attention block may generate an updated plan embedding for the input plan embedding $p_i$ as:

$$p_i \leftarrow \sum_{k} \alpha_k\, p_i^{(k)} \qquad (4)$$

where $k$ indexes the heads, $\alpha_k$ is the gating value for head $k$, and $p_i^{(k)}$ is the updated plan embedding generated by head $k$ for the input plan embedding $p_i$. The attention operations described with reference to equations (1)-(4) may be referred to as a "multi-head key-query-value attention operation".
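The following NumPy sketch implements the multi-head key-query-value self-attention update of equations (1)-(4) over a small set of plan embeddings, applying the same attention function for several iterations. The matrix shapes, the number of heads, and the sigmoid gating network are assumptions made for the example.

```python
# Illustrative NumPy sketch of the multi-head key-query-value self-attention
# update of equations (1)-(4). Matrix shapes, the number of heads, and the
# sigmoid gating network are assumptions made for the example.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_update(P, Wq, Wk, Wv, c):
    """Equations (1)-(3): update each plan embedding p_i as a weighted sum of
    value embeddings W_v p_j, with weights from query-key dot products."""
    Q, K, V = P @ Wq.T, P @ Wk.T, P @ Wv.T      # rows are per-plan embeddings
    A = softmax(Q @ K.T / c, axis=-1)           # A[i, j] = a_{i,j}
    return A @ V

def multi_head_update(P, heads, Wg):
    """Equation (4): gate and sum the per-head updates for each plan embedding."""
    alphas = 1.0 / (1.0 + np.exp(-(P @ Wg.T)))  # one gating value per head, per embedding
    updates = [self_attention_update(P, *h) for h in heads]
    return sum(alphas[:, [k]] * updates[k] for k in range(len(heads)))

rng = np.random.default_rng(0)
N, D, c, num_heads = 5, 8, np.sqrt(8.0), 2
P = rng.normal(size=(N, D))                     # plan embeddings
heads = [tuple(rng.normal(scale=0.3, size=(D, D)) for _ in range(3)) + (c,)
         for _ in range(num_heads)]             # per-head (Wq, Wk, Wv, c)
Wg = rng.normal(scale=0.3, size=(num_heads, D))
for _ in range(3):                              # several iterations of the same attention function
    P = multi_head_update(P, heads, Wg)
print(P.shape)                                  # (5, 8)
```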
By updating the plan embeddings 116 using self-attention operations, the planning neural network 200 uses learned operations to share information between the plan embeddings 116, thereby enriching each plan embedding with informational content from the other plan embeddings. Enriching the informational content of the plan embeddings 116 may enable the planning neural network 200 to generate more informative implicit plans 118, which enables the agent to accomplish goals in the environment more efficiently, e.g., in fewer time steps.
In some implementations, as described with reference to FIG. 1, the action selection system 100 generates plan embeddings 116 corresponding only to a proper subset of the experience tuples stored in memory, e.g., only to the experience tuples for the L most recent time steps. To enable the planning neural network 200 to incorporate information from all of the stored experience tuples (i.e., not only the L most recent experience tuples), the action selection system 100 may generate a respective "static" embedding corresponding to each experience tuple stored in memory. The action selection system 100 may generate the static embedding corresponding to an experience tuple, e.g., by concatenating an embedding of the goal observation to the experience tuple. After generating the static embeddings for the experience tuples stored in memory, the action selection system 100 may then provide the static embeddings to the planning neural network 200 in addition to the plan embeddings 116.
In addition to using self-attention over the plan embeddings themselves (as described above), each attention block of the attention subnetwork 202 may update the plan embeddings using cross-attention over the static embeddings. For example, each attention block may first update the plan embeddings using cross-attention over the static embeddings, and then update the plan embeddings using self-attention over the plan embeddings. Typically, the attention blocks of the attention subnetwork 202 do not update the static embeddings, i.e., the static embeddings remain fixed even as the plan embeddings are updated by each attention block of the attention subnetwork 202.
To update a given plan embedding using cross-attention over the static embeddings, an attention block may determine a respective attention weight between the given plan embedding and each static embedding. The attention block may then update the given plan embedding using (i) the attention weights and (ii) the static embeddings. For example, if the plan embeddings 116 are denoted by $\{p_i\}_{i=1}^{N}$ and the static embeddings are denoted by $\{s_j\}_{j=1}^{M}$, then to update the plan embedding $p_i$ the attention block may determine attention weights $\{a_{i,j}\}_{j=1}^{M}$, where $a_{i,j}$ denotes the attention weight between $p_i$ and $s_j$, for example as:

$$e_{i,j} = \frac{(W_q p_i)^{\top} (W_k s_j)}{c} \qquad (5)$$

$$a_{i,j} = \operatorname{softmax}_j\left(e_{i,j}\right) \qquad (6)$$

where $W_q$ and $W_k$ are learned parameter matrices, $\operatorname{softmax}(\cdot)$ denotes the softmax normalization operation, and $c$ is a constant. Using the attention weights, the attention block can update the plan embedding $p_i$ as:

$$p_i \leftarrow \sum_{j=1}^{M} a_{i,j}\, W_v\, s_j \qquad (7)$$

where $W_v$ is a learned parameter matrix. Optionally, the attention block may have multiple heads that generate multiple updated plan embeddings corresponding to each input plan embedding using cross-attention over the static embeddings. As described above, the attention block may combine the multiple updated plan embeddings corresponding to each input plan embedding to generate a single updated plan embedding corresponding to each input plan embedding.
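A corresponding sketch of the cross-attention update of equations (5)-(7) is shown below: the queries come from the plan embeddings, while the keys and values come from the static embeddings, which are left unchanged. The shapes are again illustrative assumptions.

```python
# Illustrative sketch of the cross-attention update of equations (5)-(7):
# queries come from the plan embeddings, keys and values from the static
# embeddings, which are not themselves updated. Shapes are assumed.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_update(P, S, Wq, Wk, Wv, c):
    """Update each plan embedding p_i as a weighted sum of value embeddings
    W_v s_j of the static embeddings."""
    Q, K, V = P @ Wq.T, S @ Wk.T, S @ Wv.T
    A = softmax(Q @ K.T / c, axis=-1)   # A[i, j] = a_{i,j} between p_i and s_j
    return A @ V                        # static embeddings S are left unchanged

rng = np.random.default_rng(1)
N, M, D = 5, 20, 8                      # 5 plan embeddings, 20 static embeddings
P, S = rng.normal(size=(N, D)), rng.normal(size=(M, D))
Wq, Wk, Wv = (rng.normal(scale=0.3, size=(D, D)) for _ in range(3))
P = cross_attention_update(P, S, Wq, Wk, Wv, c=np.sqrt(D))
print(P.shape)                          # (5, 8)
```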
By updating the plan embeddings 116 using cross-attention over the static embeddings for every experience tuple in the external memory, the planning neural network 200 can efficiently capture information from all of the agent's previous interactions with the environment, generating a more informative implicit plan. The action selection system can use the more informative implicit plan to select actions that enable the agent to complete the goal in the environment more effectively, e.g., in fewer time steps. The planning neural network can also significantly reduce consumption of computational resources (e.g., memory and computing power), e.g., by avoiding updating the static embeddings using attention operations.
In addition to the attention operations described above, each attention block may implement any other suitable neural network operations to update the current plan embeddings. For example, each attention block may be a residual block that processes the current plan embeddings $B_i$ to generate updated plan embeddings $B_{i+1}$, as follows:

$$B_{i+1} = f\big(B_i + \mathrm{MHA}(\mathrm{LayerNorm}(B_i))\big) \qquad (8)$$

where $\mathrm{LayerNorm}(\cdot)$ denotes a layer normalization operation, $\mathrm{MHA}(\cdot)$ denotes a multi-head attention operation (including self-attention over the plan embeddings and, optionally, cross-attention over the static embeddings), and $f(\cdot)$ denotes a linear projection operation.
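The residual block of equation (8) can be sketched as follows, with a single-head attention function standing in for MHA(·); the stand-in and the dimensions are assumptions for the example.

```python
# Illustrative sketch of the residual block of equation (8):
# B_{i+1} = f(B_i + MHA(LayerNorm(B_i))). The single-head attention used as a
# stand-in for MHA(.), and all dimensions, are assumptions for the example.
import numpy as np

def layer_norm(X, eps=1e-5):
    mean = X.mean(axis=-1, keepdims=True)
    var = X.var(axis=-1, keepdims=True)
    return (X - mean) / np.sqrt(var + eps)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mha(B, Wq, Wk, Wv, c):
    """Stand-in for the multi-head attention operation (single head here)."""
    A = softmax((B @ Wq.T) @ (B @ Wk.T).T / c, axis=-1)
    return A @ (B @ Wv.T)

def residual_attention_block(B, Wq, Wk, Wv, Wf, c):
    """Equation (8): layer norm, attention, skip connection, linear projection f."""
    return (B + mha(layer_norm(B), Wq, Wk, Wv, c)) @ Wf.T

rng = np.random.default_rng(2)
N, D = 5, 8
B = rng.normal(size=(N, D))
Wq, Wk, Wv, Wf = (rng.normal(scale=0.3, size=(D, D)) for _ in range(4))
B_next = residual_attention_block(B, Wq, Wk, Wv, Wf, c=np.sqrt(D))
print(B_next.shape)  # (5, 8)
```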
The fusion subnetwork 206 is configured to process: (i) the updated plan embeddings 204, and (ii) the current observation 110, to generate the implicit plan 118 for achieving the goal. In general, the fusion subnetwork 206 can have any suitable neural network architecture that enables it to perform its described functions, including any suitable neural network layers (e.g., convolutional layers or fully-connected layers) connected in any suitable configuration (e.g., as a linear sequence of layers).
For example, to generate the implicit plan 118, the fusion subnetwork 206 can generate an embedding of the current observation 110, e.g., by processing the current observation 110 using an embedding neural network. The fusion subnetwork 206 may then append (concatenate) the embedding of the current observation 110 to each updated plan embedding 204 to generate a respective "combined" embedding corresponding to each updated plan embedding 204. The fusion subnetwork 206 can process each combined embedding using one or more neural network layers (e.g., fully-connected layers) to generate a respective "transformed" embedding corresponding to each updated plan embedding 204. The fusion subnetwork 206 may generate the implicit plan by applying a pooling operation to the transformed embeddings. The pooling operation may be any suitable operation that, when applied to the transformed embeddings, generates an implicit plan having a dimensionality that is independent of the number of transformed embeddings. For example, the pooling operation may be a feature-wise max pooling operation, i.e., the implicit plan is defined to have the same dimensionality as each transformed embedding, and each entry of the implicit plan is defined as the maximum of the corresponding entries of the transformed embeddings.
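A minimal sketch of this fusion step, assuming a two-layer fully-connected network per combined embedding followed by feature-wise max pooling, is shown below. The sizes and specific layers are illustrative assumptions.

```python
# Illustrative sketch of the fusion subnetwork: the current-observation
# embedding is appended to each updated plan embedding, each combined embedding
# is processed by a small fully-connected network, and feature-wise max pooling
# produces the implicit plan. Sizes and the two-layer network are assumptions.
import numpy as np

rng = np.random.default_rng(3)
N, PLAN_DIM, OBS_DIM, OUT_DIM = 5, 8, 6, 16

W1 = rng.normal(scale=0.3, size=(OUT_DIM, PLAN_DIM + OBS_DIM))
W2 = rng.normal(scale=0.3, size=(OUT_DIM, OUT_DIM))

def fuse(updated_plans, obs_embedding):
    combined = np.concatenate(
        [updated_plans, np.tile(obs_embedding, (updated_plans.shape[0], 1))], axis=-1)
    transformed = np.maximum(0.0, combined @ W1.T) @ W2.T   # per-embedding MLP
    return transformed.max(axis=0)                          # feature-wise max pooling

implicit_plan = fuse(rng.normal(size=(N, PLAN_DIM)), rng.normal(size=OBS_DIM))
print(implicit_plan.shape)  # (16,) -- independent of the number of plan embeddings
```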
The parameters of the planning neural network 200, including the parameters of the attention subnetwork 202 (including its constituent attention blocks) and of the fusion subnetwork 206, are jointly trained by the training engine using reinforcement learning techniques (as described with reference to FIG. 1) along with the parameters of the action selection neural network 120. In particular, the gradient of the reinforcement learning objective function is back-propagated through the action selection neural network into the fusion subnetwork and the attention subnetwork of the planning neural network. These gradients are used to adjust the parameters of the planning neural network so that it generates implicit plans encoding information that, when processed by the action selection neural network, results in the selection of actions that allow the agent to effectively complete the goal in the environment.
FIG. 3 is a flow diagram of an example process 300 for selecting actions to be performed by an agent interacting with an environment to achieve a goal. For convenience, process 300 will be described as being performed by a system of one or more computers located at one or more locations. For example, an action selection system, such as action selection system 100 of FIG. 1, suitably programmed in accordance with the present description, may perform process 300.
The system generates a respective plan embedding (302) corresponding to each of a plurality of experience tuples in the external memory. Each experience tuple characterizes an interaction of the agent with the environment at a respective previous time step.
The system processes the planning embedding using a planning neural network to generate an implicit plan (304) for achieving the goal.
The system selects, using the implicit plan, an action to be performed by the agent at the time step (306).
FIG. 4 is a schematic diagram of an example of the system of FIG. 1, where elements similar to those previously described are indicated by similar reference numerals.
This specification uses the term "configured to" in the systems and computer program components. For a system of one or more computers configured to perform a particular operation or action, it is meant that the system has installed thereon software, firmware, hardware, or a combination thereof that in operation causes the system to perform the operation or action. For one or more computer programs that are to be configured to perform a particular operation or action, it is meant that the one or more programs include instructions that, when executed by a data processing apparatus, cause the apparatus to perform the operation or action.
Embodiments of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their equivalents, or in combinations of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible, non-transitory storage medium for execution by, or to control the operation of, data processing apparatus. The computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access storage device, or a combination of one or more of them. Alternatively or additionally, the program instructions may be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by the data processing apparatus.
The term "data processing apparatus" refers to data processing hardware and includes various apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. An apparatus may also be or further comprise special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit). In addition to hardware, an apparatus can optionally include code that creates an execution environment for a computer program, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.
A computer program, which may also be referred to or described as a program, software application, app, module, software module, script, or code, may be written in any form of programming language, including compiled or interpreted languages, declarative or procedural languages; it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data, such as one or more scripts stored in a markup language document, a single file dedicated to the program in question, or multiple coordinated files, such as files that store one or more modules, sub programs, or portions of code. A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a data communication network.
Similarly, in this specification, the term "engine" is used broadly to refer to a software-based system, subsystem, or process that is programmed to perform one or more particular functions. Typically, the engine will be implemented as one or more software modules or components installed on one or more computers in one or more locations. In some cases, one or more computers will be dedicated to a particular engine; in other cases, multiple engines may be installed and run on the same computer or computers.
The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by, and in combination with, special purpose logic circuitry, e.g., an FPGA or an ASIC.
A computer suitable for executing a computer program may be based on a general purpose or a special purpose microprocessor or both, or on any other type of central processing unit. Generally, a central processing unit will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data. The central processing unit and the memory can be supplemented by, or incorporated in, special purpose logic circuitry. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. However, a computer need not have such devices. Moreover, the computer may be embedded in another device, e.g., a mobile telephone, a Personal Digital Assistant (PDA), a mobile audio or video player, a game player, a Global Positioning System (GPS) receiver, or a portable storage device, e.g., a Universal Serial Bus (USB) flash drive, to name a few.
Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks.
To provide for interaction with a user, embodiments of the subject matter described in this specification can be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user can provide input to the computer. Other kinds of devices may also be used to provide interaction with the user, for example, feedback provided to the user may be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user may be received in any form, including acoustic, speech, or tactile input. In addition, the computer may interact with the user by sending and receiving documents to and from the device used by the user, for example, by sending web pages to a web browser on the user's device in response to requests received from the web browser. In addition, the computer may interact with the user by sending text messages or other forms of messages to a personal device (e.g., a smartphone running a messaging application) and receiving response messages from the user.
The data processing apparatus used to implement a machine learning model may also include, for example, dedicated hardware accelerator units for processing common and compute-intensive parts of machine learning training or production (i.e., inference) workloads.
The machine learning model may be implemented and deployed using a machine learning framework (e.g., a TensorFlow framework, a Microsoft Cognitive Toolkit framework, an Apache Singa framework, or an Apache MXNet framework).
Embodiments of the subject matter described in this specification can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the subject matter described in this specification), or any combination of one or more such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a Local Area Network (LAN) and a Wide Area Network (WAN), e.g., the internet.
The computing system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. In some embodiments, the server sends data, such as an HTML page, to the user device, such as for the purpose of displaying data to and receiving user input from a user interacting with the device as a client. Data generated at the user device, e.g., a result of the user interaction, may be received at the server from the device.
While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any inventions or of what may be claimed, but rather as descriptions of features specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Furthermore, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
Similarly, while operations are depicted in the drawings and recited in the claims in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Specific embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In certain situations, multitasking and parallel processing may be advantageous.

Claims (19)

1. A method performed by one or more data processing apparatus for selecting an action to be performed by an agent interacting with an environment to achieve a goal, the method comprising:
generating a respective plan embedding corresponding to each of a plurality of experience tuples in an external memory, wherein each experience tuple characterizes an interaction of an agent with an environment at a respective previous time step;
processing the planning embedding using a planning neural network to generate an implicit plan that achieves the goal; and
selecting, using the implicit plan, an action to be performed by the agent at a time step.
2. The method of claim 1, wherein processing the plan embeddings using the planning neural network to generate the implicit plan for achieving the goal comprises:
iteratively updating the plan embeddings, including, at each iteration of a plurality of iterations, updating each plan embedding using attention over the plan embeddings; and
generating the implicit plan using the plan embeddings.
3. The method of claim 2, further comprising generating a respective static embedding corresponding to each of the plurality of experience tuples in the external memory;
wherein iteratively updating the plan embeddings further comprises, at each iteration of the plurality of iterations:
updating each plan embedding using attention over the static embeddings.
4. The method of claim 3, wherein:
generating the respective plan embedding corresponding to each of the plurality of experience tuples in the external memory comprises:
generating respective plan embeddings only for experience tuples in the external memory that characterize interactions of the agent with the environment over a predetermined number of most recent time steps; and
generating the respective static embedding corresponding to each of the plurality of experience tuples in the external memory comprises:
generating a respective static embedding for each experience tuple in the external memory.
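Illustrative sketch (not part of the claims): the split described in claims 3-4, assuming hypothetical embedding functions plan_embed and static_embed and a hypothetical value for the predetermined number of recent time steps.

```python
def build_embeddings(memory, plan_embed, static_embed, recent_window: int = 64):
    # Plan embeddings: only for experience tuples from the most recent time steps.
    plan_embeddings = [plan_embed(t) for t in memory[-recent_window:]]
    # Static embeddings: one for every experience tuple in the external memory.
    static_embeddings = [static_embed(t) for t in memory]
    return plan_embeddings, static_embeddings
```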
5. The method of any of claims 2-4, wherein updating each plan embedding using attention over the plan embeddings comprises:
processing the plan embeddings using a residual neural network block configured to apply a series of operations to the plan embeddings, the series of operations comprising: (i) a layer normalization operation, (ii) an attention operation, and (iii) a linear projection operation.
6. The method of claim 5, wherein the attention operation comprises a multi-headed key-query-value attention operation over the plan embeddings.
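Illustrative sketch (not part of the claims): one way to realize the residual block of claims 5-6 in PyTorch, assuming layer normalization for the normalization operation; embed_dim and num_heads are hypothetical hyperparameters. For claim 3's update over the static embeddings, the same block could instead use the static embeddings as the keys and values of the attention operation.

```python
import torch
from torch import nn

class ResidualAttentionBlock(nn.Module):
    """Applies normalization, multi-headed key-query-value attention,
    and a linear projection, with a residual connection."""

    def __init__(self, embed_dim: int = 128, num_heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(embed_dim)
        self.attention = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, plan_embeddings: torch.Tensor) -> torch.Tensor:
        # plan_embeddings: [batch, num_embeddings, embed_dim]
        x = self.norm(plan_embeddings)
        # Self-attention: queries, keys, and values all come from the plan embeddings.
        attended, _ = self.attention(x, x, x)
        # Linear projection, then a residual connection so each embedding is updated rather than replaced.
        return plan_embeddings + self.proj(attended)
```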
7. The method of any of claims 2-6, wherein generating the implicit plan using the plan embeddings comprises:
generating the implicit plan based on: (i) the plan embeddings, and (ii) a current observation characterizing a current state of the environment.
8. The method of claim 7, wherein generating the implicit plan based on: (i) the plan embeddings and (ii) the current observation characterizing the current state of the environment comprises:
for each plan embedding:
appending a representation of the current observation to the plan embedding to generate a combined embedding; and
processing the combined embedding using one or more neural network layers to generate a transformed embedding; and
generating the implicit plan based on the transformed embeddings.
9. The method of claim 8, wherein generating the implicit plan based on the transformed embeddings comprises:
generating the implicit plan by applying a pooling operation to the transformed embeddings.
10. The method of claim 9, wherein the pooling operation is a feature-wise max pooling operation.
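Illustrative sketch (not part of the claims): claims 7-10 combined, assuming PyTorch; the two-layer network applied to each combined embedding and its sizes are hypothetical choices.

```python
import torch
from torch import nn

class ImplicitPlanHead(nn.Module):
    def __init__(self, embed_dim: int = 128, obs_dim: int = 64):
        super().__init__()
        # One or more neural network layers applied to each combined embedding.
        self.layers = nn.Sequential(
            nn.Linear(embed_dim + obs_dim, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, plan_embeddings: torch.Tensor, observation: torch.Tensor) -> torch.Tensor:
        # plan_embeddings: [num_embeddings, embed_dim]; observation: [obs_dim]
        obs = observation.unsqueeze(0).expand(plan_embeddings.shape[0], -1)
        # Append a representation of the current observation to every plan embedding.
        combined = torch.cat([plan_embeddings, obs], dim=-1)
        transformed = self.layers(combined)
        # Feature-wise max pooling collapses the transformed embeddings
        # into a single implicit-plan vector.
        implicit_plan, _ = transformed.max(dim=0)
        return implicit_plan
```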
11. The method of any of claims 1-10, wherein selecting, using the implicit plan, the action to be performed by the agent at the time step comprises:
processing an input comprising the implicit plan using an action selection neural network to generate an action selection output; and
selecting the action based on the action selection output.
12. The method of claim 11, wherein the action selection output comprises a respective score for each action in a set of possible actions that can be performed by the agent, and selecting the action based on the action selection output comprises sampling an action in accordance with the action scores.
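Illustrative sketch (not part of the claims): the action selection step of claims 11-12, assuming the action selection neural network is a simple linear layer producing one unnormalized score (logit) per possible action; the sizes are hypothetical.

```python
import torch
from torch import nn

num_actions = 8                            # hypothetical size of the agent's action set
action_net = nn.Linear(128, num_actions)   # stand-in for the action selection neural network

def select_from_plan(implicit_plan: torch.Tensor) -> int:
    # Action selection output: a respective score for each possible action.
    scores = action_net(implicit_plan)
    # Sample an action in accordance with the action scores.
    return int(torch.distributions.Categorical(logits=scores).sample())
```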
13. The method of any of claims 11-12, wherein the action selection neural network and the planning neural network are trained using reinforcement learning techniques to maximize a cumulative measure of rewards received by the agent as a result of interacting with the environment.
14. The method of any of claims 1-13, wherein each experience tuple comprises: (i) a representation characterizing an observation of the environmental state at a respective previous time step; (ii) a representation of an action performed by the agent at a respective previous time step; and (iii) a representation that characterizes an observation of the environmental state after the agent performed the action at the respective previous time step.
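Illustrative sketch (not part of the claims): one possible in-memory layout of the experience tuple of claim 14; the field names are hypothetical.

```python
from typing import NamedTuple
import torch

class ExperienceTuple(NamedTuple):
    observation: torch.Tensor       # observation of the environment state at the time step
    action: torch.Tensor            # action performed by the agent at that time step
    next_observation: torch.Tensor  # observation of the environment state after the action
```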
15. The method of any of claims 1-14, wherein generating the respective plan embedding corresponding to an experience tuple comprises:
appending a representation of a goal state of the environment to the experience tuple.
16. The method of claim 15, wherein the goal to be achieved by the agent comprises transitioning the environment into the goal state.
17. The method of any one of claims 1-16, further comprising: after selecting the action to be performed by the agent at the time step using the implicit plan, storing an experience tuple characterizing the interaction of the agent with the environment at the current time step in the external memory.
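Illustrative sketch (not part of the claims): the write step of claim 17, assuming the external memory is a simple append-only list of (observation, action, next observation) tuples.

```python
external_memory = []  # hypothetical append-only external memory

def store_transition(observation, action, next_observation):
    # After the agent acts, record the interaction at the current time step
    # so it can be embedded as a plan embedding at later time steps.
    external_memory.append((observation, action, next_observation))
```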
18. A system, comprising:
one or more computers; and
one or more storage devices communicatively coupled to one or more computers, wherein the one or more storage devices store instructions that, when executed by the one or more computers, cause the one or more computers to perform the operations of the respective methods of any of claims 1-17.
19. One or more non-transitory computer storage media storing instructions that, when executed by one or more computers, cause the one or more computers to perform the operations of the respective methods of any of claims 1-17.
CN202180013484.1A 2020-02-07 2021-02-08 Generating implicit plans that achieve a goal in an environment using attention operations embedded to the plans Pending CN115066686A (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US202062971444P 2020-02-07 2020-02-07
US62/971,444 2020-02-07
PCT/EP2021/052983 WO2021156513A1 (en) 2020-02-07 2021-02-08 Generating implicit plans for accomplishing goals in an environment using attention operations over planning embeddings

Publications (1)

Publication Number Publication Date
CN115066686A true CN115066686A (en) 2022-09-16

Family

ID=74572787

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202180013484.1A Pending CN115066686A (en) 2020-02-07 2021-02-08 Generating implicit plans that achieve a goal in an environment using attention operations embedded to the plans

Country Status (4)

Country Link
US (1) US20230101930A1 (en)
EP (1) EP4085385B1 (en)
CN (1) CN115066686A (en)
WO (1) WO2021156513A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210295130A1 (en) * 2020-03-19 2021-09-23 Mohammad Rasoolinejad Artificial intelligent agent rewarding method determined by social interaction with intelligent observers
US20220051106A1 (en) * 2020-08-12 2022-02-17 Inventec (Pudong) Technology Corporation Method for training virtual animal to move based on control parameters
US12050640B2 (en) 2021-11-16 2024-07-30 Samsung Electronics Co., Ltd. Probabilistic procedure planning for instructional videos

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11580429B2 (en) * 2018-05-18 2023-02-14 Deepmind Technologies Limited Reinforcement learning using a relational network for generating data encoding relationships between entities in an environment

Also Published As

Publication number Publication date
EP4085385B1 (en) 2024-09-11
US20230101930A1 (en) 2023-03-30
EP4085385A1 (en) 2022-11-09
WO2021156513A1 (en) 2021-08-12

Similar Documents

Publication Publication Date Title
US11886997B2 (en) Training action selection neural networks using apprenticeship
US10860927B2 (en) Stacked convolutional long short-term memory for model-free reinforcement learning
US11714990B2 (en) Jointly learning exploratory and non-exploratory action selection policies
CN112292693A (en) Meta-gradient update of reinforcement learning system training return function
CN112119409A (en) Neural network with relational memory
US10872294B2 (en) Imitation learning using a generative predecessor neural network
US20220366246A1 (en) Controlling agents using causally correct environment models
CN112272831A (en) Reinforcement learning system including a relationship network for generating data encoding relationships between entities in an environment
CN115066686A (en) Generating implicit plans that achieve a goal in an environment using attention operations embedded to the plans
EP4007976B1 (en) Exploration using hypermodels
CN118043824A (en) Retrieval enhanced reinforcement learning
WO2023222887A1 (en) Intra-agent speech to facilitate task learning
CN112334914B (en) Imitation learning using a generative leading neural network
US20230214649A1 (en) Training an action selection system using relative entropy q-learning
US20230061411A1 (en) Autoregressively generating sequences of data elements defining actions to be performed by an agent
US20240104379A1 (en) Agent control through in-context reinforcement learning
US20240281654A1 (en) Autoregressively generating sequences of data elements defining actions to be performed by an agent
WO2023222885A1 (en) Large-scale retrieval augmented reinforcement learning
CN115315708A (en) Generating spatial embedding by integrating proxy motion and optimizing prediction objectives

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination