CN117859135A - Autoregressively generating a sequence of data elements defining an action to be performed by an agent

Info

Publication number: CN117859135A
Application number: CN202280057653.6A
Authority: CN (China)
Other languages: Chinese (zh)
Prior art keywords: sequence, data elements, current, action, agent
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Inventors: S·E·里德, K·佐尔纳, E·帕里索托, T·埃利兹, A·诺威科夫, J·W·雷, M·M·R·丹尼尔, J·F·戈麦斯德弗雷塔斯, O·文亚尔斯, S·戈麦斯, A·D·爱德华兹, J·布鲁斯, G·巴瑟-玛伦
Current assignee: DeepMind Technologies Ltd (the listed assignee may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original assignee: DeepMind Technologies Ltd
Application filed by DeepMind Technologies Ltd
Priority claimed from PCT/EP2022/072731 (WO2023025607A1)
Abstract

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for using an action selection neural network to select actions to be performed by an agent to interact with an environment. In one aspect, a method includes, at each time step in a sequence of time steps: generating as a sequence of data elements a current representation of the state of the task performed in the environment by the agent up to said current time step; autoregressively generating a sequence of data elements representing a current action to be performed by the agent at the current time step; and after autoregressively generating the sequence of data elements representing the current action, causing the agent to perform the current action at the current time step.

Description

Autoregressively generating a sequence of data elements defining an action to be performed by an agent
Technical Field
The present description relates to processing data using a machine learning model.
Background
A machine learning model receives an input and generates an output, e.g., a predicted output, based on the received input. Some machine learning models are parametric models and generate an output based on the received input and on values of the parameters of the model.
Some machine learning models are deep models that employ multiple layers of models to generate an output for a received input. For example, a deep neural network is a deep machine learning model that includes an output layer and one or more hidden layers, each of which applies a nonlinear transformation to a received input to generate an output.
Disclosure of Invention
The specification describes an action selection system implemented as a computer program on one or more computers in one or more locations for controlling an agent interacting with an environment to perform tasks.
Throughout the specification, a "data element" may refer to, for example, a value (e.g., an integer or floating point value) or an embedding. Embedding refers to an ordered set of index values, e.g., vectors, matrices, or other tensors of values.
According to a first aspect, there is provided a method performed by one or more computers for selecting an action to be performed by an agent to interact with an environment using an action selection neural network, in particular a trained action selection neural network. The method includes, at each time step in the sequence of time steps: generating, e.g. from a current observation characterizing the state of the environment, a current representation of the state of the task performed in the environment by the agent up to the current time step, as a (first) sequence of data elements. The method further comprises autoregressively generating a (second) sequence of data elements representing a current action to be performed by the agent at the current time step. For example, the (second) sequence of data elements may comprise a plurality of action data elements that together represent an action to be performed by the agent. In an embodiment, autoregressively generating the (second) sequence of data elements includes, for each position in the sequence of data elements representing the current action, starting from a first position: processing the current representation of the state of the task using the action selection neural network to generate a score distribution over a set of possible data elements; selecting a data element for that position in the sequence of data elements representing the current action according to the score distribution; and updating the current representation of the state of the task by concatenating the selected (action) data element for the position to the current representation of the state of the task. That is, the updated current representation of the task state, i.e. the (first) sequence of data elements, is used in the autoregressive generation of the (second) sequence of data elements, in particular when processing the current (now updated) representation of the task state to select the (action) data element for the next position. After autoregressively generating the sequence of data elements representing the current action, the method causes the agent to perform the current action at the current time step. The method may then update the current representation of the task state using the current observation at the next time step.
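As an illustrative sketch of this autoregressive generation (not part of the claims; the greedy selection rule, the element count, and the score function are assumptions standing in for the action selection neural network):

    import numpy as np

    def generate_action(task_state, score_fn, num_action_elements, num_possible=256):
        # task_state: list of integer data elements representing the task state so far.
        # score_fn:   hypothetical stand-in for the action selection neural network; it maps
        #             a sequence of data elements to a score over the possible data elements.
        action = []
        for _ in range(num_action_elements):
            scores = score_fn(task_state)           # score distribution for this position
            element = int(np.argmax(scores))        # greedy selection (sampling is also possible)
            action.append(element)
            task_state = task_state + [element]     # concatenate onto the current task state
        return action, task_state

    # Toy usage with a random scoring stub in place of a trained network.
    rng = np.random.default_rng(0)
    action, new_state = generate_action([3, 17, 42], lambda seq: rng.random(256), 4)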
In some implementations, for each time step in the sequence of time steps, generating a current representation of the task state up to the current time step includes: receiving a current observation characterizing an environmental state of a current time step; generating a representation of the current observation as a sequence of data elements; and including the currently observed representation as a sequence of data elements in the current representation of the state of the task up to the current time step, e.g. by concatenating the (first) sequence of data elements representing the current state of the task with the currently observed representation as a sequence of data elements.
In some implementations, the current observation is defined by a set of values, and generating a representation of the current observation includes: each value in the set of values defining the current observation is concatenated into a sequence of values in a predefined order (i.e., the order in which the values of the observations are defined).
In some implementations, generating the representation of the current observation as a sequence of data elements further includes: discretizing each value in the set of values defining the current observation.
In some implementations, the current observation characterizing the current state of the environment for the current time step includes an image of the environment defined by the pixel array.
In some implementations, generating the representation of the current observation includes: combining a target return to be achieved by the agent's interaction with the environment with the representation of the current observation as a sequence of data elements, wherein the target return defines a cumulative metric of rewards to be achieved as a result of the agent's interaction with the environment.
In some implementations, for each time step subsequent to the first time step in the sequence of time steps, including the currently observed representation as a sequence of data elements in the current representation of the state of the task up to the current time step includes: receiving as a sequence of data elements a representation of the state of the task until a previous time step; and concatenating the representation of the current observation as the sequence of data elements to the representation of the state of the task as the sequence of data elements up to the previous time step to generate a current representation of the state of the task up to the current time step.
In some implementations, for each time step before the current time step, a representation of the state of the task up to the previous time step represents: (i) A respective observation characterizing a state of the environment at the time step, and (ii) a respective action performed by the agent at the time step.
In some implementations, at a first time step in the sequence of time steps, including the representation of the current observation as a sequence of data elements in the current representation of the state of the task up to the current time step includes: receiving a prompt comprising data characterizing a task to be performed by the agent in the environment; generating a representation of the prompt as a sequence of data elements; and concatenating the representation of the current observation as the sequence of data elements to the representation of the prompt as the sequence of data elements to generate a current representation of the state of the task up to the current time step.
In some embodiments, the prompt includes one or more of the following: a demonstration of the task, a target observation characterizing a target state of the environment, or a text sequence in natural language that provides instructions related to the task.
In some implementations, the action selection neural network has been trained based on a set of training examples, wherein for each training example: the training example is represented as a sequence of data elements; at least one data element in the sequence of data elements representing the training example is designated as an action data element; and training the action selection neural network based on the training example includes training the action selection neural network to generate the action data elements included in the training example.
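As a rough sketch of what such training could involve, assuming the network produces per-position log-probabilities and a boolean mask marks the designated action data elements (the array shapes, the mean reduction, and all names here are assumptions):

    import numpy as np

    def action_element_loss(log_probs, targets, is_action):
        # log_probs: [seq_len, vocab] predicted log-probabilities for each position.
        # targets:   [seq_len] integer data elements of the training example.
        # is_action: [seq_len] boolean mask marking the designated action data elements.
        picked = log_probs[np.arange(len(targets)), targets]   # log-prob of each target element
        return -(picked * is_action).sum() / max(is_action.sum(), 1)

    # Toy usage: only the positions marked as action data elements contribute to the loss.
    logits = np.random.default_rng(0).normal(size=(6, 256))
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    loss = action_element_loss(log_probs, targets=np.array([1, 2, 3, 4, 5, 6]),
                               is_action=np.array([0, 0, 0, 0, 1, 1], dtype=bool))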
In some implementations, the set of training examples includes respective training examples from a plurality of different control domains, wherein each control domain is associated with: (i) a corresponding agent, (ii) a corresponding environment, and (iii) a corresponding task, wherein each training example from each control domain characterizes interaction of the corresponding agent with the corresponding environment by performing an action to complete the corresponding task.
In some implementations, the plurality of different control domains includes a first control domain in which observations of the corresponding environment have a first dimension and a second control domain in which observations of the corresponding environment have a second, different dimension.
In some implementations, the plurality of different control domains includes a first control domain in which an action performed by a corresponding agent has a first dimension and a second control domain in which an action performed by a corresponding agent has a second, different dimension.
In some implementations, the set of training examples includes a plurality of language modeling training examples, where each language modeling training example represents a text sequence in a natural language.
In some embodiments, the action selection neural network comprises a plurality of self-attention neural network layers. In general, the self-attention neural network layer has an attention layer input for each element of the input, and is configured to apply an attention mechanism on the attention layer input to generate an attention layer output for each element of the input. Many different attention mechanisms may be used.
In some implementations, for each position starting from a first position in the sequence of data elements representing the current action, selecting the data element for that position includes: the data element with the highest score under the score distribution is selected.
In some embodiments, for each time step in the sequence of time steps, the sequence of data elements representing the task state up to the current time step comprises: a sequence of values; a sequence of embeddings; or a sequence that includes values at some positions and embeddings at other positions.
In some implementations, the agent is a mechanical agent that interacts with the real-world environment. Thus, the selected action may be an action performed by the mechanical agent in the real-world environment, such as an action that causes the mechanical agent to physically manipulate one or more objects in the environment, and the observation characterizing the environmental state may be an observation of the real-world environment. The observation may be a multi-modal observation. The method may use the action selection neural network to perform one or more tasks; a particular advantage of the described system is that the same action selection neural network with the same set of parameters (weights) can be used to perform many different tasks. In some embodiments, the system, particularly the action selection neural network, has 1.2 billion or more learnable parameters; this facilitates the ability to perform a number of different tasks.
In some implementations, the current observation includes an image, and generating the representation of the current observation includes: generating a respective initial tile embedding corresponding to each of a plurality of tiles in the image; and processing the initial tile embeddings using an encoder neural network to generate a respective final tile embedding for each of the plurality of tiles in the image; wherein each final tile embedding is included as a respective data element in the sequence of data elements representing the current observation.
In some implementations, generating the respective initial tile embedding corresponding to a tile in the image includes: generating a pixel embedding representing the pixels in the tile of the image; generating a tile position embedding representing the position of the tile in the image; and generating the initial tile embedding for the tile by combining the pixel embedding and the tile position embedding for the tile.
In some implementations, the encoder neural network includes one or more self-attention neural network layers.
In some implementations, the encoder neural network includes one or more residual blocks.
In some implementations, the agent is a mechanical agent that interacts with the real world environment.
In some implementations, selecting an action to be performed by the mechanical agent includes selecting an action to cause the mechanical agent to physically manipulate one or more objects in the environment.
According to another aspect, there is provided a system comprising: one or more computers; and one or more storage devices communicatively coupled to the one or more computers, wherein the one or more storage devices store instructions that, when executed by the one or more computers, cause the one or more computers to perform the operations of the methods described herein.
One or more non-transitory computer storage media storing instructions that, when executed by one or more computers, cause the one or more computers to perform the operations of the methods described herein.
Particular embodiments of the subject matter described in this specification can be implemented to realize one or more of the following advantages.
The action selection system described in this specification uses an autoregressive action selection neural network operating on sequences of data elements to select actions to be performed by an agent in an environment. In particular, the action selection system represents both observations and actions as sequences of data elements, and uses the action selection neural network to operate on these sequences to autoregressively generate sequences of data elements representing actions to be performed by the agent in the environment. Because the action selection neural network operates on sequences of data elements, it can be trained based on any training example that can be represented as a sequence of data elements. Thus, the action selection neural network may be trained to perform any task based on training examples representing interactions of any agent with any environment, regardless of the respective dimensionalities of the observations of the environment and of the actions performed by the agent.
The action selection system trains the action selection neural network based on a highly diverse set of training examples that represent interactions of a plurality of different agents with a plurality of different environments to perform a plurality of different tasks. Thus, the action selection neural network learns a flexible and transferable understanding of agent control, which enables it to generalize quickly and efficiently to new domains. Specifically, the action selection neural network may perform "few-shot learning," that is, the action selection neural network may be trained to achieve an acceptable level of performance on tasks in a new domain based on training on only a few training examples from the new domain. In some cases, the action selection neural network may perform "zero-shot learning," that is, it may achieve an acceptable level of performance on tasks in a new domain without being trained on any training examples from the new domain. Thus, the action selection system provides a generic model for agent control that is more widely applicable than conventional action selection systems. The action selection system enables more efficient use of computing resources (e.g., memory and computing power) by requiring less training data and fewer training iterations than conventional systems to achieve an acceptable level of performance on controlling agents in new domains.
In addition to training the action selection neural network to perform agent control tasks, the action selection system may also train the action selection neural network to perform language modeling, i.e., by training the action selection neural network based on sequences of data elements representing text in natural language. Training the action selection neural network to perform language modeling may speed up training and improve the performance of the action selection neural network, e.g., by improving the ability of the action selection neural network to implicitly infer the meaning of natural language prompts provided to the action selection neural network.
The details of one or more embodiments of the subject matter of this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.
Drawings
FIG. 1 illustrates an example action selection system.
FIG. 2 shows training examples from different domains.
FIGS. 3A and 3B illustrate operations performed by the action selection system to select actions to be performed by an agent interacting with an environment to complete a task.
FIG. 4 is a flow diagram of an example process for selecting actions to be performed by an agent to interact with an environment at a current time step.
Like reference numbers and designations in the various drawings indicate like elements.
Detailed Description
FIG. 1 illustrates an example action selection system 100. The action selection system 100 is an example of a system implemented as a computer program on one or more computers in one or more locations, in which the systems, components, and techniques described below are implemented.
The system 100 selects an action 102 to be performed by the agent 120 interacting with the environment 118 at each of a plurality of time steps to complete a task in the environment 118.
At each time step, the system 100 receives an observation 116 characterizing the current state of the environment 118, and selects an action 102 to be performed by the agent 120 in response to the observation 116. As described later, the action 102 at a time step may be represented by a sequence of action data elements.
Each time step may be associated with a reward, for example, based on a state of the environment 118 at the time step, an action 102 performed by the agent 120 at the time step, or both. In general, rewards may be expressed as numerical values. The rewards may be based on any event in the environment 118 or aspect of the environment 118. For example, the reward may indicate whether the agent 120 has completed a task in the environment (e.g., navigated to a target location in the environment 118), or the agent's progress toward completing the task. In some implementations, the reward may be a sparse reward having a value of 0 at each time step before the agent completes the task and a value of 1 (or some other positive value) at the time step when the agent completes the task. In some implementations, the rewards may be dense rewards with non-zero values at time steps prior to the agent completing the task, e.g., if the task involves navigating to a target location, the rewards at each time step may vary continuously based on the proximity of the agent to the target location.
The training engine 112 may train the system 100 to select actions that increase the "return" generated by the interaction of the agent 120 with the environment 118 as the agent performs the actions 102 selected by the system 100, as will be described in more detail below. The return refers to a cumulative metric of the rewards generated by the agent's 120 interaction with the environment 118, such as a time-discounted sum of the rewards.
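For example, one such cumulative metric is a time-discounted sum of rewards; a small sketch (the discount factor of 0.99 is an assumption):

    def discounted_return(rewards, discount=0.99):
        # Time-discounted sum of rewards, accumulated from the last time step backwards.
        g = 0.0
        for r in reversed(rewards):
            g = r + discount * g
        return g

    ret = discounted_return([0.0, 0.0, 1.0])   # sparse reward: task completed at the last step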
In some implementations, the environment is a real-world environment and the agent is a mechanical agent that interacts with the real-world environment. For example, an agent may be a robot that interacts with an environment to perform tasks, e.g., locate an object of interest in the environment, move the object of interest to a specified location in the environment, physically manipulate the object of interest in the environment in a specified manner, or navigate to a specified destination in the environment; or the agent may be an autonomous or semi-autonomous land, air or sea vehicle that navigates through the environment to a specified destination in the environment. In particular examples, the agent may be a robot that interacts with the real world environment using a mechanical clamping tool, for example, to stack a set of objects (e.g., boxes) in the environment, or to assemble a set of components (e.g., electronic components).
In these embodiments, the observations may include, for example, one or more of: an image (where the image may be represented, e.g., as an array of pixels), object position data, and sensor data captured as the agent interacts with the environment, such as sensor data from an image sensor, a distance or position sensor, or an actuator.
For example, in the case of a robot, the observations may include data characterizing the current state of the robot, e.g., one or more of the following: joint position, joint velocity, joint force, torque or acceleration, such as gravity compensated torque feedback, and global or relative pose of the item held by the robot.
In the case of a robot or other mechanical agent or vehicle, the observations may similarly include one or more of the following: the position, linear or angular velocity, force, torque or acceleration of one or more parts of the agent, and global or relative pose. Observations may be defined in 1, 2, or 3 dimensions, and may be absolute and/or relative observations.
The observations may also include, for example, data obtained by one or more sensor devices sensing a real world environment; for example, a sensed electronic signal, such as a motor current or temperature signal; and/or image or video data, e.g., from a camera or LIDAR sensor, e.g., data from a sensor of an agent or data from a sensor located separately from an agent in the environment.
In the case of an electronic agent, the observations may include data from one or more sensors monitoring a portion of a plant or service facility, such as current, voltage, power, temperature, and other sensors and/or electronic signals representing the functionality of electronic and/or mechanical items of equipment.
The action may be a control input for controlling the robot or the autonomous or semi-autonomous land, air, or sea vehicle, e.g., a torque or higher-level control command for a joint of the robot, or a torque or higher-level control command for a control surface or other control element of the vehicle.
In other words, the actions may include, for example, position, speed or force/torque/acceleration data of one or more joints of the robot or of a component of another mechanical agent. The actions may additionally or alternatively include electronic control data, such as motor control data, or more generally data for controlling one or more electronic devices within the environment, the control of which has an effect on the observed state of the environment. For example, in the case of autonomous or semi-autonomous land, air, or marine vehicles, the actions may include actions that control navigation (e.g., steering) and movement (e.g., braking and/or acceleration of the vehicle). As described above, an action at a particular time step may have multiple components, each represented by a respective action data element.
In some implementations, the environment is a simulated environment, and the agent is implemented as one or more computers that interact with the simulated environment.
For example, the simulated environment may be a motion simulated environment, such as a driving simulation or a flight simulation, and the agent may be a simulated vehicle navigating through the motion simulation. In these embodiments, the action may be a control input for controlling a simulated user or a simulated vehicle.
In another example, the simulated environment may be a video game and the agent may be a simulated user playing the video game.
In another example, the simulated environment may be a protein folding environment such that each state is a respective state of a protein chain, and the agent is a computer system for determining how to fold the protein chain. In this example, the action is a possible folding action for folding the protein chain, and the task to be performed may include, for example, folding the protein, stabilizing the protein and enabling it to perform a specific biological function.
Generally, in the case of a simulated environment, the observations may include simulated versions of one or more of the previously described observations or observation types, and the actions may include simulated versions of one or more of the previously described actions or action types.
In some cases, the action selection system 100 may be used to control the agent's interactions with a simulated environment, and the training engine 112 may train the parameters of the action selection system based on the agent's interactions with the simulated environment. After training the action selection system based on the agent's interactions with the simulated environment, the agent may be deployed in a real-world environment, and the trained action selection system may be used to control the agent's interactions with the real-world environment. Training the action selection system based on the agent's interactions with the simulated environment (i.e., rather than the real-world environment) may avoid wear and tear on the agent and may reduce the likelihood that the agent damages itself or aspects of its environment by performing poorly selected actions.
In some other applications, the agent may control actions in a real-world environment that includes items of equipment, e.g., in a data center, in a grid mains power or water distribution system, or in a manufacturing plant or service facility. The observations may then relate to the operation of the plant or facility. For example, the observations may include observations of the power or water usage of equipment, or observations of power generation or distribution control, or observations of resource usage or of waste production. The agent may control actions in the environment to perform a task that increases efficiency, e.g., by reducing resource usage, and/or that reduces the environmental impact of operations in the environment, e.g., by reducing waste. The actions may include actions that control or impose operating conditions on items of equipment of the plant/facility, and/or actions that cause changes to settings in the operation of the plant/facility, e.g., to adjust or turn on/off components of the plant/facility.
In some further applications, the environment is a real world environment, and the agent manages task allocation across computing resources (e.g., on the mobile device and/or in the data center). In these implementations, the actions may include assigning tasks to particular computing resources, and the tasks to be performed may include minimizing the time required to complete a set of tasks using the specified computing resources.
As another example, the action may include presenting an advertisement, the observation may include an advertisement impression or click-through count or click-through rate, and the reward may characterize one or more users' previous selections of the item or content. In this example, the tasks to be performed may include maximizing selection of items or content by one or more users.
As another example, the agent may generate actions that represent text sequences in natural language. In some implementations, the task can be, for example, to generate a natural language text sequence in response to an observation that is represented by a natural language text sequence. In some implementations, the task can be, for example, to generate a natural language text sequence that represents instructions for controlling a (real or simulated) physical agent (e.g., "turn left", "accelerate", "activate light", etc.) to perform a task in a (real or simulated) physical environment.
As another example, an agent may generate actions (e.g., in a computer programming language) that represent sequences of computer code. In some implementations, the task can involve receiving an observation defining a natural language description of the desired computer code, and in response, generating a sequence of computer code that fits the natural language description of the desired computer code. In some implementations, a task may involve receiving an observation defining an input sequence of computer code, and in response, generating an output sequence of computer code that is a completion of the input sequence of computer code (e.g., logically expanding the input sequence of computer code).
To select an action 102 to be performed by the agent 120, the system 100 maintains and iteratively updates a current task state 110 represented as a sequence of data elements. The sequence of data elements representing the current task state 110 may be, for example, a sequence of values, a sequence of embeddings, or a sequence that includes values at some positions and embeddings at other positions. At each time step, the current task state 110 represents the state of the task performed by the agent in the environment up to that time step.
Optionally, prior to the first time step (i.e., the first time step in the sequence of time steps in which the agent interacts with the environment to perform a task), the system 100 may initialize the current task state 110 using a "prompt," which may be any suitable data characterizing the task to be performed by the agent 120 in the environment 118. The prompt may be provided to the system 100, for example, by a user of the system 100. Several examples of prompts, and example techniques for representing prompts as sequences of data elements, are described in more detail below.
In some implementations, the prompt may include a demonstration of the task to be performed by the agent in the environment. That is, the prompt may characterize an agent's interaction with the environment over a series of time steps during which the agent progresses toward completing a task in the environment. The demonstration may be defined by a series of "interaction tuples," where each interaction tuple corresponds to a respective time step and represents: the observation of the environment at the time step, the action performed by the agent at the time step, or both.
The prompt can include a demonstration of a task that is different from (but related to) the task to be performed by the agent 120 in the environment 118. For example, if the agent 120 is a robotic agent and the task to be performed by the agent 120 involves grabbing and moving one type of object (e.g., an object having a cube shape), the prompt may define a demonstration of grabbing and moving a different type of object (e.g., an object having a sphere shape).
The prompt may include a demonstration of the task in a different environment than the environment 118 in which the agent 120 is to perform its task. For example, if the agent 120 is a home robot agent and the task to be performed by the agent involves cleaning a room (e.g., a kitchen), the prompt may define a demonstration of an agent cleaning a different room (e.g., a bathroom).
The prompt may include a demonstration of the task performed by an agent other than the agent 120 controlled by the system 100. For example, if the agent 120 is a robotic agent having a robotic arm, the prompt may define a demonstration of an agent performing the task with a robotic arm of a different configuration (e.g., having a different length).
In some implementations, the prompt may include a "target" observation, for example, that characterizes a target state of the environment, such that the agent 120 accomplishes the task by performing an action that transitions the environment to (or to a state related to) the target state. For example, if the agent 120 is a robotic agent and the task to be performed by the agent 120 involves assembling a set of components (e.g., electronic or mechanical components), the target observation may be, for example, an image showing the set of components assembled into a desired configuration.
In some implementations, the prompt can include a text sequence in natural language (e.g., English) that provides instructions related to the task to be performed by the agent 120 in the environment 118. For example, if the agent 120 is a semi-autonomous or fully autonomous vehicle, the prompt may be the word sequence "park the car in the parking space closest to the store entrance" or "get onto the highway and move to the leftmost lane".
In some implementations, the prompt can characterize the task to be performed by the agent 120 in a number of different ways; for example, the prompt can include both a demonstration of the task and a text sequence in natural language that provides instructions related to the task.
The system 100 may represent a prompt as a sequence of data elements in any suitable manner. For example, if the prompt includes a text sequence, the system 100 can represent the text sequence as a sequence of tokens from a predefined set of tokens, and then map each token to a corresponding value according to a predefined mapping. The token set may include, for example, characters, n-grams, word fragments, words, or combinations thereof. Example techniques for representing observations and actions as sequences of data elements are described in more detail below; these may also be applied to represent a demonstration of a task, or a target observation, in the prompt as a sequence of data elements.
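A minimal sketch of this mapping, assuming a simple word-level tokenizer and a vocabulary that assigns each new token the next free integer value (both are illustrative assumptions):

    def prompt_text_to_values(text, vocab):
        # Split the prompt into tokens and map each token to a value via the vocabulary.
        tokens = text.lower().split()
        return [vocab.setdefault(tok, len(vocab)) for tok in tokens]

    vocab = {}
    values = prompt_text_to_values("park the car in the closest parking space", vocab)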
In general, a prompt encodes information that can enable the system 100 to infer the task to be performed by the agent 120 in the environment 118, and thus to select actions to be performed by the agent 120 to complete the task. The system may be able to infer the task to be performed from the format of the observations and actions in the representation of the task state. However, the system may sometimes need further context to disambiguate the task, and this can be provided by the prompt.
In some cases, as described above, the prompt represents a demonstration of a task in a different environment (e.g., different from the environment 118), or a demonstration of a task by a different agent (e.g., configured differently from the agent 120), or a demonstration of a different but related task to the task to be performed by the agent 120. In these cases, the system 100 may combine the information encoded in the prompt with the information encoded in the observations 116 of the environment 118 to infer the task to be performed by the agent 120 in the environment 118. For example, if the agent is a home robot and the task to be performed by the agent involves cleaning a target house, the prompt may include a demonstration of an agent cleaning a different house. In this example, the system may combine the information representing the cleaning task encoded in the prompt with the information representing the target house encoded in the observations received by the system to implicitly infer that the task to be performed by the agent involves cleaning the target house.
In some cases, even without a prompt, the system 100 may implicitly infer the task to be performed by the agent 120, in particular based on information encoded in the observations received by the system 100. For example, the system 100 may implicitly infer from the observations that the agent 120 is interacting with a type of environment in which, during training, agents typically performed a particular task, and on that basis, select actions to be performed by the agent to accomplish that particular task.
At each time step, the system 100 receives a current observation 116 characterizing the state of the environment 118 at the time step, and updates the current task state 110 using the current observation 116. For example, the system 100 may represent the current observation 116 as a sequence of data elements and update the current task state 110 by concatenating the sequence of data elements representing the current observation to the sequence of data elements representing the current task state 110. That is, in this example, the updated task state is represented by a sequence of data elements defined by concatenating: (i) A sequence of data elements representing the current task state 110, and (ii) a sequence of data elements representing the current observation 116.
The system 100 may represent the current observation 116 for the time step as a sequence of data elements in any suitable manner. The sequence of data elements representing the current observation 116 may be, for example, a sequence of values, a sequence of embeddings, or a sequence that includes values at some positions and embeddings at other positions. Several example techniques for representing the current observation 116 as a sequence of data elements are described next.
Typically, when received by the system 100, the current observation 116 is defined by an ordered set of values (e.g., a vector, matrix, or other tensor of values). (the number of values in an ordered set of values defining an observation may be referred to as the "dimension" of the observation). In some implementations, the system 100 can represent the set of values defining the current observation 116 as a sequence of values, for example, by concatenating the values defining the current observation into a sequence of values in an arbitrary but fixed order.
For example, if the observation 116 includes an image represented by an array of pixel intensity values, the system 100 may represent the array of pixel intensity values as a sequence of values by concatenating each pixel intensity value in the array representing the image in an arbitrary but fixed order. If the array of pixel intensity values is an N x N array having N rows and N columns, the system may concatenate the pixel intensity values row by row, for example, proceeding from the first position to the last position within each row, and from the first row to the last row of the array.
As another example, if the observations include a position value (e.g., representing the position of the agent in the environment), a velocity value (e.g., representing the velocity of the agent in the environment), and an acceleration value (e.g., representing the acceleration of the agent in the environment), the system may concatenate these values in any predefined order, e.g., position value, followed by velocity value, followed by acceleration value.
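A sketch of this flattening, assuming an observation made up of an image and position, velocity, and acceleration values (the component names and the chosen order are assumptions; any fixed order works as long as it is applied consistently):

    import numpy as np

    def flatten_observation(obs):
        # Concatenate the observation's values into one sequence in a fixed, predefined order.
        parts = [
            np.asarray(obs["image"]).reshape(-1),      # row-major scan of the pixel array
            np.atleast_1d(obs["position"]),
            np.atleast_1d(obs["velocity"]),
            np.atleast_1d(obs["acceleration"]),
        ]
        return np.concatenate(parts)

    obs = {"image": np.zeros((4, 4)), "position": 0.1, "velocity": -0.3, "acceleration": 0.0}
    seq = flatten_observation(obs)                     # length 16 + 3 = 19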
In some implementations, to generate a representation of the observation 116 (or some portion of the observation 116) as a sequence of values, the system 100 first generates an embedding (e.g., a low-dimensional embedding) of the observation 116 by processing the observation 116 using an encoder machine learning model. The system 100 may then concatenate the embedded values defining the observations into a sequence of values representing the observations in an arbitrary but fixed order. The encoder machine learning model may be, for example, an encoder neural network of an automatic encoder machine learning model.
In some implementations, the system 100 can generate the representation of the observation as one or more sequences of embeddings. For example, the system 100 may generate a representation of an image as a sequence of embeddings by dividing the image into a sequence of tiles and then generating a corresponding embedding for each tile using an encoder machine learning model. The system may then concatenate the respective embeddings of the image tiles to generate the representation of the image as a sequence of embeddings. The encoder machine learning model may be implemented as a neural network having any suitable neural network architecture. Several examples of possible architectures for the encoder neural network (i.e., the neural network implementing the encoder machine learning model) are described next.
In one example, the encoder neural network may have a residual neural network architecture that includes a sequence of residual blocks, e.g., where the input of each residual block is added to the output of the residual block. In a particular example, the encoder neural network can be implemented using a ResNet-v2 architecture, e.g., as described in He et al., "Identity mappings in deep residual networks," European Conference on Computer Vision, pages 630-645, 2016. The encoder neural network may be configured to receive a respective initial embedding representing each tile in the image. The initial embedding of an image tile may be based on: (i) a tile pixel embedding representing the pixels in the tile, e.g., obtained by concatenating the pixels of the tile into a vector; and (ii) a tile position embedding representing the position of the tile in the image. For example, the initial embedding of an image tile may be a sum or concatenation of the tile pixel embedding and the tile position embedding of the tile. The encoder neural network may be configured to process the initial embedding of each image tile to generate a final embedding of the tile.
The system may generate the tile position embedding of an image tile, i.e., an embedding representing the position of the image tile in the image from which it was extracted, in any suitable manner. For example, the system may compute the relative row and column intervals of the tile by normalizing the tile's pixel intervals by the image resolution. The system may quantize the normalized row and column intervals into a finite vocabulary of indices that index: (i) a table storing row position encodings; and (ii) a table storing column position encodings. The system may retrieve the indexed row and column position encodings and sum (or otherwise combine) them to produce the tile position embedding.
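A sketch along these lines, in which the table sizes, the vocabulary of 128 indices, and the use of the tile's starting pixel coordinates are all assumptions:

    import numpy as np

    def tile_position_embedding(row_start_px, col_start_px, image_px, row_table, col_table):
        # Normalize the tile's pixel interval by the image resolution, quantize the
        # normalized interval to an index, and look up row/column position encodings.
        vocab_size = row_table.shape[0]
        def quantize(start_px):
            rel = start_px / image_px                  # relative interval in [0, 1)
            return int(np.clip(np.floor(rel * vocab_size), 0, vocab_size - 1))
        row_code = row_table[quantize(row_start_px)]
        col_code = col_table[quantize(col_start_px)]
        return row_code + col_code                     # sum the row and column encodings

    rng = np.random.default_rng(0)
    row_table = rng.normal(size=(128, 32))             # learned lookup tables in practice
    col_table = rng.normal(size=(128, 32))
    emb = tile_position_embedding(16, 48, image_px=64, row_table=row_table, col_table=col_table)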
As another example, the encoder neural network may have an attention-based neural network architecture. For example, the encoder neural network may include one or more self-attention neural network layers. The encoder neural network may repeatedly update the initial embeddings of the image tiles (as described above), e.g., using the self-attention neural network layers, to generate a corresponding final embedding for each image tile. In a particular example, the encoder neural network may have a vision transformer architecture, e.g., as described in A. Dosovitskiy et al., "An image is worth 16x16 words: Transformers for image recognition at scale," arXiv:2010.11929v2, 2021.
The encoder neural network may be trained jointly with the action selection neural network 108, e.g., to optimize a loss function. For example, the training engine 112 may backpropagate gradients of the loss function through the action selection neural network 108 and into the encoder neural network.
In some cases, the observation for a time step may include a plurality of constituent observations. For example, the observation for the time step may include respective images captured by a plurality of camera sensors of the agent. In these examples, as part of representing the observation as a sequence of data elements, the system may combine (e.g., sum) a respective "observation-level" position embedding with each data element in the sequence. The observation-level position embedding of a data element characterizes, e.g., an index of the constituent observation represented by the data element. The observation-level position embedding may be, for example, a learned or predefined embedding. The action data elements for the time step may be combined with an action embedding, which may be the same for each action data element.
Optionally, the system 100 may perform "return conditioning" by generating (at one or more time steps) an additional value representing a target return to be achieved through the interaction of the agent 120 with the environment 118 (a "return value"), and combining the return value with the sequence of data elements representing the current observation. For example, the system 100 may perform return conditioning by concatenating the return value to the sequence of data elements representing the current observation 116.
In general, the system 100 aims to select actions 102 that maximize the return received by the agent 120. Thus, the system 100 may set the return value to a predefined "expert" return value that represents the return to be achieved by expert execution of the task. The system may compute the expert return value, for example, as the average return achieved when an agent performs the task one or more times under the control of an expert (e.g., a human expert). Performing return conditioning enables the training engine 112 to efficiently train the system 100 based on training examples in which the agent receives a range of possible returns (optionally including low returns), as will be described in more detail below. The training engine 112 may normalize the return values used during training by dividing them by the expert return value, so that the optimal return value for each task is the same predefined value, e.g., the value 1, as will be described in more detail below.
Optionally, if the system 100 uses a prompt that includes a task demonstration to initialize the current task state 110, the prompt may also be return-conditioned. For example, each observation in the prompt may include an additional value representing the return achieved during the task demonstration.
As part of representing the current observation 116 as a sequence of data elements, the system 100 may discretize each value in the set of values defining the current observation 116. Discretizing a value refers to mapping the value to a corresponding value from a finite set of predefined "discretized" values (e.g., integer values in the range [0, 255]). To discretize a value, the system 100 may first apply a transform function (e.g., a μ-law transform function) to the value to map it into a predefined range (e.g., the range [-1, 1]). The predefined range may be associated with a predefined partition into a set of intervals, and each of the intervals may be associated with a corresponding discretized value from the predefined set of discretized values. Applying the transform function to a value causes the value to fall into one of the intervals, and the system 100 may discretize the value by mapping it to the discretized value associated with that interval.
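A sketch of such a discretization using a μ-law style transform followed by uniform binning over [-1, 1]; the constants (μ = 100, M = 256, 256 bins) are assumptions rather than values stated in this description:

    import numpy as np

    def mu_law_discretize(values, mu=100.0, m=256.0, num_bins=256):
        # Squash each value into [-1, 1] with a mu-law style transform, then map it to
        # the index of the uniform interval it falls into.
        x = np.asarray(values, dtype=np.float64)
        squashed = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu * m)
        squashed = np.clip(squashed, -1.0, 1.0)
        bins = np.floor((squashed + 1.0) / 2.0 * num_bins).astype(int)
        return np.clip(bins, 0, num_bins - 1)          # integer values in the range [0, 255]

    tokens = mu_law_discretize([-3.2, 0.0, 0.7, 12.5])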
After updating the current task state 110 with the current observation 116, the system 100 processes the current task state 110 to autoregressively generate a sequence of one or more data elements that collectively represent the action 102 to be performed by the agent at the current time step. Each data element in the sequence of data elements representing the action 102 will be referred to herein as an "action data element" 104, i.e., the sequence of action data elements 104 defining the action 102 is generated using the action selection neural network 108.
The system 100 sequentially generates a respective action data element 104 at each position in the sequence of action data elements, starting from a first position in the sequence of action data elements defining the current action 102. The system 100 generates each action data element 104 by processing the current task state 110 using the action selection neural network 108 according to the parameter values of the action selection neural network 108 to generate a score distribution 106 over a set of possible action data elements. The set of possible action data elements may be any suitable set of data elements, for example integer values in the range of [0,255], or a predefined embedded set. The system 100 then uses the score distribution 106 over the set of possible action data elements to select the action data element 104. For example, the system 100 may select the action data element 104 with the highest score according to the score distribution 106. As another example, the system 100 may sample the action data elements from the set of possible action data elements according to a probability distribution over the set of possible action data elements (e.g., a probability distribution that may be generated by processing the score distribution 106 using a soft-max function).
In some cases, for one or more positions in the sequence of action data elements defining the current action 102, the set of valid action data elements at that position may be a proper subset (i.e., fewer than all) of the set of possible action data elements. An action data element at a position may be referred to as "valid" if an action that includes the action data element at that position represents an actionable action that can be performed by the agent. For example, if the action data element at a position represents a torque to be applied to a joint of a robotic agent, the robotic agent can apply M possible torques to the joint, and the set of possible action data elements includes N > M action data elements, then the M action data elements representing the possible torques may be designated as the valid action data elements at that position. The system may ensure that the action data element selected at each position is a valid action data element, e.g., by selecting the valid action data element with the highest score according to the score distribution over the set of possible action data elements at that position.
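A sketch of restricting selection to the valid action data elements at a position by masking the score distribution (the particular set of valid indices shown is illustrative):

    import numpy as np

    def select_valid_element(scores, valid_ids):
        # Pick the highest-scoring data element among those designated valid for this position.
        masked = np.full_like(scores, -np.inf, dtype=np.float64)
        masked[valid_ids] = scores[valid_ids]
        return int(np.argmax(masked))

    scores = np.random.default_rng(0).random(256)
    element = select_valid_element(scores, valid_ids=np.arange(64))  # e.g. only 64 torques are actionable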
After generating each action data element 104, the system 100 updates the current task state 110 by concatenating the action data element 104 to the current task state 110 before generating the next action data element 104 in the sequence of action data elements 104 that define the current action 102. Thus, the action selection neural network 108 autoregressively generates a sequence of action data elements 104, i.e., because the action data elements 104 at each location are generated by processing the current task state 110 that includes the action data elements 104 generated for each previous location. An example of selecting an action to be performed by an agent by autoregressively generating an action data element using the action selection neural network 108 is described with reference to fig. 3A and 3B.
The sequence of action data elements 104 defines the action 102 to be performed by the agent 120 at the time step. For example, if the agent is a mechanical agent, the action data element 104 at each position in the sequence may define a torque to be applied to a corresponding joint of the robot. As another example, if the agent is an autonomous vehicle, the action data element 104 at one position may define an acceleration/deceleration to be achieved by the vehicle, and the action data element 104 at another position may define the steering to be achieved by the vehicle.
Optionally, a hyper-parameter of the system 100 may specify a maximum length for the current task state 110, i.e., a maximum number of data elements that may be included in the current task state 110. When the system 100 concatenates data elements representing new observations and actions to the "terminal" end of the current task state, the length of the current task state increases. The system may therefore remove data elements from the "initial" end of the current task state as needed to ensure that the length of the current task state remains at most the maximum length. (The terminal end of the current task state refers to the position occupied by the final data element in the sequence of data elements representing the current task state, and the initial end refers to the position occupied by the first data element in that sequence.)
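A sketch of enforcing such a maximum length by dropping data elements from the initial end of the task state (the maximum length shown is an assumption):

    def trim_task_state(task_state, max_len=1024):
        # Keep only the most recent data elements so the task state never exceeds max_len.
        if len(task_state) > max_len:
            task_state = task_state[-max_len:]
        return task_state

    state = trim_task_state(list(range(2000)))   # keeps the final 1024 data elements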
The action selection neural network 108 may have any suitable neural network architecture that enables it to perform its described function (i.e., processing the current task state 110 to generate a score distribution over a set of possible action data elements). In particular, the action selection neural network may include any suitable number of layers (e.g., 5, 10, or 100 layers) and any suitable types of neural network layers (e.g., attention layers, convolutional layers, fully-connected layers, etc.) connected in any suitable configuration (e.g., as a linear sequence of layers).
Several examples of possible architectures for the action selection neural network 108 are described next. In each of these examples, the action selection neural network may include an embedding layer configured to map each data element that is represented as a numerical value in the sequence of data elements defining the current task state 110 to a corresponding embedding in an embedding space. The embedding layer may leave unmodified any data elements in the sequence of data elements defining the current task state 110 that are already represented as embeddings. That is, the embedding layer may represent the current task state 110 as a set of embeddings, e.g., by replacing each value included in the current task state 110 with a corresponding embedding according to a predefined mapping from values to embeddings.
Optionally, for each position in the current task state 110, the embedding layer may combine (e.g., sum or average) the embedding for that position with a position embedding representing the position in the current task state. Such position embeddings may enable the action selection neural network to make use of the order of the data elements in the current task state 110 without relying on recurrence or convolution.
In one example, the action selection neural network 108 may process the current task state 110 using an embedding layer to generate a set of embeddings representing the current task state 110. The action selection neural network 108 may then process the embeddings representing the current task state 110 using a sequence of neural network layers that includes one or more self-attention layers (e.g., self-attention layers using a query-key-value attention mechanism) to generate an updated set of embeddings. The action selection neural network 108 may process the updated embeddings using one or more final neural network layers to project the updated embeddings onto a score distribution over the set of possible action data elements. In a particular example, the action selection neural network 108 may have the architecture of a transformer neural network (a neural network characterized by a series of self-attention neural network layers), for example, the decoder of the transformer neural network described in Vaswani et al., "Attention is all you need", arXiv:1706.03762v5, 6 December 2017. The transformer neural network may include a memory to facilitate processing longer sequences of data elements representing the current state of a task. For example, it may have the Transformer-XL architecture described in Dai et al., "Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context", arXiv:1901.02860v3, 2 June 2019.
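A minimal PyTorch sketch of this transformer variant is shown below. The vocabulary size, model width, layer count, and the choice to return scores only for the next data element (rather than implementing Transformer-XL-style memory) are illustrative assumptions rather than details taken from this specification.

import torch
import torch.nn as nn

class ActionSelectionTransformer(nn.Module):
    def __init__(self, vocab_size=1024, max_len=1024, d_model=256, n_layers=4, n_heads=8):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        self.position_embedding = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.self_attention_layers = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.output_head = nn.Linear(d_model, vocab_size)

    def forward(self, task_state):
        # task_state: [batch, length] integer data elements representing the current task state.
        length = task_state.shape[1]
        positions = torch.arange(length, device=task_state.device)
        x = self.token_embedding(task_state) + self.position_embedding(positions)
        # Causal mask: each position may attend only to itself and earlier positions.
        causal_mask = torch.triu(torch.ones(length, length, dtype=torch.bool,
                                            device=task_state.device), diagonal=1)
        x = self.self_attention_layers(x, mask=causal_mask)
        # Project the updated embedding at the final position onto scores over possible data elements.
        return self.output_head(x[:, -1, :])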
In another example, the action selection neural network 108 may include an embedding layer followed by a fully connected layer that is applied individually to the respective embedding of each data element in the sequence of data elements representing the current task state. The updated embeddings generated by the fully connected layer may be combined (e.g., averaged) and then processed by a final fully connected neural network layer to generate a score distribution over the set of possible action data elements.
In another example, the action selection neural network 108 may be a recurrent neural network (RNN), such as a long short-term memory (LSTM) neural network. The RNN may be configured to process the embedding representing a data element to update a hidden state (e.g., a cell state) of the RNN, and optionally to process the updated hidden state to generate a score distribution over the set of possible action data elements. After receiving an observation, the RNN may process, one at a time and in order from the first position in the sequence, the respective embedding corresponding to each data element in the sequence of data elements representing the observation, to repeatedly update the hidden state of the RNN. The RNN may then autoregressively generate the sequence of data elements defining the action to be performed in response to the observation. In particular, for each position in the sequence of action data elements, the RNN processes its current hidden state to generate a score distribution over the set of possible action data elements that is used to select the action data element for that position. The RNN then processes the embedding representing the action data element selected for that position to update its hidden state before generating a score distribution over the set of possible action data elements for the next position.
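The recurrent variant can be sketched as below; the split into an update method and a scoring method, the dimensions, and the use of a single LSTM cell are illustrative assumptions made only for this example.

import torch
import torch.nn as nn

class ActionSelectionRNN(nn.Module):
    def __init__(self, vocab_size=1024, d_model=256):
        super().__init__()
        self.d_model = d_model
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.lstm_cell = nn.LSTMCell(d_model, d_model)
        self.output_head = nn.Linear(d_model, vocab_size)

    def initial_state(self):
        return (torch.zeros(1, self.d_model), torch.zeros(1, self.d_model))

    def update(self, data_elements, state):
        # Process data elements one at a time, in order, to update the LSTM hidden/cell state.
        h, c = state
        for element in data_elements:
            emb = self.embedding(torch.tensor([element], dtype=torch.long))
            h, c = self.lstm_cell(emb, (h, c))
        return (h, c)

    def next_scores(self, state):
        # Scores over the set of possible data elements given the current hidden state.
        h, _ = state
        return self.output_head(h)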
In some implementations, the action selection system 100 may simulate observations of the environment rather than receiving them from an external environment. More specifically, at each time step, the action selection system 100 may autoregressively generate a sequence of data elements representing the current action performed by the agent at that time step, and then autoregressively generate a sequence of data elements representing the observation at the next time step.
The action selection system 100 may autoregressively generate the sequence of data elements representing the observation at the next time step by sequentially generating a respective data element for each position, starting from the first position in the sequence of data elements representing the observation. In particular, for each position in the sequence of data elements representing the observation, the system may process the current task state 110 using the action selection neural network to generate a score distribution over a set of possible data elements. The system may select the data element for the position in the sequence of data elements representing the observation according to the score distribution. For example, the system may select the data element with the highest score under the score distribution, or the system may sample a data element according to a probability distribution over the set of data elements defined by the score distribution. The system may then update the current task state 110 by concatenating the selected data element to the current task state 110. The system may continue to autoregressively generate the sequence of data elements representing the observation until a termination criterion is met, e.g., until the system has generated a predefined number of data elements that together define the observation.
In some implementations, the action selection system 100 may be configured to generate actions only, i.e., without receiving or generating observations, in particular by not including observations in the current task state 110 (with the possible exception of any observations included in a prompt provided to the action selection system 100). For example, the action selection system 100 may be configured to perform a text question-and-answer task (as described in more detail below) by generating a sequence of actions representing a text response to a question (i.e., without generating or receiving any intermediate observations).
In some implementations, the action selection system 100 may be configured to generate observations only, i.e., without receiving or generating actions, in particular by not including actions in the current task state 110 (with the possible exception of any actions included in a prompt provided to the action selection system 100). For example, the action selection system 100 may be configured to perform a video generation task, in particular by generating a sequence of video frames, wherein each video frame represents a respective observation. For example, the action selection system 100 may be configured to receive a prompt defining a theme of the video (e.g., "generate a video on how to change a tire on an automobile"), and in response, the action selection system 100 may generate a corresponding sequence of video frames related to the theme.
In some implementations, the action selection system 100 can be configured to receive an action for each time step at that time step, generate a representation of the action as a sequence of data elements, and concatenate the sequence of data elements representing the action to the current task state 110. The action selection system 100 may then autoregressively generate a sequence of data elements representing the next observation using the action selection neural network 108 (as described above), and then proceed to the next time step. That is, rather than using the action selection neural network 108 to generate actions, the action selection system 100 may receive actions from an external source, and may use the action selection neural network 108 to simulate observations that would result from performing the actions. In some cases, one or more actions may be specified by a user, for example, through an Application Programming Interface (API) available to the action selection system 100. In some cases, one or more actions may be selected using an external action selection policy that is parameterized in any suitable manner (e.g., by an external neural network).
Thus, in some embodiments, the action selection system 100 may generate the action while receiving the observation from an external source (e.g., an environment), while in other embodiments, the action selection system 100 may generate the observation while receiving the action from an external source (e.g., a user). Generating observations while receiving actions from external sources may enable the action selection system 100 to generate an observation sequence that simulates the effects of performing certain actions in an environment.
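As an illustration of receiving actions from an external source and simulating the resulting observations, the sketch below reuses the hypothetical action_selection_net callable introduced above and assumes, purely for this example, an observation of a fixed, known length and greedy selection of each data element.

import torch

def simulate_next_observation(action_selection_net, task_state, external_action_elements, obs_length):
    # Concatenate the externally provided action data elements to the current task state.
    task_state = list(task_state) + list(external_action_elements)
    observation_elements = []
    for _ in range(obs_length):
        scores = action_selection_net(torch.tensor(task_state, dtype=torch.long).unsqueeze(0))
        element = int(torch.argmax(scores, dim=-1))    # greedy selection under the score distribution
        observation_elements.append(element)
        task_state = task_state + [element]
    return observation_elements, task_state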
The training engine 112 may train the action selection neural network 108 based on training data 114 comprising a set of training examples. Each training example is represented as a sequence of data elements, e.g., a sequence of values, a sequence of embeddings, or a sequence that includes values at some positions and embeddings at other positions. Thus, the action selection neural network 108 may be trained offline in a supervised manner. Alternatively or in combination, the action selection neural network 108 may be trained, partially or entirely, using offline or online reinforcement learning.
To train the action selection neural network 108 based on a training example represented as a sequence of data elements, the training engine 112 may generate a respective prediction for each of one or more data elements included in the training example. To generate a prediction of a specified data element in the training example, the training engine 112 may process the subsequence of data elements preceding the specified data element in the training example (which collectively represent a "current task state") to generate a score distribution over a set of possible data elements. The training engine 112 may determine the gradient of a loss function that measures an error (e.g., a cross-entropy error) between: (i) the score distribution over the set of possible data elements, and (ii) the specified data element in the training example. The training engine 112 may determine the gradients of the loss function with respect to the parameter values of the action selection neural network, for example, using backpropagation. The training engine 112 may use the gradients of the loss function to adjust the current values of the action selection neural network parameters using any suitable gradient descent optimization algorithm (e.g., Adam or RMSprop).
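A sketch of such a training step is shown below. Unlike the control-time sketches above, it assumes (as an illustrative convention) that the network returns scores for every position in parallel, and it uses Adam only as an example optimizer.

import torch
import torch.nn.functional as F

def training_step(action_selection_net, optimizer, training_example):
    # training_example: [length] tensor of integer data elements for one training example.
    # The data elements before each position form the "current task state"; the data
    # element at that position is the prediction target.
    inputs = training_example[:-1].unsqueeze(0)     # shape [1, length - 1]
    targets = training_example[1:].unsqueeze(0)     # shape [1, length - 1]
    scores = action_selection_net(inputs)           # assumed shape [1, length - 1, vocab]
    loss = F.cross_entropy(scores.reshape(-1, scores.shape[-1]), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()      # gradients of the loss w.r.t. the action selection network parameters
    optimizer.step()     # e.g., torch.optim.Adam(action_selection_net.parameters())
    return float(loss)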
Each data element included in a training example may be designated as: an action data element, an observation data element, or a prompt data element. An action data element refers to a data element in a sequence of one or more data elements representing an action (as described above). An observation data element refers to a data element in a sequence of one or more data elements representing an observation. (Optionally, if the training engine 112 performs return conditioning, one of the observation data elements in the sequence of observation data elements for an observation may represent a return, as described in more detail below.) A prompt data element refers to a data element in a sequence of one or more data elements representing a prompt.
In some implementations, the training engine 112 trains the action-selecting neural network to predict only the actions included in each training example. That is, the training engine 112 trains the action selection neural network to generate only the data elements specified as action data elements in each training example (e.g., by masking out other data elements).
In other implementations, the training engine 112 trains the action-selecting neural network to predict both actions and observations included in each training example. That is, the training engine 112 trains the action selection neural network to generate data elements designated as action data elements or observation data elements in each training example.
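The choice between predicting only action data elements and predicting both action and observation data elements can be expressed as a loss mask, sketched below with illustrative shapes; the function name and argument layout are assumptions made only for this example.

import torch
import torch.nn.functional as F

def masked_prediction_loss(scores, targets, target_mask):
    # scores: [length, vocab]; targets: [length] integer data elements;
    # target_mask: [length] bool, True at positions the network is trained to generate
    # (e.g., only action data elements, or action and observation data elements).
    per_element = F.cross_entropy(scores, targets, reduction="none")
    per_element = per_element * target_mask.float()
    return per_element.sum() / target_mask.float().sum().clamp(min=1.0)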
In general, the action selection neural network may generate a score distribution over a set of possible observation data elements in the same manner that the action selection neural network generates a score distribution over a set of possible action data elements. In some cases (e.g., if the set of possible observation data elements is different from the set of possible action data elements), the action selection neural network includes one output head (i.e., a sub-network) configured to generate a score distribution over the set of possible observation data elements, and a separate output head configured to generate a score distribution over the set of possible action data elements.
Training the action selection neural network to predict observations and actions included in the training examples causes the action selection neural network to implicitly learn a model of environmental dynamics, which may enable the action selection neural network to select actions for more efficiently performing tasks.
When the action selection neural network is used to select actions to be performed to control the agent, the action selection neural network autoregressively generates the action data elements. During training, however, the training engine may use the action selection neural network to generate predictions for every data element included in each training example in parallel, which may significantly improve training efficiency.
Optionally, the training engine may autoregressively generate predictions of actions, observations, or both during training. For example, to generate predictions of the action data elements defining an action in a training example, the training engine may initialize a current task state that includes the subsequence of data elements preceding the first action data element in the sequence of action data elements defining the action in the training example. The training engine may then process the current task state to autoregressively generate predictions of the action data elements defining the action, as described above.
In general, any sequence of data elements from any suitable source may be used as a training example for training the action selection neural network 108. Thus, the training engine 112 may train the action selection neural network 108 based on training examples from a variety of sources including, for example, both simulation and real world data. Some example techniques for generating training examples for training an action selection neural network are described in more detail below.
In one example, the training engine 112 may generate training examples that represent interactions of agents with the environment over a sequence of time steps. The interaction of an agent with the environment may be expressed in the form: (s_1, a_1, r_1), (s_2, a_2, r_2), ..., (s_N, a_N, r_N), where N is the number of time steps, s_i is the environment state at time step i, a_i is the action performed by the agent at time step i, and r_i is the reward received at time step i. In general, each state s_i and each action a_i may be represented as an ordered set of values, for example, as a vector, matrix, or other tensor of values. (The number of values in the ordered set of values defining an action may be referred to as the "dimension" of the action.) The use of rewards is optional: for example, the rewards may be used to filter the training examples, e.g., to select training examples that achieve at least a threshold percentage of the return achieved by an expert agent performing the task.
To generate a training example that represents an agent's interaction with the environment, the training engine 112 represents each observation as a sequence of data elements, e.g., a sequence of values or a sequence of embeddings. For example, the training engine 112 may represent the respective set of values defining each observation as a sequence of values, e.g., by concatenating the values defining the observation into a sequence of values in an arbitrary but fixed order. (Example techniques for representing observations as sequences of embeddings are described above.) The training engine 112 may perform return conditioning by determining a return, e.g., by computing a time-discounted sum of the rewards, and then concatenating the return to the sequence of data elements representing each observation. Optionally, the training engine 112 may normalize the return, for example, by dividing the return by an expert return for the task performed by the agent, e.g., an average return achieved when the agent performs the task one or more times under the control of an expert (e.g., a human expert). The training engine 112 also represents each action as a sequence of data elements, e.g., the training engine 112 may represent an action as a sequence of values by concatenating the set of values representing the action into a sequence of values. The training engine 112 then concatenates the respective sequences of data elements representing the respective observations and actions at each time step into a single sequence of data elements. As part of generating the training example, the training engine 112 may optionally discretize the values in the sets of values representing the observations, the actions, or both.
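The construction of such a training example can be sketched as follows; the value range, number of bins, absence of time discounting, the placement of the return within each observation, and the omission of any prompt are illustrative assumptions.

import numpy as np

def build_training_example(observations, actions, rewards, expert_return=None, num_bins=1024):
    # observations, actions: lists of value arrays, one per time step; rewards: list of scalar rewards.
    bins = np.linspace(-1.0, 1.0, num_bins)
    ret = float(np.sum(rewards))              # return used for return conditioning
    if expert_return is not None:
        ret = ret / expert_return             # optional normalization by an expert return
    sequence = []
    for obs, act in zip(observations, actions):
        obs_values = np.concatenate([np.ravel(obs), [ret]])          # concatenate the return to the observation
        sequence.extend(np.digitize(obs_values, bins).tolist())      # discretized observation data elements
        sequence.extend(np.digitize(np.ravel(act), bins).tolist())   # discretized action data elements
    return sequence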
Optionally, the training engine 112 may generate a representation of a prompt for the training example as a sequence of data elements and concatenate the prompt to the sequence of data elements representing the training example.
Performing return conditioning enables the training engine 112 to effectively train the action selection neural network 108 based on training examples in which the agent achieves a range of possible returns, including low returns. A low return associated with a training example may indicate that the training example represents an interaction in which the agent failed to perform its task. Without return conditioning, training the action selection neural network 108 based on training examples associated with low returns may reduce the performance of the action selection neural network 108, for example, by reinforcing the ineffective action selection policies represented by those training examples. Performing return conditioning enables the action selection neural network to distinguish between training examples that represent effective and ineffective action selection policies, which may enhance the performance of the action selection neural network 108.
In another example, the training engine 112 may generate "language modeling" training examples that each represent a text sequence in a natural language. The training engine 112 may represent the text sequence as a sequence of tokens from a predefined set of possible tokens (e.g., characters, n-grams, or words), and then replace each token with a corresponding data element (e.g., an integer identifier indexing the token in the set of possible tokens, or an embedding). The training example may then be represented by the sequence of data elements identifying the sequence of tokens, where each data element is designated as an action data element (i.e., such that, in this case, the training example does not include any observation data elements).
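A language modeling training example can be sketched as below; the whitespace tokenizer and the token_to_id mapping are illustrative stand-ins for any tokenizer and vocabulary, and the returned mask mirrors the loss-mask convention sketched earlier.

def language_modeling_example(text, token_to_id):
    # token_to_id: a predefined mapping from tokens (e.g., words) to integer identifiers.
    tokens = text.split()
    data_elements = [token_to_id[token] for token in tokens]
    # Every data element of a language modeling example is designated as an action data element.
    target_mask = [True] * len(data_elements)
    return data_elements, target_mask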
In another example, training engine 112 may generate an "image captioning" training example that represents: (i) an image; and (ii) an image caption defining a text sequence describing the content of the image. For example, training engine 112 may generate the training example by concatenating the respective sequences of data elements representing the image and the image caption. The sequence of data elements representing the image may be designated as a sequence of observation data elements, and the sequence of data elements representing the image caption may be designated as a sequence of action data elements.
In another example, training engine 112 may generate a "text question and answer" training example that represents: (i) a text question; and (ii) a text answer responsive to the text question. For example, training engine 112 may generate the training example by concatenating the respective sequences of data elements representing the text question and the text answer. The sequence of data elements representing the text question may be designated as a sequence of observation data elements, and the sequence of data elements representing the text answer may be designated as a sequence of action data elements.
In another example, training engine 112 may generate a "visual question and answer" training example that represents: (i) an image and a text question associated with the image; and (ii) a text answer responsive to the text question. For example, training engine 112 may generate the training example by concatenating the respective sequences of data elements representing the image, the text question, and the text answer. The sequence of data elements representing the image and the text question may be designated as a sequence of observation data elements, and the sequence of data elements representing the text answer may be designated as a sequence of action data elements.
In another example, training engine 112 may generate an "image classification" training example that represents: (i) an image; and (ii) a classification of the image into a category from a predefined set of categories. For example, each category may represent a respective type of object; an image may be classified as being included in a category if the image shows the type of object represented by that category; and each category may be represented by a respective numerical value. The training engine 112 may generate the training example by concatenating: (i) a sequence of data elements representing the image; and (ii) a numerical value representing the classification of the image. The sequence of data elements representing the image may be designated as a sequence of observation data elements, and the numerical value representing the classification of the image may be designated as an action data element.
The training engine 112 may train the action selection neural network 108 based on sets of training examples from multiple different domains. In particular, training engine 112 may train the action selection neural network based on training examples that represent interactions of a plurality of different agents with a plurality of different environments to perform a plurality of different tasks. (Examples of possible agents, environments, and tasks are described above.) Training the action selection neural network 108 based on training examples from multiple domains may encode a flexible and transferable understanding of agent control in the parameters of the action selection neural network, which may enable the action selection neural network to generalize quickly and efficiently to new domains. In particular, training the action selection neural network 108 over multiple domains may enable the action selection neural network 108 to achieve acceptable performance on tasks in a new domain after being trained on only a small number of training examples from the new domain. In some cases, training over multiple domains may enable the action selection neural network 108 to achieve acceptable performance on tasks in a new domain even though the action selection neural network has not been trained on any training examples from that domain.
Training the action selection neural network 108 based on training examples other than those representing agent interactions with an environment (e.g., the language modeling, image captioning, text question-and-answer, visual question-and-answer, and image classification training examples described above) may speed up training and improve the performance of the action selection neural network. For example, training the action selection neural network based on language modeling training examples may improve the ability of the action selection neural network to implicitly infer the meaning of natural language prompts provided for control tasks. This may also facilitate generalization, e.g., to tasks in environments on which the system has not been trained.
In general, training examples from different domains may use sequences of data elements of different lengths to represent actions and observations. For example, as shown in Fig. 2, a training example from "domain #1" 202 represents an observation using a sequence of four data elements and an action using a sequence of two data elements, whereas a training example from "domain #2" 204 represents an observation using a sequence of three data elements and an action using a sequence of three data elements. This can be problematic for conventional action selection neural networks, which may, for example, have a neural network architecture configured to process fixed-size observations to generate fixed-size actions. In contrast, the operation of the action selection neural network 108 can be flexibly adapted to handle training examples from any domain, regardless of the domain-specific dimensions of observations and actions. For example, to generate an action having the dimension appropriate for a particular domain, the action selection neural network 108 may continue to autoregressively sample action data elements until the generated action has the appropriate dimension.
Fig. 3A and 3B illustrate operations performed by the action selection system 100 to select actions to be performed by an agent interacting with an environment to complete a task.
Fig. 3A illustrates the operations performed to autoregressively generate a sequence of action data elements representing the action to be performed by the agent at a first time step (i.e., t=0). The system initializes the current task state 304, in this example using a prompt 302. The prompt 302, represented as a sequence of data elements, can include any suitable data related to the task to be performed by the agent, such as a demonstration of the task or natural language instructions related to the task.
The system 100 receives an observation from the environment that represents the current state of the environment, for example, in the form of an image of the environment. The system represents the observation as a sequence of observation data elements 310, for example, by concatenating the values in the set of values representing the observation into a sequence in an arbitrary but fixed order.
The system 100 then concatenates the observation data elements to the current task state 304.
The system 100 processes the current task state 304 using the action selection neural network 108 to generate a probability distribution over a set of possible action data elements, and then selects the action data element 312 according to the probability distribution over the set of possible action data elements.
The system 100 concatenates the action data element 312 to the current task state 306 and processes the updated task state 306 using the action selection neural network 108 to generate another action data element 314. More specifically, the system processes the updated task state 306 to generate a probability distribution over the set of possible action data elements, and then selects the action data element 314 according to the probability distribution over the set of possible action data elements.
The generated sequence of action data elements (i.e., including action data elements 312 and 314) defines an action 316 to be performed by the agent at the first time step.
Fig. 3B illustrates operations performed to autoregressively generate a sequence of action data elements representing an action to be performed at a second time step (i.e., t=1).
The agent performs an action 316 selected at a first time step (t=0) and the environment transitions to a new state due to the action performed by the agent. The system receives observations characterizing the new state of the environment at a second time step, represents the observations as a sequence of observation data elements 318, and concatenates the observation data elements to the current task state 326. Thus, the current task state 326 includes the prompt 302, a sequence of observed data elements representing observations at a first time step, a sequence of action data elements representing actions performed by the agent at the first time step, and a sequence of observed data elements 318 representing observations at a second time step.
The system 100 processes the current task state 326 using the action selection neural network to generate a score distribution over a set of possible action data elements and selects the action data element 320 according to the score distribution over the set of possible action data elements.
The system then concatenates the action data element 320 to the current task state 328.
The system uses the action selection neural network 108 to process the updated task state 328 to generate another score distribution over the set of possible action data elements and to select the action data element 322 according to the score distribution over the set of possible action data elements. The system concatenates the action data element 322 to the current task state 330 and provides the current task state 330 for use in selecting actions to be performed by the agent at the next time step.
The generated sequence of action data elements 320 and 322 defines an action 324 to be performed by the agent at the second time step.
FIG. 4 is a flow diagram of an example process 400 for selecting actions to be performed by an agent to interact with an environment at a current time step. For convenience, process 400 will be described as being performed by a system of one or more computers located at one or more locations. For example, an action selection system (e.g., action selection system 100 of fig. 1) suitably programmed in accordance with the present description may perform process 400.
The system generates a current representation of the state of the task performed by the agent in the environment up to the current time step as a sequence of data elements (402). The sequence of data elements may be, for example, a sequence of values, a sequence of embeddings, or a sequence that includes values at some positions and embeddings at other positions.
The system autoregressively generates a sequence of data elements representing a current action to be performed by the agent at a current time step. In particular, the system performs steps 404-410 for each position starting from the first position in the sequence of data elements representing the current action. For convenience, steps 404-410 will be described as being performed for the current position in the sequence of data elements representing the current action.
The system uses the action selection neural network to process the current representation of the state of the task to generate a score distribution over a set of possible data elements (404).
The system selects a data element for the current position in a sequence of data elements representing the current action based on the score distribution (406).
The system updates the current representation of the state of the task by concatenating the selected data element for the position to the current representation of the state of the task (408).
The system determines whether the current action is complete (410). If the current position is the final position in the sequence of data elements representing the current action, the system determines that the current action is complete and proceeds to step 412. Otherwise, the system determines that the current action is not complete and loops back to step 404.
After autoregressively generating the sequence of data elements representing the current action, the system causes the agent to perform the current action at the current time step (412).
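Putting steps 402-412 together, an illustrative control loop might look like the sketch below. The environment interface (reset and step methods returning observations already encoded as integer data elements), the fixed per-domain action length, and the greedy element selection are assumptions made only for this example, not details of the claimed method.

import torch

def control_loop(action_selection_net, environment, prompt_elements, num_time_steps,
                 action_length, max_state_length):
    task_state = list(prompt_elements)
    observation_elements = environment.reset()          # assumed to return integer data elements
    for _ in range(num_time_steps):
        task_state += list(observation_elements)                                  # step 402
        action_elements = []
        for _ in range(action_length):
            scores = action_selection_net(
                torch.tensor(task_state, dtype=torch.long).unsqueeze(0))          # step 404
            element = int(torch.argmax(scores, dim=-1))                           # step 406
            task_state.append(element)                                            # step 408
            action_elements.append(element)                                       # repeat until complete (410)
        task_state = task_state[-max_state_length:]     # optional trimming to a maximum length
        observation_elements = environment.step(action_elements)                  # step 412
    return task_state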
The term "configuration" is used in this specification in connection with systems and computer program components. For a system of one or more computers configured to perform a particular operation or action, it is meant that the system has installed thereon software, firmware, hardware, or a combination thereof that in operation causes the system to perform the operation or action. For one or more computer programs configured to perform particular operations or actions, it is meant that the one or more programs include instructions that, when executed by a data processing apparatus, cause the apparatus to perform the operations or actions.
Embodiments of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly embodied computer software or firmware, in computer hardware (including the structures disclosed in this specification and their structural equivalents), or in combinations of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible, non-transitory storage medium for execution by, or to control the operation of, data processing apparatus. The computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them. Alternatively or additionally, the program instructions may be encoded on a manually generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by data processing apparatus.
The term "data processing apparatus" refers to data processing hardware and includes all kinds of apparatus, devices and machines for processing data, including for example a programmable processor, a computer, or multiple processors or computers. The apparatus may also be or further comprise a dedicated logic circuit, such as an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit). In addition to hardware, the apparatus may optionally include code that creates an execution environment for the computer program, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.
A computer program (which may also be referred to or described as a program, software application, app, module, software module, script, or code) can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages; and it may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data, e.g., one or more scripts stored in a markup language document, in a single file dedicated to the program in question, or in multiple coordinated files, e.g., files that store one or more modules, subprograms, or portions of code. A computer program can be deployed to be executed on one computer or on multiple computers at one site or distributed across multiple sites and interconnected by a data communication network.
In this specification, the term "engine" is used broadly to refer to a software-based system, subsystem, or process that is programmed to perform one or more particular functions. Typically, the engine will be implemented as one or more software modules or components installed on one or more computers in one or more locations. In some cases, one or more computers will be dedicated to a particular engine; in other cases, multiple engines may be installed and run on the same computer or computers.
The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by special purpose logic circuitry (e.g., an FPGA or an ASIC), or by a combination of special purpose logic circuitry and one or more programmed computers.
A computer suitable for executing a computer program may be based on a general purpose or special purpose microprocessor or both, or any other type of central processing unit. Typically, a central processing unit will receive instructions and data from a read only memory or a random access memory or both. The essential elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data. The central processing unit and the memory may be supplemented by, or incorporated in, special purpose logic circuitry. Typically, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical, or optical disks. However, the computer need not have such devices. Furthermore, the computer may be embedded in another device, such as a mobile phone, a Personal Digital Assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device, such as a Universal Serial Bus (USB) flash drive, to name a few.
Computer readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, such as internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks.
To provide for interaction with a user, embodiments of the subject matter described in this specification can be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback, such as visual feedback, auditory feedback, or tactile feedback; and input from the user may be received in any form, including acoustic, speech, or tactile input. In addition, the computer may interact with the user by sending and receiving documents to and from the device used by the user; for example by sending a web page to a web browser on the user device in response to a request received from the web browser. Further, the computer may interact with the user by sending text messages or other forms of messages to a personal device (e.g., a smart phone running a messaging application) and receiving response messages from the user in return.
The data processing apparatus for implementing machine learning models may also include, for example, dedicated hardware accelerator units for processing the common and compute-intensive parts of machine learning training or production, i.e., inference, workloads.
The machine learning model may be implemented and deployed using a machine learning framework (e.g., a TensorFlow framework).
Embodiments of the subject matter described in this specification can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a client computer having a graphical user interface, a web browser, or an application through which a user can interact with an implementation of the subject matter described in this specification), or any combination of one or more such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include Local Area Networks (LANs) and Wide Area Networks (WANs), such as the internet.
The computing system may include clients and servers. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. In some embodiments, the server sends data (e.g., HTML pages) to the user device, e.g., for purposes of displaying data to and receiving user input from a user interacting with the device acting as a client. Data generated at the user device, e.g., results of a user interaction, may be received at the server from the device.
While this specification contains many specifics of embodiments, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Furthermore, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
Similarly, although operations are depicted in the drawings and described in a particular order in the claims, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Specific embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the accompanying drawings do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous.

Claims (26)

1. A method performed by one or more computers for selecting an action to be performed by an agent to interact with an environment using an action selection neural network, the method comprising, at each time step in a sequence of time steps:
generating as a sequence of data elements a current representation of the state of the task performed by the agent in the environment up to the current time step;
autoregressively generating a sequence of data elements representing a current action to be performed by the agent at the current time step, including, for each position starting from a first position in the sequence of data elements representing the current action:
processing the current representation of the state of the task using the action-selection neural network to generate a score distribution over a set of possible data elements;
selecting a data element for the position in the sequence of data elements representing the current action according to the score distribution; and
updating the current representation of the state of the task by concatenating the selected data element for the position to the current representation of the state of the task; and after autoregressively generating the sequence of data elements representing the current action, causing the agent to perform the current action at the current time step.
2. The method of claim 1, wherein generating a current representation of the state of the task up to the current time step for each time step in the sequence of time steps comprises:
receiving a current observation characterizing a state of the environment at the current time step;
generating a representation of the current observation as a sequence of data elements; and
the representation of the current observation is included as a sequence of data elements in the current representation of the state of the task up to the current time step.
3. The method of claim 2, wherein the current observation is defined by a set of values, and generating the representation of the current observation as a sequence of data elements comprises:
concatenating each value in the set of values defining the current observation into a sequence of values in a predefined order.
4. The method of claim 3, wherein generating the representation of the current observation as a sequence of data elements further comprises:
discretizing each value in the set of values defining the current observation.
5. The method of any of claims 2-4, wherein the current observation characterizing the current state of the environment at the current time step comprises an image defined by an array of pixels.
6. The method of any of claims 2-5, wherein generating the representation of the current observation as a sequence of data elements comprises:
combining a target reward to be achieved by interaction of the agent with the environment with the representation of the current observation as a sequence of data elements, wherein the target reward defines a cumulative metric of rewards to be achieved as a result of interaction of the agent with the environment.
7. The method of any of claims 2-6, wherein for each time step following a first time step in the sequence of time steps, including the representation of the current observation as a sequence of data elements in the current representation of the state of the task up to the current time step comprises:
receiving a representation of the state of the task as a sequence of data elements up to a previous time step; and
concatenating the representation of the current observation as a sequence of data elements to the representation of the state of the task as a sequence of data elements up to the previous time step to generate the current representation of the state of the task up to the current time step.
8. The method of claim 7, wherein for each time step preceding the current time step, a representation of the state of the task up to the previous time step represents: (i) A respective observation characterizing a state of the environment at the time step, and (ii) a respective action performed by the agent at the time step.
9. The method of any of claims 2-8, wherein at a first time step in the sequence of time steps, including the representation of the current observation as a sequence of data elements in a current representation of the state of the task up to the current time step comprises:
receiving a hint, the hint comprising data characterizing the task to be performed by the agent in the environment;
generating a representation of the hint as a sequence of data elements; and
concatenating the representation of the current observation as a sequence of data elements to the representation of the hint as a sequence of data elements to generate the current representation of the state of the task up to the current time step.
10. The method of claim 9, wherein the hint comprises one or more of: a demonstration of the task, a target observation characterizing a target state of the environment, or a text sequence in natural language providing instructions related to the task.
11. The method of any preceding claim, wherein the action selection neural network has been trained based on a set of training examples, wherein for each training example:
the training examples are represented as a sequence of data elements;
at least one data element in the sequence of data elements representing a training example is designated as an action data element; and
training the action selection neural network based on the training examples includes training the action selection neural network to generate action data elements included in the training examples.
12. The method of claim 11, wherein the set of training examples comprises respective training examples from a plurality of different control domains, wherein each control domain is associated with: (i) a corresponding agent, (ii) a corresponding environment, and (iii) a corresponding task, wherein each training example from each control domain characterizes interaction of the corresponding agent with the corresponding environment by performing an action to complete the corresponding task.
13. The method of claim 12, wherein the plurality of different control domains includes a first control domain in which observations of the corresponding environment have a first dimension and a second control domain in which observations of the corresponding environment have a second, different dimension.
14. The method of claim 12 or 13, wherein the plurality of different control domains includes a first control domain in which actions performed by the corresponding agents have a first dimension and a second control domain in which actions performed by the corresponding agents have a second, different dimension.
15. The method of any of claims 11-14, wherein the set of training examples includes a plurality of language modeling training examples, wherein each language modeling training example represents a text sequence of a natural language.
16. A method according to any preceding claim, wherein the action selection neural network comprises a plurality of self-attention neural network layers.
17. The method of any preceding claim, wherein, for each position from the first position in the sequence of data elements representing the current action, selecting the data element for the position comprises:
selecting the data element with the highest score under the score distribution.
18. A method according to any preceding claim, wherein, for each time step in the sequence of time steps, the sequence of data elements representing the state of the task up to the current time step comprises: a sequence of values; a sequence of embeddings; or a sequence that includes values at some positions and embeddings at other positions.
19. The method of any preceding claim when dependent on claim 2, wherein the current observation comprises an image, and wherein generating the representation of the current observation as a sequence of data elements comprises:
generating a respective initial tile embedding corresponding to each of a plurality of tiles in the image;
processing the initial tile embeddings using an encoder neural network to generate a respective final tile embedding for each of the plurality of tiles in the image;
wherein each final tile embedding is included as a respective data element in the sequence of data elements representing the current observation.
20. The method of claim 19, wherein generating a respective initial tile embedding corresponding to a tile in the image comprises:
generating a pixel embedding representing the pixels in the tile in the image;
generating a tile location embedding representing a location of the tile in the image; and
generating the initial tile embedding for the tile by combining the pixel embedding and the tile location embedding for the tile.
21. The method of claim 19 or 20, wherein the encoder neural network comprises one or more self-attention neural network layers.
22. The method of claim 19, 20 or 21, wherein the encoder neural network comprises one or more residual blocks.
23. A method according to any preceding claim, wherein the agent is a mechanical agent that interacts with a real world environment.
24. The method of claim 23, wherein selecting an action to be performed by the mechanical agent comprises selecting an action to cause the mechanical agent to physically manipulate one or more objects in the environment.
25. A system, comprising:
one or more computers; and
one or more storage devices communicatively coupled to the one or more computers, wherein the one or more storage devices store instructions that, when executed by the one or more computers, cause the one or more computers to perform operations of the respective method of any one of claims 1-24.
26. One or more non-transitory computer storage media storing instructions which, when executed by one or more computers, cause the one or more computers to perform the operations of the respective method of any one of claims 1-24.
CN202280057653.6A 2021-08-24 2022-08-12 Autoregressively generating a sequence of data elements defining an action to be performed by an agent Pending CN117859135A (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US17/410,689 2021-08-24
US202263341343P 2022-05-12 2022-05-12
US63/341,343 2022-05-12
PCT/EP2022/072731 WO2023025607A1 (en) 2021-08-24 2022-08-12 Autoregressively generating sequences of data elements defining actions to be performed by an agent

Publications (1)

Publication Number Publication Date
CN117859135A true CN117859135A (en) 2024-04-09

Family

ID=90542379

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202280057653.6A Pending CN117859135A (en) 2021-08-24 2022-08-12 Autoregressively generating a sequence of data elements defining an action to be performed by an agent

Country Status (1)

Country Link
CN (1) CN117859135A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination