
Generating spatial embeddings by integrating agent motion and optimizing a predictive objective

Info

Publication number
EP4107669A1
EP4107669A1
Authority
EP
European Patent Office
Prior art keywords
spatial
embedding
neural network
time step
current
Prior art date
Legal status
Pending
Application number
EP21726089.2A
Other languages
English (en)
French (fr)
Inventor
Benigno URIA-MARTÍNEZ
Andrea BANINO
Borja IBARZ GABARDOS
Vinicius ZAMBALDI
Charles BLUNDELL
Current Assignee
DeepMind Technologies Ltd
Original Assignee
DeepMind Technologies Ltd
Priority date
Filing date
Publication date
Application filed by DeepMind Technologies Ltd filed Critical DeepMind Technologies Ltd
Publication of EP4107669A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/092 Reinforcement learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks

Definitions

  • This specification relates to processing data using machine learning models.
  • Machine learning models receive an input and generate an output, e.g., a predicted output, based on the received input.
  • Some machine learning models are parametric models and generate the output based on the received input and on values of the parameters of the model.
  • Some machine learning models are deep models that employ multiple layers of models to generate an output for a received input.
  • A deep neural network is a deep machine learning model that includes an output layer and one or more hidden layers that each apply a non-linear transformation to a received input to generate an output.
  • This specification generally describes a system and method implemented as computer programs on one or more computers in one or more locations for training a spatial embedding neural network having a set of spatial embedding neural network parameters.
  • the spatial embedding neural network is configured to process data characterizing motion of an agent that is interacting with an environment to generate spatial embeddings.
  • an “embedding” refers to an ordered collection of numerical values, e.g., a vector or matrix of numerical values.
  • an example method comprises, for each of a plurality of time steps, obtaining, e.g. inputting, and processing data characterizing the motion of the agent in the environment at the current time step using a spatial embedding neural network, e.g. a recurrent neural network, to generate a current spatial embedding for the current time step.
  • the method determines a predicted score and a target score for each of a plurality of slots in an external memory, wherein each slot stores: (i) a representation of an observation characterizing a state of the environment, and (ii) a spatial embedding.
  • the predicted score for each slot measures a similarity between: (i) the current spatial embedding, and (ii) the spatial embedding corresponding to the slot.
  • the target score for each slot measures a similarity between: (i) a current observation characterizing the state of the environment at the current time step, and (ii) the observation corresponding to the slot.
  • the method determines an update to values of the set of spatial embedding neural network parameters based on an error between the predicted scores and the target scores.
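As an illustration (not part of the patent disclosure), the per-slot scoring and training signal described above can be sketched as follows; the use of softmax-normalized dot products as the similarity measures, and the array shapes, are assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def slot_scores(current_spatial, current_obs_emb, memory_spatial, memory_obs):
    # Predicted scores: similarity between the current spatial embedding and
    # each slot's stored spatial embedding.
    predicted = softmax(memory_spatial @ current_spatial)   # shape (num_slots,)
    # Target scores: similarity between the current observation embedding and
    # each slot's stored observation representation.
    target = softmax(memory_obs @ current_obs_emb)          # shape (num_slots,)
    return predicted, target

def cross_entropy(predicted, target, eps=1e-9):
    # The error between predicted and target scores that drives the update to
    # the spatial embedding neural network parameters (e.g., by backpropagation).
    return -np.sum(target * np.log(predicted + eps))

rng = np.random.default_rng(0)
memory_spatial = rng.normal(size=(8, 16))   # 8 slots, 16-dim spatial embeddings
memory_obs = rng.normal(size=(8, 32))       # 8 slots, 32-dim observation embeddings
pred, targ = slot_scores(rng.normal(size=16), rng.normal(size=32),
                         memory_spatial, memory_obs)
loss = cross_entropy(pred, targ)
```

In an actual system the gradient of this loss with respect to the network parameters, rather than the loss value itself, would be used for the update.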
  • the method further comprises, for each of the plurality of time steps, processing the current observation and the current spatial embedding using an action selection neural network to generate an action selection output, and selecting an action to be performed by the agent at the current time step using the action selection output.
  • the environment is a real-world environment
  • the agent is a mechanical agent navigating through the real-world environment
  • the actions control the movement of the agent in the environment, i.e. the action selection system selects actions to enable the agent to perform a task that involves navigating through the environment.
  • the environment is a real-world environment and the agent is a mechanical agent
  • the environment is a simulated environment and the agent is implemented as one or more computer programs.
  • the method may comprise using the trained spatial embedding neural network to enable a mechanical agent to navigate through a new, real-world environment. That is, the spatial embedding neural network may be trained in the real world or in simulation, but the trained spatial embedding neural network may then be used in the real world.
  • Navigating through the new, real-world environment may comprise processing data characterizing the motion of the mechanical agent in the real-world environment using the trained spatial embedding neural network to generate spatial embeddings.
  • An action selection system, in particular an action selection neural network of the action selection system, may be used to process the spatial embeddings to select actions to be performed by the mechanical agent, controlling the movement of the agent as it navigates through the new, real-world environment.
  • a method performed by one or more data processing apparatus for training a spatial embedding neural network having a set of spatial embedding neural network parameters that is configured to process data characterizing motion of an agent that is interacting with an environment to generate spatial embeddings comprising, for each of a plurality of time steps: processing data characterizing the motion of the agent in the environment at the current time step using a spatial embedding neural network to generate a current spatial embedding for the current time step; determining a predicted score and a target score for each of a plurality of slots in an external memory, wherein each slot stores: (i) a representation of an observation characterizing a state of the environment, and (ii) a spatial embedding, wherein the predicted score for each slot measures a similarity between: (i) the current spatial embedding, and (ii) the spatial embedding corresponding to the slot, wherein the target score for each slot measures a similarity between: (i) a current observation characterizing the state of the environment at the current time step, and (ii) the observation corresponding to the slot; and determining an update to values of the set of spatial embedding neural network parameters based on an error between the predicted scores and the target scores.
  • the data characterizing the motion of the agent in the environment at the current time step comprises one or more of: speed data characterizing a speed of the agent at the current time step, angular velocity data characterizing an angular velocity of the agent at the current time step, or translational velocity data characterizing a translational velocity of the agent at the current time step.
  • the current observation characterizing the state of the environment at the current time step comprises an image.
  • the image is captured from a perspective of the agent at the current time step.
  • determining the target score for each slot in the external memory comprises: obtaining respective embeddings of the current observation characterizing the current state of the environment and the observation corresponding to the slot; and determining the target score based on a similarity measure between: (i) the embedding of the current observation characterizing the current state of the environment, and (ii) the embedding of the observation corresponding to the slot.
  • obtaining the embedding of the current observation comprises processing the current observation using an embedding neural network.
  • the error between the predicted scores and the target scores comprises a cross-entropy error between the predicted scores and the target scores.
  • the method further comprises determining an update to the spatial embeddings stored in the external memory based on the error between the predicted scores and the target scores.
  • the spatial embedding neural network does not process the current observation to generate the current spatial embedding for the current time step.
  • the method further comprises storing a representation of the current observation and the current spatial embedding in a slot in the external memory.
  • the method further comprises processing second data characterizing the motion of the agent in the environment at the current time step using a second spatial embedding neural network having a set of second spatial embedding neural network parameters to generate a second current spatial embedding for the current time step, wherein each slot in the external memory also stores a second spatial embedding, wherein for each slot in the external memory, the predicted score for the slot additionally measures a similarity between: (i) the second current spatial embedding, and (ii) the second spatial embedding corresponding to the slot; and determining an update to values of the set of second spatial embedding neural network parameters based on the error between the predicted scores and the target scores.
  • the data characterizing the motion of the agent that is processed by the spatial embedding neural network is a proper subset of the second data characterizing the motion of the agent that is processed by the second spatial embedding neural network.
  • determining the predicted score for the slot comprises determining a product of: (i) a similarity measure between the current spatial embedding and the spatial embedding corresponding to the slot, and (ii) a similarity measure between the second current spatial embedding and the second spatial embedding corresponding to the slot.
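A minimal sketch of the product-of-similarities predicted score described above, for the case of two (or more) spatial embedding neural networks; the softmax-normalized dot product as the per-network similarity measure, and the final renormalization over slots, are assumed choices:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def combined_predicted_scores(current_embeddings, memory_embeddings):
    """For each slot, the predicted score is the product of the per-network
    similarity measures: one factor per spatial embedding neural network.
    current_embeddings: one current embedding per network.
    memory_embeddings: per network, a (num_slots, dim) array of stored embeddings."""
    num_slots = memory_embeddings[0].shape[0]
    scores = np.ones(num_slots)
    for current, memory in zip(current_embeddings, memory_embeddings):
        scores *= softmax(memory @ current)   # assumed similarity measure
    return scores / scores.sum()              # renormalize over slots

rng = np.random.default_rng(1)
current = [rng.normal(size=8), rng.normal(size=8)]            # one per network
memory = [rng.normal(size=(5, 8)), rng.normal(size=(5, 8))]   # 5 slots each
combined = combined_predicted_scores(current, memory)
```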
  • the method further comprises, for each of the plurality of time steps: processing the current observation and the current spatial embedding using an action selection neural network to generate an action selection output; and selecting an action to be performed by the agent at the current time step using the action selection output.
  • the action selection output comprises a respective score for each action in a predetermined set of actions.
  • selecting the action to be performed by the agent at the current time step comprises selecting an action having a highest score.
  • the action selection neural network is trained using reinforcement learning techniques to encourage the agent to perform a task in the environment.
  • the task is a navigation task.
  • the action selection neural network additionally processes a goal spatial embedding that was generated by the spatial embedding neural network at a time step when the agent was located in a goal location in the environment.
  • the spatial embedding neural network comprises a recurrent neural network
  • generating the current spatial embedding for the current time step comprises: processing: (i) the data characterizing the motion of the agent in the environment at the current time step, and (ii) an output of the spatial embedding neural network at a preceding time step, using the spatial embedding neural network to update a hidden state of the spatial embedding neural network, wherein the updated hidden state defines the current spatial embedding.
  • the method further comprises determining an output of the spatial embedding neural network for the current time step, comprising: identifying the updated hidden state of the spatial embedding neural network as the output of the spatial embedding neural network for the current time step.
  • the method further comprises determining an output of the spatial embedding neural network for the current time step, comprising: determining a respective weight value for each slot in the external memory that characterizes a similarity between: (i) the current observation characterizing the state of the environment at the current time step, and (ii) the observation corresponding to the slot; determining a correction embedding as a linear combination of the spatial embeddings corresponding to the slots in the external memory, wherein each spatial embedding is weighted by the corresponding weight value; determining the output of the spatial embedding neural network based on: (i) the updated hidden state of the spatial embedding neural network, and (ii) the correction embedding.
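The correction-embedding step above can be sketched as follows (illustrative only): the stored spatial embeddings are combined linearly with observation-similarity weights, then mixed with the updated hidden state. The fixed mixing factor `alpha` is an assumption; the patent only specifies that the output is based on both quantities:

```python
import numpy as np

def output_with_correction(updated_hidden, obs_weights, memory_spatial, alpha=0.5):
    # Correction embedding: a linear combination of the stored spatial
    # embeddings, each weighted by how similar its stored observation is to
    # the current observation (obs_weights sums to 1 over slots).
    correction = obs_weights @ memory_spatial        # shape (embedding_dim,)
    # Combine the updated hidden state with the correction embedding.
    return (1.0 - alpha) * updated_hidden + alpha * correction

rng = np.random.default_rng(2)
memory_spatial = rng.normal(size=(6, 16))   # 6 slots, 16-dim spatial embeddings
weights = np.full(6, 1.0 / 6.0)             # e.g., uniform observation similarity
output = output_with_correction(rng.normal(size=16), weights, memory_spatial)
```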
  • the method further comprises: processing data characterizing the motion of the agent in the environment at the current time step using an integrated embedding neural network to generate a current integrated embedding for the current time step; determining a predicted score and a target score for each of a plurality of slots in an additional external memory, wherein each slot stores: (i) a spatial embedding, and (ii) an integrated embedding, wherein the predicted score for each slot measures a similarity between: (i) the current integrated embedding, and (ii) the integrated embedding corresponding to the slot, wherein the target score for each slot measures a similarity between: (i) the current spatial embedding, and (ii) the spatial embedding corresponding to the slot; and determining an update to values of the set of integrated embedding neural network parameters based on an error between the predicted scores and the target scores for the slots in the additional external memory.
  • one or more (non-transitory) computer storage media storing instructions that when executed by one or more computers cause the one or more computers to perform the operations of the respective methods described herein.
  • a system comprising: one or more computers; and one or more storage devices communicatively coupled to the one or more computers, wherein the one or more storage devices store instructions that, when executed by the one or more computers, cause the one or more computers to perform the operations of the respective methods described herein.
  • the system described in this specification can train a spatial embedding neural network that continuously (i.e., over multiple time steps) processes data characterizing the motion (e.g., angular and translational velocity) of an agent to generate a respective spatial embedding for each time step.
  • the system can train the spatial embedding neural network to process the motion data to generate spatial embeddings that are predictive of observations characterizing the state of the environment, e.g., images of the environment captured by a camera of the agent.
  • Spatial embeddings generated by the spatial embedding neural network can implicitly characterize the position of the agent in the environment.
  • An action selection system can process spatial embeddings generated by the spatial embedding neural network to select actions to solve tasks, e.g., that involve navigating through complex, unfamiliar, and changing environments. Processing spatial embeddings generated by the spatial embedding neural network can enable the action selection system to solve tasks more efficiently (e.g., quickly) than it otherwise would, because the spatial embeddings encode rich spatial information content and provide an efficient basis-set for representing spatial information. For example, processing the spatial embeddings can enable the action selection system to select actions that cause the agent to navigate to a goal location using direct (or approximately direct) routes that can cover areas of the environment that were not previously visited by the agent.
  • processing the spatial embeddings can enable the action selection system to exploit the rich spatial information encoded in the spatial embeddings to select actions that cause the agent to take shortcuts that result in the agent reaching goal locations (or otherwise accomplishing tasks) more efficiently than some other systems.
  • Processing spatial embeddings generated by the spatial embedding neural network can also enable the action selection system to be trained to reach an acceptable level of performance (i.e., in solving tasks) over fewer training iterations than some other systems, thereby reducing consumption of computational resources during training.
  • Computational resources can include, e.g., memory and computing power.
  • the spatial embedding neural network can generate spatial embeddings that enable the agent to efficiently navigate through new (i.e., previously unseen) environments without the spatial embedding neural network being retrained on training data characterizing interaction of the agent with the new environments.
  • the system described in this specification can jointly train multiple spatial embedding neural networks, each of which can be configured to process a different set of data characterizing the motion of the agent at each time step.
  • one spatial embedding neural network can be configured to process data characterizing the angular velocity of the agent at each time step
  • another spatial embedding neural network can be configured to process data characterizing both the angular velocity and the translational velocity of the agent at each time step.
  • the spatial embeddings generated by each spatial embedding neural network can have different properties and be complementary to one another, e.g., one might generate spatial embeddings that depend substantially on the heading of the agent, while another might generate spatial embeddings that depend substantially on the distance of the agent to other objects in the environment at a particular heading.
  • the set of spatial embeddings generated by the respective spatial embedding neural networks can collectively characterize the position of the agent in a variety of complementary ways. Processing the set of spatial embeddings generated by the spatial embedding neural networks can enable the action selection system to select actions that allow the agent to accomplish tasks more effectively.
  • FIG. 1 is a block diagram of an example action selection system.
  • FIG. 2 is a block diagram of an example spatial embedding training system.
  • FIG. 3 shows a data flow that illustrates operations performed by a spatial embedding training system.
  • FIG. 4 is a flow diagram of an example process for training one or more spatial embedding neural networks.
  • FIG. 5 is a flow diagram of an example process for generating a spatial embedding for a time step using a spatial embedding neural network that has a recurrent neural network architecture.
  • FIG. 1 is a block diagram of an example action selection system 100.
  • the action selection system 100 is an example of a system implemented as computer programs on one or more computers in one or more locations in which the systems, components, and techniques described below are implemented.
  • the system 100 selects actions 110 to be performed by an agent 112 interacting with an environment 116 at each of multiple time steps to perform a task that involves navigating through the environment.
  • the task can be, e.g., navigating through the environment to locate an object in the environment, navigating through the environment to reach a specified destination in the environment (referred to as a “goal location”), or navigating through the environment to visit as many locations in the environment as possible (e.g., to explore the environment).
  • the environment is a real-world environment and the agent is a mechanical agent navigating through the real-world environment.
  • the agent may be a robot or an autonomous or semi-autonomous land, sea, or air vehicle.
  • the environment is a simulated environment and the agent is implemented as one or more computer programs interacting with the simulated environment.
  • the simulated environment can be a motion simulation environment, e.g., a driving simulation or a flight simulation, and the agent can be a simulated vehicle navigating through the motion simulation.
  • the system 100 receives motion data 114 characterizing the current motion of the agent in the environment at the time step and an observation 120 characterizing the current state of the environment at the time step.
  • the system 100 processes the motion data 114 and the observation 120 for the time step to select an action to be performed by the agent 112 at the time step.
  • the motion data 114 at each time step can include one or more of: speed data characterizing a speed of the agent at the time step, angular velocity data characterizing an angular velocity of the agent at the time step, or translational velocity data characterizing a translational velocity of the agent at the time step.
  • the speed data can be represented as one or more scalar values, e.g., representing the speed of the agent in meters per second or any other appropriate unit, or, e.g., as the sine and cosine of the angular velocity in radians per second.
  • the angular velocity data can be represented, e.g., as a scalar value representing the rate at which the agent rotates about a vertical axis in radians per second, or any other appropriate unit.
  • the translational velocity data can be represented as a two-dimensional (2D) vector [u, v], e.g., where u and v are expressed in units of meters per second, or any other appropriate unit.
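The motion data described above can be packed into a single input vector for the spatial embedding neural network. The following sketch is illustrative; the exact packing and ordering are assumptions, not the patent's specification:

```python
import math

def motion_features(speed, angular_velocity, translational_velocity):
    # Scalar speed, the sine and cosine of the angular velocity (one
    # representation noted above), and a 2D translational velocity [u, v].
    u, v = translational_velocity
    return [speed, math.sin(angular_velocity), math.cos(angular_velocity), u, v]

features = motion_features(1.5, 0.0, (0.3, -0.2))  # → [1.5, 0.0, 1.0, 0.3, -0.2]
```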
  • the observation 120 at each time step can be generated by or derived from sensors of the agent at the time step.
  • the observation at the time step can include data characterizing the visual appearance or geometry of the environment from the perspective of the agent at the time step, e.g., by one or more images (e.g., color images) captured by a camera sensor of the agent, one or more hyperspectral images captured by a hyperspectral sensor of the agent, or images in the form of geometric data (e.g., a 3D point cloud) captured by a laser sensor of the agent (e.g., a Lidar sensor), or a combination thereof.
  • the observation at each time step can be a simulated observation characterizing the visual appearance or geometry of the simulated environment from the perspective of the agent at the time step.
  • the action 110 performed by the agent at each time step can control the movement of the agent in the environment, e.g., by changing the translational velocity of the agent, the angular velocity of the agent, or both.
  • the actions can be represented, e.g., as control signals to control the agent.
  • Each action can represent, e.g., a respective torque that should be applied to a joint of the agent, an acceleration action to change the acceleration of the agent, or a steering action to change the heading of the agent.
  • the actions can be multi-dimensional actions, e.g., such that each action includes both a respective acceleration control signal and a respective steering control signal.
  • the system 100 can receive a reward 118 based on the current state of the environment 116 and the action 110 performed by the agent 112 at the time step.
  • the reward 118 can be represented as a numerical value.
  • the reward 118 can indicate whether the agent 112 has accomplished a task in the environment, or the progress of the agent 112 towards accomplishing a task in the environment. For example, if the task specifies that the agent should navigate through the environment to a goal location, then the reward at each time step may have a positive value once the agent reaches the goal location, and a zero value otherwise. As another example, if the task specifies that the agent should explore the environment, then the reward at a time step may have a positive value if the agent navigates to a previously unexplored location at the time step, and a zero value otherwise.
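The sparse goal-reaching reward described above can be sketched as follows; the Euclidean distance threshold used to decide whether the agent has reached the goal is an illustrative assumption:

```python
def navigation_reward(agent_position, goal_position, tolerance=0.5):
    # Positive reward once the agent reaches the goal location, zero otherwise.
    distance = sum((a - g) ** 2
                   for a, g in zip(agent_position, goal_position)) ** 0.5
    return 1.0 if distance <= tolerance else 0.0

at_goal = navigation_reward((0.0, 0.1), (0.0, 0.0))    # → 1.0
far_away = navigation_reward((5.0, 5.0), (0.0, 0.0))   # → 0.0
```

An exploration reward would be analogous, paying out when the agent's current position is not within the tolerance of any previously visited position.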
  • the action selection system 100 includes one or more spatial embedding neural networks 102, a spatial embedding training system 200, and an action selection neural network 106, which are each described in more detail next.
  • Each spatial embedding neural network 102 is configured to, at each time step, process a subset, e.g. a so-called proper subset, of the agent motion data 114 for the time step to generate a respective spatial embedding 104 for the time step.
  • a spatial embedding 104 generated by a spatial embedding neural network 102 is an embedding that implicitly characterizes the position of the agent in the environment at the time step.
  • a single spatial embedding neural network can process all of the agent motion data 114 at each time step.
  • each spatial embedding neural network can process a different subset of the agent motion data 114 at each time step.
  • one spatial embedding neural network can be configured to process data characterizing the angular velocity of the agent at each time step, and another, second spatial embedding neural network can be configured to process data characterizing both the angular velocity and the translational velocity of the agent at each time step.
  • the spatial embedding neural networks can have different properties and be complementary to one another. For example, one, receiving e.g. angular velocity data, might generate spatial embeddings that depend substantially on the heading of the agent, while another, receiving e.g. angular and translational velocity data, might generate spatial embeddings that depend substantially on the distance of the agent to other objects in the environment at a particular heading.
  • a spatial embedding neural network that receives agent motion data comprising only angular velocity can generate spatial embeddings that encode the heading of the agent e.g. in which activation of neural network units generating the spatial embeddings have an “activation bump” that encodes the heading.
  • the spatial embeddings, in particular such an activation bump, may encode the heading relative to a visual cue in the environment.
  • Each spatial embedding neural network 102 can have any appropriate neural network architecture that enables it to perform its described function, i.e., processing agent motion data 114 to generate corresponding spatial embeddings 104.
  • each spatial embedding neural network 102 can include any appropriate types of neural network layers (e.g., recurrent layers, attention layers, fully-connected layers, convolutional layers, etc.) in any appropriate numbers (e.g., 5 layers, 25 layers, or 125 layers), and connected in any appropriate configuration (e.g., as a linear sequence of layers).
  • each spatial embedding neural network 102 can be a recurrent neural network (e.g., a neural network with one or more recurrent neural network layers, e.g., long short-term memory (LSTM) layers, or any other appropriate recurrent neural network layers) that maintains a respective hidden state.
  • the hidden state of the spatial embedding neural network 102 at each time step can define the spatial embedding 104 generated by the spatial embedding neural network 102 at the time step.
  • Each spatial embedding neural network can update its hidden state at each time step by processing: (i) agent motion data 114 for the time step, and (ii) data generated by the spatial embedding neural network 102 at the previous time step.
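The recurrent update described above can be sketched minimally as follows; a plain tanh cell stands in for the LSTM layers mentioned below, and the weight shapes and initialization are illustrative assumptions:

```python
import numpy as np

def rnn_step(prev_hidden, motion_input, W_h, W_x, b):
    # One recurrent update: the new hidden state, which defines the spatial
    # embedding for the time step, is computed from the previous hidden state
    # and the motion data for the time step.
    return np.tanh(W_h @ prev_hidden + W_x @ motion_input + b)

rng = np.random.default_rng(3)
W_h = rng.normal(size=(8, 8)) * 0.1   # recurrent weights (8-dim hidden state)
W_x = rng.normal(size=(8, 5)) * 0.1   # input weights (5-dim motion features)
b = np.zeros(8)

hidden = np.zeros(8)                  # initial hidden state
for _ in range(3):                    # a few time steps of motion data
    hidden = rnn_step(hidden, rng.normal(size=5), W_h, W_x, b)
```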
  • An example process for processing agent motion data 114 for a time step using a spatial embedding neural network 102 implemented as a recurrent neural network is described in more detail with reference to FIG. 5.
  • where the spatial embedding neural network is implemented as a recurrent neural network with multiple recurrent neural network layers that each maintain a respective hidden state, the hidden state of the spatial embedding neural network can be understood as the concatenation of the respective hidden states of one or more of the recurrent neural network layers.
  • the action selection neural network 106 receives an input that includes: (i) the current spatial embeddings 104 generated by the spatial embedding neural networks 102 at the time step, and (ii) the current observation 120 characterizing the state of the environment at the time step.
  • the input received by the action selection neural network 106 can include additional data, e.g., the reward 118 received at the previous time step, a representation of the action 110 performed at the previous time step, or both.
  • the task being performed by the agent involves repeatedly navigating to a “goal” location in the environment, and the input received by the action selection neural network 106 can include “goal” spatial embeddings.
  • the goal spatial embeddings can be spatial embeddings that were generated by the spatial embedding neural networks 102 at a previous time step when the agent was located at the goal location in the environment.
  • the action selection neural network 106 processes its input to generate an action selection output 108, and the system 100 selects the action 110 to be performed by the agent 112 at the time step based on the action selection output 108.
  • the action selection output 108 can include a respective action score for each action in a set of possible actions, and the system 100 can select the action 110 to be performed by the agent at the time step using the action scores.
  • the system 100 can select the action having the highest action score as the action to be performed at the time step.
  • the system 100 can process the action scores (e.g., using a soft-max function) to determine a probability distribution over the set of possible actions, and then sample the action to be performed at the time step in accordance with the probability distribution.
  • the system 100 can select the action to be performed at each time step in accordance with an exploration policy, e.g., an ε-greedy exploration policy.
  • under an ε-greedy exploration policy, the system 100 selects an action randomly from the set of possible actions with probability ε, and selects an action using the action selection output 108 for the time step with probability 1 − ε (where ε > 0 is a small positive value).
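The epsilon-greedy policy above, combined with the softmax sampling described earlier, can be sketched as follows (illustrative only; the choice to sample from a softmax rather than take the arg-max on the exploit branch is one of the options the text describes):

```python
import numpy as np

def select_action(action_scores, epsilon=0.1, rng=None):
    # With probability epsilon pick a uniformly random action (explore);
    # otherwise sample from a softmax distribution over the action scores
    # produced by the action selection neural network (exploit).
    rng = rng if rng is not None else np.random.default_rng()
    n = len(action_scores)
    if rng.random() < epsilon:
        return int(rng.integers(n))
    probs = np.exp(action_scores - np.max(action_scores))
    probs /= probs.sum()
    return int(rng.choice(n, p=probs))

action = select_action(np.array([0.1, 2.0, -1.0]), epsilon=0.1,
                       rng=np.random.default_rng(0))
```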
  • Selecting actions to be performed by the agent in accordance with an exploration policy can enable the agent to explore the environment rapidly and thereby generate a higher diversity of training data that can facilitate more effective training of the action selection neural network 106.
  • the action selection neural network 106 can have any appropriate neural network architecture that enables it to perform its described functions, e.g., processing spatial embeddings and observations to generate action selection outputs for use in selecting actions to be performed by the agent.
  • the action selection neural network architecture can include any appropriate types of neural network layers (e.g., recurrent layers, attention layers, fully-connected layers, convolutional layers, etc.) in any appropriate numbers (e.g., 5 layers, 25 layers, or 125 layers), and connected in any appropriate configuration (e.g., as a linear sequence of layers).
  • the spatial embedding training system 200 is configured to train the spatial embedding neural networks 102 to generate spatial embeddings that encode rich spatial information content and provide an efficient basis-set for representing spatial information.
  • the training system 200 trains the spatial embedding neural networks 102 to process data characterizing the motion of the agent to generate spatial embeddings 104 that are predictive of observations characterizing the state of the environment.
  • the training system 200 trains the spatial embedding neural networks 102 to generate spatial embeddings, based on the motion of the agent, that are predictive of the visual or geometric appearance of the environment from the perspective of the agent.
  • An example of a spatial embedding training system 200 for training the spatial embedding neural networks 102 is described in more detail with reference to FIG. 2.
  • Processing the spatial embeddings 104 generated by the spatial embedding neural networks 102 can enable the system 100 to select actions to efficiently solve complex navigation tasks, e.g., that involve navigating through unfamiliar and changing environments.
  • processing the spatial embeddings can enable the system 100 to select actions that cause the agent to navigate to a goal location using direct (or approximately direct) routes that can cover areas of the environment that were not previously visited by the agent.
  • the spatial embedding training system 200 can train the spatial embedding neural networks 102 based on trajectories representing agent interaction with one or more environments. Each trajectory can include, for each time step, data representing the agent motion at the time step and the observation of the state of the environment at the time step. After being trained based on trajectories representing agent interaction with one or more environments, the spatial embedding neural networks 102 can be used by the action selection system to control an agent interacting with a new environment without being retrained on trajectories representing agent interaction with the new environment. That is, the trained parameter values of the spatial embedding neural networks 102 can generalize to new environments without being retrained based on agent interaction with the new environment.
  • the system 100 trains the action selection neural network 106 using a reinforcement learning technique to select actions that increase a cumulative measure of rewards (e.g., a time- discounted sum of rewards) received by the system 100 as a result of the interaction of the agent with the environment. More specifically, the system 100 trains the action selection neural network 106 by iteratively adjusting the values of some or all of the parameters of the action selection neural network 106 using gradients of a reinforcement learning objective function.
  • the system 100 can train the action selection neural network 106 using any appropriate reinforcement learning techniques, e.g., actor-critic techniques or Q-learning techniques.
  • the system 100 can train the action selection neural network 106 independently of the spatial embedding neural networks 102, e.g., such that gradients of the reinforcement learning objective function are not backpropagated into the spatial embedding neural networks 102.
  • the system 100 can be used to control an agent interacting with either a simulated environment or a real-world environment as described above.
  • the system 100 can be used to control an agent interacting with a simulated environment, and the system 100 (in particular, the spatial embedding neural networks 102 and the action selection neural network 106) can be trained based on the agent interaction with the simulated environment.
  • the agent can then be deployed in a real-world environment, and the trained system 100 can be used to control the interaction of the agent with the real-world environment.
  • Training the system 100 based on interactions of the agent with a simulated environment can avoid wear-and-tear on the agent and can reduce the likelihood that, by performing poorly chosen actions, the agent can damage itself or aspects of its environment.
  • FIG. 2 is a block diagram of an example spatial embedding training system 200.
  • the spatial embedding training system 200 is an example of a system implemented as computer programs on one or more computers in one or more locations in which the systems, components, and techniques described below are implemented.
  • the training system 200 trains one or more spatial embedding neural networks 102, e.g., that are included in an action selection system, as described with reference to FIG. 1.
  • Each spatial embedding neural network 102 is configured to process respective agent motion data 202 characterizing the motion of an agent in an environment at a time step to generate a spatial embedding 208 that implicitly characterizes the position of the agent in the environment at the time step.
  • the training system 200 includes an observation embedding neural network 206, an external memory 220, a scoring engine 212, and a training engine 218, which are each described in more detail next.
  • the observation embedding neural network 206 is configured to process an observation 204 characterizing the state of the environment to generate an embedding 210 of the observation 204.
  • the training system 200 can train the observation embedding neural network 206 to perform an auto-encoding task on a training set of environment observations before using the observation embedding neural network as part of training the spatial embedding neural networks 102.
  • the observation embedding neural network 206 processes an observation to generate a corresponding observation embedding that, when processed by a “decoder” neural network, enables reconstruction of the original observation.
  • training of the spatial embedding neural network(s) 102 uses a trained observation embedding neural network 206.
  • the observation embedding neural network can have any appropriate neural network architecture that enables it to perform its described function, e.g., processing observations to generate observation embeddings.
  • the observation embedding neural network architecture can include any appropriate types of neural network layers (e.g., attention layers, fully- connected layers, convolutional layers, etc.) in any appropriate numbers (e.g., 5 layers, 25 layers, or 125 layers), and connected in any appropriate configuration (e.g., as a linear sequence of layers).
  • the external memory 220 includes a set of logical data storage spaces, referred to as “slots” 222.
  • Each slot corresponds to a respective time step during the interaction of the agent with the environment, and stores: (i) an observation embedding 226 for the time step, and (ii) a respective spatial embedding 224 for the time step corresponding to each spatial embedding neural network 102.
  • the training system 200 generates the respective observation embedding 226 stored in each slot of the external memory 220 by processing the observation for the corresponding time step using the observation embedding neural network 206.
  • the training system generates the respective spatial embeddings 224 stored in each slot of the external memory by processing agent motion data for the corresponding time step using the spatial embedding neural networks 102.
  • the training system 200 can modify the spatial embeddings stored in the slots of the external memory over the course of training, e.g., using gradients of an objective function, as will be described in more detail below.
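The slot structure described above can be sketched as a simple container in which each slot pairs one observation embedding with one spatial embedding per spatial embedding network. All names and the list-based storage are illustrative assumptions, not from the specification.

```python
import numpy as np

class ExternalMemory:
    """Slot-based external memory: each slot holds (i) an observation
    embedding and (ii) one spatial embedding per spatial embedding network."""

    def __init__(self, obs_dim, spatial_dims):
        self.obs_dim = obs_dim
        self.spatial_dims = list(spatial_dims)  # one dimensionality per network
        self.obs_embeddings = []                # one vector per slot
        self.spatial_embeddings = []            # one list of vectors per slot

    def write(self, obs_embedding, spatial_embeddings):
        """Append a new slot for the current time step."""
        assert len(obs_embedding) == self.obs_dim
        assert [len(e) for e in spatial_embeddings] == self.spatial_dims
        self.obs_embeddings.append(np.asarray(obs_embedding, dtype=float))
        self.spatial_embeddings.append(
            [np.asarray(e, dtype=float) for e in spatial_embeddings])

    def __len__(self):
        return len(self.obs_embeddings)
```

Because the spatial embeddings in each slot may be updated by gradients during training, a real implementation would store them as trainable parameters rather than plain arrays.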
  • the training system 200 receives: (i) agent motion data 202 characterizing the motion of the agent at the time step, and (ii) an observation 204 characterizing the state of the environment at the time step.
  • the training system 200 provides the agent motion data 202 for the time step to the spatial embedding neural networks 102, and each spatial embedding neural network 102 processes a respective input based on the agent motion data 202 to generate a respective current spatial embedding 208.
  • the training system 200 provides the observation 204 for the time step to the observation embedding neural network 206, and the observation embedding neural network 206 processes the observation 204 to generate a current observation embedding 210.
  • the scoring engine 212 generates: (i) a respective target score 216, and (ii) a respective predicted score 214, for each slot in the external memory 220 based on the current observation embedding 210 and the current spatial embeddings 208.
  • the target score 216 for each slot in the external memory 220 characterizes a similarity between: (i) the current observation embedding 210, and (ii) the observation embedding stored in the slot in the external memory 220.
  • the scoring engine 212 can generate the target score 216 for each slot based on a similarity measure (e.g., a cosine similarity measure, a Euclidean similarity measure, or any other appropriate similarity measure) between the current observation embedding 210 and the observation embedding stored in the slot in the external memory.
  • the scoring engine 212 can generate the target score T_s for each slot s as:

    T_s = exp(β · y_t^T · m_s) / Σ_{s'} exp(β · y_t^T · m_{s'})     (1)

    where β is a positive scalar parameter, y_t^T is the transpose of the current observation embedding 210, m_s is the observation embedding stored in slot s, and s' ranges over the slots of the external memory.
  • the parameter β is an inverse-temperature parameter which may be chosen to encourage sparse selection of memory slots, so that there is low interference between memories.
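Assuming the target scores form a soft-max distribution over slot similarities as described above, they can be computed as follows. The function name and the example inverse-temperature value are illustrative assumptions.

```python
import numpy as np

def target_scores(current_obs_embedding, memory_obs_embeddings, beta=20.0):
    """Soft-max target scores over memory slots.

    `beta` is the inverse temperature: a large value concentrates the
    distribution on the most similar memories (sparse selection).
    """
    y = np.asarray(current_obs_embedding, dtype=float)
    M = np.asarray(memory_obs_embeddings, dtype=float)  # shape (S, d)
    logits = beta * (M @ y)        # beta * y^T m_s for every slot s
    logits -= logits.max()         # numerical stability; distribution unchanged
    exp_logits = np.exp(logits)
    return exp_logits / exp_logits.sum()
```

The returned vector sums to one and can be compared directly against the predicted scores.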
  • the predicted score 214 for each slot in the external memory 220 characterizes a similarity between: (i) the current spatial embeddings 208, and (ii) the spatial embeddings stored in the slot in the external memory 220.
  • the scoring engine 212 can determine, for each spatial embedding neural network 102, a respective similarity measure between: (i) the current spatial embedding 208 generated by the spatial embedding neural network, and (ii) the spatial embedding corresponding to the spatial embedding neural network that is stored in the slot.
  • the similarity measures can be, e.g., cosine similarity measures, Euclidean similarity measures, or any other appropriate similarity measures.
  • the scoring engine 212 can then determine the predicted score 214 for the slot by aggregating the determined similarity measures between the current spatial embeddings 208 and the spatial embeddings stored in the slot in the external memory 220, e.g., by a product operation, a sum operation, or any other appropriate operation.
  • for example, the scoring engine 212 can generate the predicted score P_s for each slot s as:

    P_s = Π_r exp(x_{r,t}^T · m_s^r) / Σ_{s'} Π_r exp(x_{r,t}^T · m_{s'}^r)     (2)

    where x_{r,t}^T is the transpose of the current spatial embedding generated by spatial embedding neural network r, and m_s^r is the spatial embedding corresponding to spatial embedding neural network r stored in slot s.
  • the training engine 218 receives the predicted scores 214 and the target scores 216 for the time step, and updates the parameter values of the spatial embedding neural network 102 to optimize an objective function that measures an error between the predicted scores 214 and the target scores 216.
  • the objective function can be any appropriate objective function that measures an error between the predicted scores 214 and the target scores, e.g., a cross-entropy objective function L given by:

    L = − Σ_{s=1}^{S} T_s · log(P_s)     (3)

    where s indexes the slots of the external memory, S is the number of (occupied) slots in the external memory, T_s is the target score for slot s, and P_s is the predicted score for slot s.
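Assuming the predicted scores aggregate per-network slot similarities by a product and are normalized into a distribution, and that the training objective is the cross-entropy between target and predicted scores, the two quantities can be sketched as follows (names are illustrative):

```python
import numpy as np

def predicted_scores(current_spatial, memory_spatial):
    """Predicted scores over memory slots.

    current_spatial: list over networks r of current embeddings x_{r,t}
    memory_spatial:  list over networks r of (S, d_r) arrays of stored embeddings
    Per-network similarities are exponentiated, aggregated by a product over
    networks, and normalized over slots.
    """
    per_network = []
    for x_r, M_r in zip(current_spatial, memory_spatial):
        logits = np.asarray(M_r, dtype=float) @ np.asarray(x_r, dtype=float)
        per_network.append(np.exp(logits - logits.max()))  # stable, unnormalized
    unnormalized = np.prod(per_network, axis=0)            # product aggregation
    return unnormalized / unnormalized.sum()

def cross_entropy_loss(target, predicted, eps=1e-12):
    """Cross-entropy between the target and predicted slot distributions."""
    target = np.asarray(target, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(-np.sum(target * np.log(predicted + eps)))
```

In training, the loss would be minimized by gradient descent with respect to the spatial embedding network parameters (and, optionally, the stored spatial embeddings).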
  • the training engine 218 can determine gradients of the objective function with respect to the spatial embedding neural network parameters, e.g., using backpropagation. The training engine 218 can then use the gradients to update the spatial embedding neural network parameters using any appropriate gradient descent optimization technique, e.g., RMSprop or Adam.
  • the training engine 218 can also update a variety of other system parameters using gradients of the objective function.
  • a learning rate for the spatial embeddings 224 may be higher than for the spatial embedding neural network parameters, e.g. of order 10^-2 rather than 10^-4, as there is low interference between memories.
  • the spatial embedding neural network parameters may be frozen whilst storing, and optionally updating, new spatial embeddings 224 in the external memory 220, optionally also retraining the action selection neural network using reinforcement learning.
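A plain gradient step with the two different learning rates described above, and an option to freeze the network parameters while memory slots are still updated, can be sketched as follows (an illustrative sketch; a real system would use an optimizer such as RMSprop or Adam with parameter groups):

```python
import numpy as np

def sgd_step(network_params, memory_slots, grads_net, grads_mem,
             lr_net=1e-4, lr_mem=1e-2, freeze_net=False):
    """One gradient-descent step with separate learning rates.

    network_params: dict of name -> parameter array
    memory_slots:   list of per-slot spatial embedding arrays
    grads_net, grads_mem: gradients with matching structure
    """
    if not freeze_net:
        for name in network_params:
            network_params[name] = network_params[name] - lr_net * grads_net[name]
    # Memory slots get a larger step, reflecting low interference between memories.
    for s in range(len(memory_slots)):
        memory_slots[s] = memory_slots[s] - lr_mem * grads_mem[s]
    return network_params, memory_slots
```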
  • Training the spatial embedding neural networks to minimize an error between the predicted scores and the target scores encourages the spatial embedding neural networks to generate spatial embeddings that are predictive of observations of the environment. More specifically, the training encourages the spatial embedding neural networks to integrate agent motion data to generate embeddings that are predictive of the visual or geometric appearance of the environment from the perspective of the agent (i.e., as characterized by the observations). In implementations, therefore, the observation embeddings 210 stored in the external memory are not parameters updated by the training engine 218.
  • the training system 200 can train one or more additional neural networks, referred to as “integrated embedding neural networks,” that are each configured to process some or all of the agent motion data 202 at each time step to generate a corresponding embedding, referred to as an “integrated embedding” for the time step.
  • each integrated embedding neural network can process one or more additional inputs at each time step (i.e., in addition to the agent motion data for the time step), e.g., spatial embeddings generated by one or more of the spatial embedding neural networks at the time step.
  • the integrated embeddings generated by the integrated embedding neural networks can be provided as inputs to the action selection neural network (i.e., of the action selection system described with reference to FIG. 1).
  • the training system 200 can store, in each slot of the external memory 220 (or in the slots of an additional external memory), a respective integrated embedding for the time step corresponding to the slot.
  • the training system generates the respective integrated embedding stored in each slot of the external memory by processing agent motion data (and any other appropriate inputs) for the corresponding time step using the integrated embedding neural network.
  • the training system can modify the integrated embeddings stored in the slots of the external memory over the course of training, e.g., using gradients of an objective function, as will be described in more detail below.
  • the training system determines an “integrated” predicted score and an “integrated” target score for each slot in the external memory.
  • the training system generates the integrated predicted score for the slot based on a similarity between: (i) the current integrated embedding generated by the integrated embedding neural network for the time step, and (ii) the integrated embedding corresponding to the slot in the external memory.
  • the training system can generate the integrated predicted scores using any appropriate similarity measure, e.g., a Euclidean similarity measure, a cosine similarity measure, or the similarity measure described with reference to equation (1).
  • the training system further generates an integrated target score for each slot that measures a similarity between: (i) one or more current spatial embeddings generated by the spatial embedding neural networks for the time step, and (ii) one or more of the spatial embeddings corresponding to the slot.
  • the training system can generate the integrated target scores using any appropriate similarity measure (e.g., a Euclidean similarity measure or a cosine similarity measure) to measure a similarity between: (i) the concatenation of one or more current spatial embeddings for the time step, and (ii) the concatenation of one or more spatial embeddings corresponding to the slot.
  • the training system can update the parameter values of the integrated embedding neural networks, and can optionally update the integrated embeddings stored in the slots of the external memory, to optimize an objective function that measures an error between the integrated predicted scores and the integrated target scores.
  • the objective function can be, e.g., a cross-entropy objective function, e.g., as described with reference to equation (3).
  • the training system can update the parameter values of the integrated embedding neural networks, e.g., by backpropagating gradients of the objective function into the integrated neural networks.
  • FIG. 3 shows a data flow 300 that illustrates the operations performed by the spatial embedding training system 200 that is described in more detail with reference to FIG. 2.
  • the training system 200 processes the observation 204 for the time step using an observation embedding neural network 206 to generate a current observation embedding y t .
  • the training system 200 can then determine a respective target score 216 corresponding to each slot in the external memory based on a respective similarity measure between: (i) the current observation embedding y_t, and (ii) each of the observation embeddings m_1, ..., m_S stored in respective slots of the external memory, e.g., as described above with reference to equation (1).
  • the training system 200 processes respective agent motion data using each spatial embedding neural network 102-1 - 102-3 to generate respective current spatial embeddings x_{1,t}, x_{2,t}, x_{3,t}.
  • the training system 200 can then determine, for each spatial embedding neural network r ∈ {1, 2, 3}, a set of similarity measures (shown as 302-1 - 302-3) based on a respective similarity measure between: (i) the current spatial embedding x_{r,t} generated by spatial embedding neural network r, and (ii) each of the spatial embeddings corresponding to spatial embedding neural network r that are stored in respective slots of the external memory.
  • the training system 200 then aggregates the sets of similarity measures 302-1 - 302-3 (e.g., by a product operation) to determine a respective predicted score 214 corresponding to each slot in the external memory.
  • FIG. 4 is a flow diagram of an example process 400 for training one or more spatial embedding neural networks.
  • the process 400 will be described as being performed by a system of one or more computers located in one or more locations.
  • a training system e.g., the spatial embedding training system 200 of FIG. 2, appropriately programmed in accordance with this specification, can perform the process 400.
  • the steps of the process 400 are performed for each time step in a sequence of time steps over which the agent interacts with the environment.
  • the description of the process 400 which follows refers to a current time step in the sequence of time steps.
  • the system receives data characterizing the motion of the agent in the environment at the current time step and an observation characterizing the state of the environment at the current time step (402).
  • the agent motion data can include one or more of: speed data characterizing a speed of the agent at the current time step, angular velocity data characterizing an angular velocity of the agent at the current time step, or translational velocity data characterizing a translational velocity of the agent at the current time step.
  • the observation can include, e.g., an image captured by a camera of the agent that depicts the visual appearance of the environment from the perspective of the agent at the time step.
  • the system processes the observation using an observation embedding neural network to generate an embedding of the observation (404).
  • the observation embedding neural network can be, e.g., a convolutional neural network that is trained to perform an auto-encoding task, i.e., by generating observation embeddings that, when processed by a decoder neural network, enable reconstruction of the original observation.
  • the observation embedding neural network is a dimensionality-reducing neural network, i.e., such that the observation embedding has a lower dimensionality than the observation itself.
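The auto-encoding idea above can be illustrated with a tiny linear autoencoder: observations are encoded to a lower-dimensional embedding and decoded back, with both maps trained to minimize reconstruction error. This stands in for the convolutional encoder/decoder pair; all names, shapes, and hyperparameters are illustrative assumptions.

```python
import numpy as np

def train_linear_autoencoder(X, k, steps=500, lr=0.1, seed=0):
    """Train a linear autoencoder by gradient descent on mean squared
    reconstruction error.

    X: (n, d) matrix of observations; k: embedding dimensionality (k < d).
    Returns encoder E (d, k) and decoder D (k, d).
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    E = rng.normal(scale=0.1, size=(d, k))  # encoder weights
    D = rng.normal(scale=0.1, size=(k, d))  # decoder weights
    for _ in range(steps):
        Z = X @ E                 # (n, k) embeddings
        R = Z @ D                 # (n, d) reconstructions
        G = 2.0 * (R - X) / n     # gradient of MSE with respect to R
        grad_D = Z.T @ G
        grad_E = X.T @ (G @ D.T)
        D -= lr * grad_D
        E -= lr * grad_E
    return E, D
```

After training, `X @ E` yields the dimensionality-reduced observation embeddings; the decoder `D` is only needed for the auto-encoding objective.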
  • the system processes a respective subset of the agent motion data using each spatial embedding neural network to generate a respective spatial embedding (406).
  • An example process for generating a spatial embedding using a spatial embedding neural network is described in more detail below with reference to FIG. 5.
  • the system determines a respective target score for each slot in an external memory based on the current observation embedding (408). Each slot in the external memory corresponds to a respective previous time step and stores: (i) an embedding (representation) of an observation characterizing the state of the environment at the previous time step, and (ii) a respective spatial embedding for the time step corresponding to each spatial embedding neural network.
  • the system determines the target score for each slot in the external memory based on a similarity measure between: (i) the current observation embedding, and (ii) the observation embedding stored at the slot in the external memory, e.g., as described above with reference to equation (1).
  • the system determines a respective predicted score for each slot in the external memory based on the current spatial embeddings (410). To generate the predicted score for a slot in the external memory, the system can determine, for each spatial embedding neural network, a respective similarity measure between: (i) the current spatial embedding generated by the spatial embedding neural network, and (ii) the spatial embedding corresponding to the spatial embedding neural network that is stored in the slot. The system can then determine the predicted score for the slot by aggregating the determined similarity measures between the current spatial embeddings and the spatial embeddings stored in the slot in the external memory, e.g., as described above with reference to equation (2).
  • the system updates the parameter values of each spatial embedding neural network based on an error between the predicted scores and the target scores (412). For example, the system can determine gradients of an objective function (e.g., a cross-entropy objective function) that measures an error between the predicted scores and the target scores, and backpropagate gradients of the objective function into the spatial embedding neural network parameters.
  • FIG. 5 is a flow diagram of an example process 500 for generating a spatial embedding for a time step using a spatial embedding neural network that has a recurrent neural network architecture.
  • the process 500 will be described as being performed by a system of one or more computers located in one or more locations.
  • a training system e.g., the spatial embedding training system 200 of FIG. 2, appropriately programmed in accordance with this specification, can perform the process 500.
  • the system receives a network input for the spatial embedding neural network that includes: (i) agent motion data characterizing the motion of the agent at the time step, and (ii) an output generated by the spatial embedding neural network at the previous time step (502).
  • the output generated by the spatial embedding neural network at the previous time step can be, e.g., the spatial embedding for the previous time step, or an alternative output generated based in part on the observation at the previous time step. Generating an alternative output of the spatial embedding neural network that is based in part on the observation at the time step is described in more detail with reference to steps 508-512.
  • the system processes the network input using the spatial embedding neural network to update a hidden state of the spatial embedding neural network (504).
  • the updated hidden state of the spatial embedding neural network defines the spatial embedding for the time step.
  • the system can provide the updated hidden state as the output of the spatial embedding neural network for the time step (506).
  • the system can generate an alternative output that is based in part on the observation at the time step, as will be described with reference to steps 508-512.
  • Generating an alternative output that is based in part on the observation for the time step can enable the spatial embedding neural network to correct for the accumulation of errors and to incorporate positional and directional information from the observation into the hidden state at the next time step.
  • the system determines a respective weight value for each slot in the external memory (508).
  • the system can determine the respective weight value for each slot based on a similarity measure between: (i) an embedding of the current observation, and (ii) the observation embedding stored in the slot.
  • for example, the system can determine the weight value w_s for slot s as:

    w_s = exp(γ · y_t^T · m_s) / Σ_{s'=1}^{S} exp(γ · y_t^T · m_{s'})

    where γ is a positive scalar parameter that determines the entropy of the distribution of weights, y_t^T is the transpose of the current observation embedding, m_s is the observation embedding stored in slot s of the external memory, s' indexes the slots, and S is the total number of slots.
  • in some implementations, γ is one of the parameters optimized by the training engine 218.
  • the system determines a “correction” embedding based on the weight values for the slots in the external memory (510). For example, the system can determine the correction embedding x̂ as:

    x̂ = Σ_{s=1}^{S} w_s · m_s

    where s indexes the slots in the external memory, S is the number of slots, w_s is the weight value for slot s, and m_s is the spatial embedding corresponding to the spatial embedding neural network that is stored in slot s.
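Assuming the weights are a soft-max over observation similarities and the correction embedding is the weight-averaged sum of stored spatial embeddings, steps 508-510 can be sketched together as follows (names and the example γ value are illustrative):

```python
import numpy as np

def correction_embedding(obs_embedding, memory_obs, memory_spatial, gamma=10.0):
    """Compute a correction embedding from the external memory.

    obs_embedding:  current observation embedding y_t, shape (d_obs,)
    memory_obs:     (S, d_obs) stored observation embeddings
    memory_spatial: (S, d_spatial) stored spatial embeddings
    gamma controls the entropy of the weight distribution.
    """
    y = np.asarray(obs_embedding, dtype=float)
    M_obs = np.asarray(memory_obs, dtype=float)
    M_spa = np.asarray(memory_spatial, dtype=float)
    logits = gamma * (M_obs @ y)
    logits -= logits.max()              # numerical stability
    w = np.exp(logits)
    w /= w.sum()                        # soft-max weights over slots
    return w @ M_spa                    # weighted sum of stored spatial embeddings
```

Slots whose stored observations resemble the current observation thus dominate the correction, pulling the hidden state toward spatial embeddings previously recorded at visually similar locations.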
  • the system generates the output for the time step using the correction embedding (512).
  • the system can process: (i) the updated hidden state of the spatial embedding neural network, and (ii) the correction embedding, using one or more neural network layers (e.g., recurrent layers) of the spatial embedding neural network to generate the output for the time step.
  • the output for the time step can be an embedding having the same dimensionality as the hidden state of the spatial embedding neural network.
  • the output for the time step which depends on the observation for the time step, can be provided as an input to the spatial embedding neural network at the next time step and processed as part of updating the hidden state of the spatial embedding neural network at the next time step.
  • the spatial embedding neural network can correct errors in the spatial information represented by the hidden state as a result of integrating mainly motion information over a possibly lengthy sequence of time steps.
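The recurrent update of steps 502-512 can be sketched as a minimal cell that integrates agent motion into a hidden state and optionally mixes in a correction embedding. The actual network architecture is not specified here; the single tanh layer and the weight names are illustrative assumptions.

```python
import numpy as np

def rnn_step(h, motion, W_h, W_m, correction=None, W_c=None):
    """One recurrent update of a spatial embedding network.

    h:          previous hidden state (the previous spatial embedding)
    motion:     agent motion data for the time step
    correction: optional correction embedding (requires W_c)
    Returns the updated hidden state, which defines the spatial embedding.
    """
    pre = W_h @ h + W_m @ motion
    if correction is not None:
        # Mix in observation-derived information to correct drift from
        # integrating motion alone over many time steps.
        pre = pre + W_c @ correction
    return np.tanh(pre)
```

Without the correction input, the cell performs pure path integration of the motion signal; with it, accumulated positional error can be reduced at the next time step.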
  • Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non-transitory storage medium for execution by, or to control the operation of, data processing apparatus.
  • the computer storage medium can be a machine-readable storage device, a machine- readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
  • the program instructions can be encoded on an artificially-generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus.
  • the term “data processing apparatus” refers to data processing hardware and encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers.
  • the apparatus can also be, or further include, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
  • the apparatus can optionally include, in addition to hardware, code that creates an execution environment for computer programs, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.
  • a computer program which may also be referred to or described as a program, software, a software application, an app, a module, a software module, a script, or code, can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages; and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
  • a program may, but need not, correspond to a file in a file system.
  • a program can be stored in a portion of a file that holds other programs or data, e.g., one or more scripts stored in a markup language document, in a single file dedicated to the program in question, or in multiple coordinated files, e.g., files that store one or more modules, sub-programs, or portions of code.
  • a computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a data communication network.
  • the term “engine” is used broadly to refer to a software-based system, subsystem, or process that is programmed to perform one or more specific functions.
  • an engine will be implemented as one or more software modules or components, installed on one or more computers in one or more locations. In some cases, one or more computers will be dedicated to a particular engine; in other cases, multiple engines can be installed and running on the same computer or computers.
  • the processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output.
  • the processes and logic flows can also be performed by special purpose logic circuitry, e.g., an FPGA or an ASIC, or by a combination of special purpose logic circuitry and one or more programmed computers.
  • Computers suitable for the execution of a computer program can be based on general or special purpose microprocessors or both, or any other kind of central processing unit.
  • a central processing unit will receive instructions and data from a read-only memory or a random access memory or both.
  • the essential elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data.
  • the central processing unit and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
  • a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks.
  • a computer need not have such devices.
  • a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device, e.g., a universal serial bus (USB) flash drive, to name just a few.
  • Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks.
  • embodiments of the subject matter described in this specification can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer.
  • Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input.
  • a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user; for example, by sending web pages to a web browser on a user’s device in response to requests received from the web browser.
  • a computer can interact with a user by sending text messages or other forms of message to a personal device, e.g., a smartphone that is running a messaging application, and receiving responsive messages from the user in return.
  • Data processing apparatus for implementing machine learning models can also include, for example, special-purpose hardware accelerator units for processing common and compute intensive parts of machine learning training or production, i.e., inference, workloads.
  • Machine learning models can be implemented and deployed using a machine learning framework, e.g., a TensorFlow framework, a Microsoft Cognitive Toolkit framework, an Apache Singa framework, or an Apache MXNet framework.
  • Embodiments of the subject matter described in this specification can be implemented in a computing system that includes a back-end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front-end component, e.g., a client computer having a graphical user interface, a web browser, or an app through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back-end, middleware, or front-end components.
  • the components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (LAN) and a wide area network (WAN), e.g., the Internet.
  • the computing system can include clients and servers.
  • a client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
  • a server transmits data, e.g., an HTML page, to a user device, e.g., for purposes of displaying data to and receiving user input from a user interacting with the device, which acts as a client.
  • Data generated at the user device e.g., a result of the user interaction, can be received at the server from the device.
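The notes above describe where and how a system like the one claimed can run; as a concrete illustration of what such a system computes, the following is a minimal, pure-Python sketch of the idea named in the title: a spatial embedding that is updated at each time step by integrating the agent's motion, scored with a predictive objective against the current observation. All names, dimensions, and the specific update rule (a tanh of a linear map) are illustrative assumptions, not taken from the claims.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def update_embedding(W, prev_embedding, motion):
    """Integrate the agent's motion into the spatial embedding:
    s_t = tanh(W @ [s_{t-1}; m_t])  (illustrative update rule)."""
    joint = prev_embedding + motion          # concatenate state and motion
    return [math.tanh(a) for a in matvec(W, joint)]

def predictive_loss(V, embedding, observation):
    """Squared error between the embedding's predicted observation and
    the actual observation -- a simple 'predictive objective'."""
    prediction = matvec(V, embedding)
    return sum((p - o) ** 2 for p, o in zip(prediction, observation))

random.seed(0)
EMB, MOT, OBS = 4, 2, 3                      # assumed dimensions
W = [[random.gauss(0, 0.1) for _ in range(EMB + MOT)] for _ in range(EMB)]
V = [[random.gauss(0, 0.1) for _ in range(EMB)] for _ in range(OBS)]

embedding = [0.0] * EMB                      # initial spatial embedding
for step in range(5):                        # integrate motion over time steps
    motion = [random.gauss(0, 1) for _ in range(MOT)]
    observation = [random.gauss(0, 1) for _ in range(OBS)]
    embedding = update_embedding(W, embedding, motion)
    loss = predictive_loss(V, embedding, observation)
```

In a real implementation the two weight matrices would be trained (e.g., with a machine learning framework such as TensorFlow, as noted above) by minimizing the predictive loss, rather than fixed at random.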

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)
  • Processing Or Creating Images (AREA)
EP21726089.2A 2020-05-15 2021-05-12 Generating spatial embeddings by integrating agent motion and optimizing a predictive objective Pending EP4107669A1 (de)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202063025477P 2020-05-15 2020-05-15
PCT/EP2021/062704 WO2021228985A1 (en) 2020-05-15 2021-05-12 Generating spatial embeddings by integrating agent motion and optimizing a predictive objective

Publications (1)

Publication Number Publication Date
EP4107669A1 2022-12-28

Family

ID=75953853

Family Applications (1)

Application Number Title Priority Date Filing Date
EP21726089.2A 2020-05-15 2021-05-12 Generating spatial embeddings by integrating agent motion and optimizing a predictive objective

Country Status (4)

Country Link
US (1) US20230124261A1 (de)
EP (1) EP4107669A1 (de)
CN (1) CN115315708A (de)
WO (1) WO2021228985A1 (de)

Also Published As

Publication number Publication date
CN115315708A (zh) 2022-11-08
WO2021228985A1 (en) 2021-11-18
US20230124261A1 (en) 2023-04-20

Similar Documents

Publication Publication Date Title
CN110088774B (zh) Environment navigation using reinforcement learning
US11836596B2 (en) Neural networks with relational memory
US11727281B2 (en) Unsupervised control using learned rewards
US11662210B2 (en) Performing navigation tasks using grid codes
CN110546653B (zh) Action selection for reinforcement learning using manager and worker neural networks
US10860927B2 (en) Stacked convolutional long short-term memory for model-free reinforcement learning
EP3602409A1 Selecting actions using multi-modal inputs
US11714990B2 (en) Jointly learning exploratory and non-exploratory action selection policies
US20220326663A1 (en) Exploration using hyper-models
WO2023222887A1 (en) Intra-agent speech to facilitate task learning
CN115066686A (zh) Generating implicit plans for achieving goals in an environment using attention operations over planning embeddings
US20230214649A1 (en) Training an action selection system using relative entropy q-learning
US20230124261A1 (en) Generating spatial embeddings by integrating agent motion and optimizing a predictive objective
US20240189994A1 (en) Real-world robot control using transformer neural networks

Legal Events

Date Code Title Description
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: UNKNOWN

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE

PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE

17P Request for examination filed

Effective date: 20220920

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

P01 Opt-out of the competence of the unified patent court (upc) registered

Effective date: 20230505

DAV Request for validation of the european patent (deleted)
DAX Request for extension of the european patent (deleted)
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: EXAMINATION IS IN PROGRESS

17Q First examination report despatched

Effective date: 20240318