CN120712567A - Controlling agents using language-based success detectors

Controlling agents using language-based success detectors

Info

Publication number
CN120712567A
CN120712567A
Authority
CN
China
Prior art keywords
task
neural network
language model
environment
observation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202480011683.2A
Other languages
Chinese (zh)
Inventor
Yuqing Du
Ksenia Konyushkova
Serkan Cabi
Joao Ferdinando Gomes de Freitas
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Gdm Holdings LLC
Original Assignee
Gdm Holdings LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Gdm Holdings LLC
Publication of CN120712567A

Classifications

    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/004: Artificial life, i.e. computing arrangements simulating life
    • G06N 3/006: Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods


Abstract

Systems and methods are described for controlling the actions of an agent in an environment to perform tasks, for training or evaluating the action selection systems used by such agents, and for obtaining training data for training such action selection systems. The task may be one of a number of different tasks that an agent may be trained or instructed to perform. The systems and methods use a multimodal language model that jointly processes language and data of another modality, such as visual data or acoustic data. The multimodal language model is used as a "success detector" to determine whether a task performed by the agent has been successfully completed.

Description

Controlling agents using language-based success detectors
Background
The present description relates to controlling agents using a neural network-based multimodal language model.
A machine learning model receives an input and generates an output, e.g., a predicted output, based on the received input. Some machine learning models are parametric models and generate the output based on the received input and on the values of the parameters of the model.
Neural networks are machine learning models that employ one or more layers of nonlinear units to predict an output for a received input. In addition to an output layer, some neural networks include one or more hidden layers. The output of each hidden layer serves as the input to the next layer in the network, i.e., the next hidden layer or the output layer. Each layer of the network generates an output from a received input in accordance with the current values of a respective set of parameters.
Disclosure of Invention
This specification describes systems and methods, implemented as computer programs on one or more computers in one or more locations, that can be used to control agents to act in an environment to perform tasks, to train or evaluate the action selection systems used by such agents, and to obtain training data for training such action selection systems. The action selection system may be, but need not be, based on a neural network. The task may be one of a number of different tasks that an agent may be trained or instructed to perform.
The systems and methods use a multimodal language model, i.e., a model that jointly processes natural or computer language and data of another modality, such as visual data or acoustic data. The multimodal language model is used as a "success detector", i.e., it is used to determine whether a task performed by an agent has been achieved, i.e., successfully completed.
In one aspect, a computer-implemented method and corresponding system for training or evaluating an action selection system for controlling an agent acting in an environment to perform a task using a language model neural network is described.
In another aspect, a computer-implemented method and corresponding system for selecting training data for training an action selection system for controlling an agent acting in an environment to perform a particular task using a language model neural network is described.
In another aspect, a computer-implemented method and corresponding system for instructing (e.g., controlling) an agent acting in an environment to perform a task using a language model neural network are described.
In another aspect, an agent configured to select actions to perform one or more tasks in an environment is described, the agent comprising a system as described herein. The agent includes one or more observation capture subsystems to capture observations of an environment, such as still or moving images of a real-world environment.
In a further aspect, a digital assistant device comprising a system as described herein is provided. The digital assistant device may include a user interface to enable a user to request assistance and output information.
The subject matter described in this specification can be implemented in specific embodiments to realize one or more of the following advantages.
Some implementations of the described system provide a robust and versatile way to determine whether a task has been achieved. For example, while a success detector may be trained for a particular task, such models often encounter difficulties when environmental conditions change relative to those observed in the training data, or when the task changes. This may be caused, for example, by natural visual changes in the environment (e.g., due to illumination changes), by camera position changes, by differences between specific objects, or by disturbances in the environment. The described techniques are robust to such changes, and to changes in the language used to describe the task (e.g., "lifting a duckling" versus "lifting a toy duck object"), and can handle ambiguity, such as where success is not precisely defined (e.g., "moving fast"). Implementations of the described system may be used to detect task success based on object state or on agent behavior, and across a wide range of tasks and environmental conditions. The robustness and versatility of the described techniques facilitate training, evaluating, and using agent action selection systems with generic action selection policies.
In implementations of the described system, the use of a pre-trained (multimodal) language model neural network contributes to the versatility of the system, e.g., it facilitates recognition of the same underlying task specified in different language, or recognition under environmental changes such as natural visual changes. Thus, implementations of the system may facilitate the operation of an agent performing tasks in a real-world environment, potentially in unstructured, open, or evolving settings (e.g., with previously unseen objects or in new conditions).
The details of one or more embodiments of the subject matter of the specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.
Drawings
FIG. 1 shows a first example computer system that uses a language model neural network to train or evaluate an action selection system for controlling an agent to perform tasks.
FIG. 2 is a flow chart of a first example process for training or evaluating an action selection system for controlling an agent to perform a task using a language model neural network.
FIG. 3 is a flow chart of a second example process for training or evaluating an action selection system for controlling an agent to perform a task using a language model neural network.
FIGS. 4A and 4B show second and third example computer systems for instructing an agent to perform a task.
FIG. 5 is a flow chart of a third example process for instructing an agent to perform a task.
FIG. 6 shows an example of an agent configured to select actions to perform one or more tasks in an environment.
FIG. 7 shows an example of a digital assistant device for instructing a human to perform a task.
FIG. 8 shows an example multimodal language model neural network.
FIG. 9 illustrates generating an example training data item.
FIG. 10 is a flow chart of an example process for training a multimodal language model neural network.
FIG. 11 illustrates task success detection using a multimodal language model neural network.
FIGS. 12A to 12C show an evaluation of success detection performance.
FIG. 13 illustrates generating training data items for success detection of human task execution.
Like reference numbers and designations in the various drawings indicate like elements.
Detailed Description
FIG. 1 illustrates a first example computer system 100 that uses a language model neural network 120 to train or evaluate an action selection system 110 for controlling an agent 102 acting in an environment 104 to perform tasks. Computer system 100 is an example of a system implemented as a computer program on one or more computers in one or more locations in which the systems, components, and techniques described below are implemented.
The action selection system 110 is configured to process observations 106 characterizing the current state of the environment 104 to generate a policy output 112 for selecting an action 114 to be implemented by the agent 102 to perform a task. Observations 106, which may be referred to as action selection observations, may be processed at each time step in a series of time steps to generate a policy output 112 at each time step. The observations 106 may include observations of a physical, real-world environment of the agent, such as an image of the environment captured by a camera. The actions may include actions performed by the agent in a physical, real world environment. The agent in a physical, real world environment may be, but is not necessarily, a mechanical agent such as a robot.
There are many ways in which the policy output 112 may be used to select an action. For example, the policy output may define a probability distribution over a set of actions that the agent may perform, such as a Gaussian distribution. The probability distribution may then be used to select an action, e.g., by sampling an action from the probability distribution, or by selecting the action with the highest probability. The policy output 112 may parameterize such a probability distribution, or it may define the probability distribution as a set of scores (e.g., a score for each action in a set of possible actions) from which an action may be selected. As another example, the policy output may directly define an action to be implemented by the agent to perform the task, e.g., by identifying a position, velocity, or torque for a mechanical agent. In general, actions may be continuous or discrete; alternatively, continuous actions may be discretized. The actions may include a plurality of individual or primitive actions to be performed at a time step, such as a mixture of continuous and discrete actions. In some implementations, the policy output 112 may include multiple outputs, e.g., from multiple heads on a neural network, for selecting multiple actions at a particular time step.
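By way of illustration only, the following Python sketch (not part of the patent disclosure; the function and variable names are illustrative assumptions) shows the two action selection schemes described above: selecting from a categorical distribution defined by per-action scores, and sampling from a Gaussian distribution parameterized by the policy output.

```python
import numpy as np

rng = np.random.default_rng(0)

def select_discrete_action(scores, greedy=False):
    # Softmax turns the per-action scores into a probability distribution.
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    # Either take the most probable action or sample from the distribution.
    return int(probs.argmax()) if greedy else int(rng.choice(len(probs), p=probs))

def select_continuous_action(mean, std):
    # Here the policy output parameterizes a Gaussian over a continuous
    # action, e.g., a joint velocity or torque; sample an action from it.
    return rng.normal(mean, std)

print(select_discrete_action(np.array([0.1, 2.0, -1.0])))  # e.g., 1
print(select_continuous_action(mean=0.0, std=0.3))
```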
In general, any type of action selection system may be used for action selection system 110. In some implementations, though not essentially, the action selection system 110 includes an action selection neural network. The action selection system 110 may use the action selection neural network to process the observations 106, in accordance with learnable parameters (e.g., weights) of the action selection neural network, to generate the policy output 112. Such an action selection neural network may have any suitable architecture, such as an attention-based neural network architecture (e.g., a Transformer architecture), a convolutional architecture, a fully connected architecture, or any other suitable neural network architecture. The action selection system 110 may be pre-trained, or it may be trained using the language model neural network 120 (e.g., as described later).
Optionally, for example in a multitask implementation, the action selection system 110 may also be configured to process a task input 132 that provides information defining the task to be performed by the action selection system 110, e.g., by describing the task or a goal of the task. The task input 132 may be provided in the form of a text sequence, e.g., in a natural or computer language. The text sequence may be encoded using a tokenizer such as SentencePiece to represent the text as a sequence of text tokens from a vocabulary of text tokens, e.g., each text token representing a word, word piece, or character in a natural or computer language. The action selection system 110 (e.g., an action selection neural network) may process the observations conditioned on the encoded representation of the text sequence to generate the policy output 112. For example, the observations may also be encoded into a sequence of observation tokens, e.g., using an observation encoder, and the text tokens and the observation tokens may be processed by the action selection system 110 (e.g., implemented as a sequence-processing action selection neural network) to generate the policy output 112. Some examples of this type of action selection system are described in PaLM-E (Driess et al., arXiv:2303.03378), RT-1 (Brohan et al., arXiv:2212.06817), and RT-2 (Brohan et al., arXiv:2307.15818).
The language model neural network 120 includes a multimodal model and is configured to process both i) a sequence of text tokens representing text (e.g., words in a natural or computer language), and ii) observation data including an observation of the environment (which may be referred to as a language model observation), to generate a language model output 124. The language model observation may be, but need not be, one of the action selection observations processed by the action selection system 110. As described later, in an implementation, the language model neural network 120 is used to process an input token string 122 that represents a question asking whether a task, or a stage of a task, has been achieved.
The multimodal language model neural network 120 is multimodal in that it processes input data of a plurality of different types or modalities, i.e., text and observation data. In some implementations, the multimodal language model neural network 120 can have respective inputs for the input token string 122 and for the observation data. In some implementations, the multimodal language model neural network 120 internally processes the observation data by representing it as tokens that are processed together with the tokens of the input token string 122. Additionally or alternatively, the multimodal language model neural network 120 can use cross-attention to combine the observation data and data from the input token string 122. In some implementations, the observation data includes visual data representing still or moving image observations. Additionally or alternatively, the observation data may include audio data representing values of an audio waveform, such as instantaneous amplitude data or time-frequency domain data, or data representing other types of observations, such as proprioceptive observations of a robotic agent 102.
In some implementations, the (multi-modal) language model neural network 120 includes a pre-trained language model neural network. That is, in an implementation, the language model neural network 120 is not trained while controlling the agent or while training or evaluating the action selection system 110 using the language model neural network 120, but it is trained in advance. In some implementations, the language model neural network 120 may be trained, e.g., fine-tuned, as described later.
The multimodal language model neural network 120 can have any suitable architecture, such as an attention-based neural network architecture (e.g., a Transformer architecture), a convolutional architecture, a fully connected architecture, or any other suitable neural network architecture.
In some implementations, the multimodal language model neural network 120 includes a so-called sequence-to-sequence model that receives an input string of natural or computer language text tokens, together with observation data (which may also be encoded as tokens, i.e., observation tokens), and generates an output string that includes one or more natural or computer language text tokens. The output string may be generated autoregressively, one token at a time.
One particular example implementation of the multimodal language model neural network 120 is described later, but there are many different types of models that can be employed. For example, where the observation data includes visual data, the multimodal language model neural network 120 may include a Visual Language Model (VLM), such as Flamingo (Alayrac et al., arXiv:2204.14198), PaLI (Chen et al., arXiv:2209.06794), or PaLI-X (Chen et al., arXiv:2305.18565). Some implementations of the multimodal language model neural network 120 may be described as "large" multimodal models, e.g., they may have a very large number of learnable (trained) parameters; the application of the described techniques is not, however, limited to such models.
Generally, as used herein, a language may be a natural or computer language. For example, the multimodal language model neural network 120 may be configured (e.g., trained) to process formalized machine-readable languages, such as code in a computer language (such as a computer programming language or other formalized language). When the language is a computer language, a "word" may be an element of the language, such as a formalized command.
In an implementation, the input token string 122 represents a question in a natural or computer language asking whether a task, or a stage of a task, has been achieved. That is, the input token string 122 may include text tokens, selected from a token vocabulary, that represent the words of the question in a natural or computer language. The question may be an explicit question asking whether the task is complete, such as "Did the robot successfully place the medium-sized gear on the shaft?". The question may be an implicit question, e.g., one stating the goal and asking whether the statement is true or false, such as "The medium-sized gear is on the shaft."
In an implementation, the language model output 124 includes (or defines) an answer to the natural or computer language question, and the answer defines whether the task has been achieved. That is, in an implementation, the language model output 124 provides a binary classification output for classifying the result of the task into one of two categories, task success or task failure. Thus, an implementation of the multimodal language model neural network 120 performs "success detection".
In an implementation, the multimodal language model neural network 120 includes a language generation neural network. The language model neural network may be configured to generate an output token string comprising one or more output tokens (i.e., text tokens selected from the token vocabulary) defining a natural or computer language answer to the question. For example, the answer may comprise a single word or symbol in a natural or computer language defining the success or failure of the task, in particular a binary answer such as "yes" or "no", or "Y" or "N".
In some implementations, the language model neural network processes the input token string 122 and the observation data from the environment to generate a score distribution (e.g., a probability distribution) that assigns a respective score (probability) to each text token in the token vocabulary. The language model neural network may then use the score distribution to select the text tokens of the output token string from the token vocabulary. For example, the text token with the highest score may be selected, or a text token may be sampled from the distribution, e.g., using nucleus sampling or another sampling scheme.
The language model neural network need not explicitly generate natural or computer language output. For example, instead of generating an output token string, it may process the input token string to generate a language model output 124 comprising a vector or scalar output. In some implementations, where the language model output defines a distribution of text token scores over the token vocabulary (e.g., a score for each text token in the vocabulary), the token scores of one, two, or more corresponding tokens of the token vocabulary may be processed to determine the answer, which may be, for example, a scalar value. For example, the answer may be determined from the score of a token indicating success, or from the difference or ratio between the scores of two tokens indicating success and failure of the task, respectively.
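A minimal sketch of this token-score approach follows; the vocabulary indices for the "yes" and "no" tokens and the stand-in model output are hypothetical, and none of the names below come from the patent.

```python
import numpy as np

YES_ID, NO_ID = 3869, 645  # hypothetical positions of "yes"/"no" in the vocabulary

def success_probability(vocab_scores):
    # Normalize the two answer-token scores against each other to obtain a
    # scalar success probability (a two-way softmax).
    yes, no = vocab_scores[YES_ID], vocab_scores[NO_ID]
    return float(np.exp(yes) / (np.exp(yes) + np.exp(no)))

scores = np.random.default_rng(0).normal(size=32000)  # stand-in model output
p = success_probability(scores)
task_achieved = p > 0.5  # binary classification from the scalar answer
```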
Some implementations of computer system 100 include a user interface 130 for a human user, but this is not required as the described techniques may be used without human input. The user interface 130 may provide an input mechanism for obtaining information from a user defining tasks to be performed by the action selection system 110. The information defining the task may include, for example, natural or computer language input from the user. Additionally or alternatively (in a multitasking implementation), information defining the tasks to be performed may be obtained in some other way (e.g. automatically, or by a user selecting a task from a plurality of available tasks, e.g. via a user interface).
The information defining the task may be used to generate natural or computer language questions that ask whether the task or stage of the task has been implemented. In the event that the input from the user has been in the form of a suitable natural or computer language question or statement (e.g., target definition), this may involve using the input from the user with little or no modification. In implementations (e.g., multi-tasking implementations), information defining the tasks may also be provided to the action selection system 110.
By way of example only, the user interface 130 may include a mobile device, a keyboard (and optionally a display), and/or a speech-based input mechanism, e.g., for inputting audio data representing a speech waveform of natural or computer language speech input from the user, and for converting the audio data into tokens representing the speech in the natural or computer language, i.e., tokens representing a transcription of the spoken input. Additionally or alternatively, the user interface 130 may include a camera, e.g., to enable a user to demonstrate a target state or trajectory, and/or to observe a user performing a task.
In some implementations, the computer system 100 includes a training engine 140 to train the action selection system 110, as described later. In some implementations, the training engine 140 may access training data items (not shown in FIG. 1), e.g., in order to train the action selection system 110 using imitation learning.
FIG. 2 is a flow chart of a first example process for training or evaluating an action selection system for controlling an agent acting in an environment to perform a task using a language model neural network. The process of FIG. 2 may be performed by a system of one or more computers located in one or more locations, such as by computer system 100 of FIG. 1.
The process involves obtaining information defining a task (step 200). The information defining the task may define the task as a discrete or continuous behavior, such as "Bring me the banana that is in the pantry" or "Move around the room quickly", or as a desired target state defining when the task has been achieved, or as a desired movement trajectory. The information defining the task may include text in a natural or computer language, e.g., code, and/or some other form of information, such as a task identifier, or an image defining a desired target state of the task (e.g., a particular item placed on a table), or an image defining a desired trajectory for the task (e.g., a sketch or example of a desired trajectory of a robot end effector).
Optionally, the process may also involve generating from this information a natural or computer language question asking whether the task has been achieved, i.e., completed (step 202). As used herein, "question" is to be understood broadly and, as previously described, a question may be explicit or implicit. The information defining the task may already implicitly define the question, and generating the question may include using the information defining the task without modification. Alternatively, a question mark or a prefix such as "Question:" may be added to a natural or computer language statement. In some other implementations, the question may be generated from a natural language definition of the task by processing the natural language definition according to a template, or by using a language model (which may be the multimodal language model neural network 120) with an appropriate prompt that asks the language model to convert the task definition into a natural or computer language question asking whether the task has been achieved.
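As a simple illustration of the template approach (an assumed template, not the patent's own prompt format), a question might be generated from a task description as follows:

```python
def make_question(task_description, explicit=True):
    """Convert a task definition into a success-detection question."""
    if explicit:
        return f"Did the robot successfully {task_description}?"
    # Implicit form: state the goal and ask whether the statement holds.
    return f"True or false: {task_description.rstrip('.')}."

make_question("place the medium-sized gear on the shaft")
# -> "Did the robot successfully place the medium-sized gear on the shaft?"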
The process may then generate an input token string 122 representing text derived from the information defining the task (e.g., representing the natural or computer language question for the language model neural network) (step 204).
The process may also obtain observation data from an observation characterizing a current state of the environment 104 (step 206). The observation data represents the agent acting in the environment to perform the task while the agent is using the action selection system to attempt the task, e.g., at one or more action selection time steps. In an implementation, the observation data is derived from the observations 106. For example, the observation data may represent an observation of the physical, real-world environment of the agent, such as an image of the environment captured by a camera.
The input token string 122 and the observation data from the observation of the environment are processed using the language model neural network 120 to generate the language model output 124 (step 208). In an implementation, the language model output 124 includes an answer, in particular an answer to the natural or computer language question. The answer defines whether the task has been achieved, i.e., completed. The answer may include text, such as one or more words in a natural or computer language, or a score or figure-of-merit that characterizes the result of the task (e.g., that indicates the extent to which the task has been achieved, or the likelihood that the task was successfully completed). Such a score may then be processed to obtain a binary classification of whether the task was achieved.
The process may then involve using the action selection system 110 based on the answer (step 210). As one example, the answers may be used to train the action selection system 110 to perform tasks. As another example, the answer may be used to evaluate whether to perform a task using the action selection system 110, e.g., the same task as the task that the action selection system was observed to be performing or another similar task.
Then, after the training or depending on the outcome of the evaluation, an implementation of the process may use the action selection system 110 to control the agent 102 to perform tasks (step 212).
As one example, multiple (trained) action selection systems 110 may each be evaluated using the procedure described above, and the results of the evaluation used to select a particular action selection system for performing a task, i.e., based on the answers to the natural or computer language question asking whether the task has been achieved. For example, an action selection system may be selected whose answer indicates "yes", or whose score is above a threshold.
As another example, the above-described process may also involve training the action selection system 110, e.g., using a reinforcement learning technique based on rewards determined from the answers, or using an imitation learning technique such as behavioral cloning. Any reinforcement learning or imitation learning technique may be used. For example, training may be performed online using an online reinforcement learning technique such as maximum a posteriori policy optimization. Additionally or alternatively, training may be performed offline using previously stored data, e.g., using an offline reinforcement learning technique such as critic regularized regression (Wang et al., 2020). Additionally or alternatively, training may be performed offline using an imitation learning technique such as behavioral cloning, inverse reinforcement learning, or Generative Adversarial Imitation Learning (GAIL, Ho et al., arXiv:1606.03476). Training using imitation learning may involve filtering previously stored data to remove examples of task execution that were judged unsuccessful by the multimodal language model neural network 120. That is, a corpus of training data may be obtained that includes training data items, each of which includes one or more examples of an agent (robot or human) performing one or more tasks; such agents may be referred to as demonstration agents. The training data may then be filtered to retain only training data items that include one or more successful examples of task execution by a demonstration agent, and the action selection system 110 may then be trained on the filtered training data, e.g., using imitation learning.
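For example, a sparse reinforcement learning reward might be derived from the success detector roughly as in the following sketch, under assumed interfaces (a `detector` callable returning a "yes"/"no" answer string; none of these names come from the patent):

```python
def episode_reward(detector, question, final_observation):
    # Query the multimodal language model once, at the end of the episode;
    # a "yes" answer yields reward 1.0, anything else yields 0.0.
    answer = detector(question, final_observation)
    return 1.0 if answer.strip().lower().startswith("yes") else 0.0
```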
Generally, training the action selection system 110 involves iteratively adjusting the values of the learnable parameters of the action selection system 110, e.g., the weights of an action selection neural network of the action selection system 110. This may involve iteratively backpropagating gradients of a reinforcement learning or imitation learning objective function through the action selection system 110 to update the learnable parameters. Such training may use any suitable gradient descent optimization algorithm, such as Adam or another optimization algorithm. Any suitable reinforcement learning or imitation learning objective function may be used, e.g., an objective function that depends on the square of the Bellman error (for reinforcement learning), or an objective function that depends on a difference between the action distribution defined by the policy output 112 and the action distribution defined by the actions of the demonstration agent(s) in the set of training data items (for imitation learning).
FIG. 3 is a flow chart of a second example process for using a language model neural network to select training data for training an action selection system for controlling an agent acting in an environment to perform a particular task. The process of fig. 3 may be performed by a system of one or more computers located in one or more locations, such as by computer system 100 of fig. 1.
The process involves obtaining a set of training data items (step 300). Each training data item includes observations, e.g., visual observations, of an entity acting in an environment while attempting to perform a task. The entity may be a human or another agent, e.g., an agent controlled by another computer system, and the set of training data items may include both. Generally, though not necessarily (since the system may generalize), the set of training data items will include at least some data items in which the environment is similar to the environment in which the trained agent will act, and at least some data items in which the task matches the particular task that the action selection system is to be trained to perform. In an implementation, the observations of the entity include still or moving images of the entity. The observations (e.g., visual observations) may be simulated observations, and the language model neural network may be trained, partially or wholly, using simulated data, e.g., before the system (in particular the language model neural network 120) is used to process real-world data.
The process may also obtain information defining the particular task (step 302), and optionally generate a natural or computer language question from the information asking whether the particular task has been achieved (step 304). In some implementations, for example, where different training data items include observations of different tasks, information defining a particular task and corresponding questions may be obtained for each training data item.
For each of the training data items, the process generates an input token string 122 representing text derived from the information defining the task (e.g., the natural or computer language question for the language model neural network) (step 306). The input token string 122 and observation data from the observations in the training data item are processed using the language model neural network to generate a language model output 124 (step 308). The observations in the training data item may be, e.g., still or moving images (such as video clips). As described above, the language model output 124 includes, e.g., an answer to the natural or computer language question defining whether the particular task was achieved in the training data item. The process identifies, for the training data item, whether the particular task was achieved in the training data item (step 310). This provides an identified training data item, i.e., a training data item with an identification indicating whether the particular task is achieved in the training data item. A training data item may be associated with a plurality of different tasks, in which case the identification may indicate whether a task represented by a particular observation in the training data item is achieved.
The process may train the action selection system 110 to perform the particular task using the identified training data items (e.g., by selectively using the identified training data items) (step 312). For example, each of the training data items may be labeled with the answer, e.g., according to whether the particular task is achieved, and the labeled training data items may be used to train the action selection system 110. As another example, the training data items may be filtered to retain only the subset of training data items in which the particular task is achieved, or achieved with a score greater than a threshold. This subset of training data items may then be used to train the action selection system 110. In some implementations, the action selection system is trained using imitation learning (e.g., behavioral cloning) based on the labeled or filtered training data items. In some implementations, the action selection system is trained while being used to perform tasks, e.g., through reinforcement learning.
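A sketch of the filtering variant follows, assuming a `detector` that returns a scalar success score in [0, 1]; the item structure, names, and threshold are illustrative only.

```python
def filter_training_items(items, detector, question, threshold=0.5):
    """Retain only training data items judged successful by the detector."""
    kept = []
    for item in items:  # each item holds e.g. the observations of one episode
        score = detector(question, item["observations"][-1])
        if score > threshold:
            kept.append(item)  # keep for imitation learning (behavioral cloning)
    return kept
```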
The process may also involve using the trained action selection system to control an agent acting in the environment to perform the particular task (step 314).
FIG. 4A illustrates a second example computer system 400 for using the language model neural network 120 to instruct (e.g., control) an agent 102 acting in the environment 104 to perform a task. Computer system 400 is an example of a system implemented as a computer program on one or more computers in one or more locations in which the systems, components, and techniques described below are implemented.
In the arrangement of fig. 4A, the action selection system 110 receives and uses the language model output 124 from the language model neural network 120, e.g., as described with reference to fig. 5.
FIG. 5 is a flow diagram of a third example process for instructing (e.g., controlling) an agent acting in an environment to perform a task. The process of fig. 5 may be performed by a system of one or more computers located in one or more locations, such as by computer system 400 of fig. 4A.
The process involves obtaining information defining a task (step 500), and optionally generating from the information a natural or computer language question asking whether the task has been achieved (step 502). An input token string 122 representing text derived from the information defining the task (e.g., the natural or computer language question) is provided to the language model neural network 120 (step 504).
The process also obtains observation data from the environment 104 (step 506). The observation data represents actions taken by the agent 102 in the environment 104 to perform the task, i.e., it includes observations of those actions. The input token string 122 and the observation data from the environment are processed using the language model neural network 120 to generate the language model output 124 (step 508). The language model output 124 includes an answer to the natural or computer language question that defines whether the task has been achieved.
The process involves using the answer to instruct the agent (step 510), e.g., by using the answer to control the action selection system 110.
As one example, the process may involve controlling the agent to stop taking actions to perform the task when the answer indicates that the task has been achieved, e.g., to terminate the task early. Additionally or alternatively, the agent may be controlled to continue taking actions to perform the task when the answer indicates that the task has not yet been achieved. Controlling the agent may involve controlling the action selection system 110.
As another example, the agent 102, and more particularly, the action selection system 110, may be configured to perform a series of tasks. The process may then involve controlling the agent to stop taking action to perform one task and begin taking action to perform the next task in the series when the answer indicates that the task has been achieved. Thus, each task in the series of tasks may be a step or subtask of a larger overall task.
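A control-loop sketch for such a series of tasks follows, under assumed interfaces (`env.reset`, `env.step`, `agent.act`, and a `detector` returning "yes"/"no"); these names are illustrative, not the patent's API.

```python
def run_task_series(agent, detector, env, tasks, max_steps=500):
    """Perform each subtask until the success detector reports completion."""
    obs = env.reset()
    for task in tasks:
        question = f"Did the agent successfully {task}?"
        for _ in range(max_steps):
            obs = env.step(agent.act(obs, task))
            if detector(question, obs).lower().startswith("yes"):
                break  # subtask achieved: stop early, move to the next task
```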
In some implementations, using the answer to instruct the agent includes using the answer to control the actions taken by the agent to perform the task. For example, controlling the agent (e.g., a mechanical agent) may involve controlling the action selection system 110. Instructing the agent may include providing control signals to the agent to control the actions taken by the agent.
Fig. 4B illustrates a third example computer system 450 for instructing an agent to perform a task, wherein the agent 102 is a human. Computer system 450 is another example of a system implemented as a computer program on one or more computers in one or more locations in which the process of FIG. 5 may be implemented.
In FIG. 4B, using the answer to instruct the agent 102 includes using the answer to instruct a human. The digital assistant device 700 (sometimes also referred to as a virtual assistant) observes the human agent 102 and, based on the observations, provides instructions that enable the human to perform tasks. The digital assistant device 700 provides observations, or observation data derived from the observations, and a question 702 to the multimodal language model neural network 120, as previously described, and receives an answer 704 from the multimodal language model neural network 120 for instructing the human 102.
Thus, using the answer to instruct the human agent 102 may include instructing a human user of the digital assistant device 700. The digital assistant device 700 may be a smart speaker, or a smart display, or a mobile device, or generally any computing device suitable for communicating with a human user to implement the process of FIG. 5. The digital assistant device 700 captures observations of the actions taken by the human 102 in the environment 104 to perform a task, e.g., using a camera. In addition to or instead of using the answer to instruct the agent, the answer may be used to control the digital assistant device, e.g., to stop receiving and/or processing observations, to protect privacy or to reduce power usage.
In general, in the systems and methods described above, the observations of the environment may include any type of observations, including, but not limited to, visual observations, audio observations, and proprioceptive observations. As an example, in some implementations, the observations may include captured still images or captured moving images (video clips) or audio clips, and the observation data may include visual data, including images or image sequences, for example, or audio data. Thus, the observation data may comprise, for example, pixel values of pixels of a still or moving image, audio signal values in the time or frequency domain, or (for proprioceptive observation) sensor signal values.
In some implementations, observations may be captured by a camera or lidar sensor, i.e., "image" and "visual data" include a lidar point cloud, or by a microphone or other audio transducer, or by a position, force, torque, acceleration, or other sensor. In an implementation, image or audio or proprioceptive information is captured from the real world, e.g. by a camera or microphone or sensor. The observations used by the action selection system may be from the same or different sensors as those used by the language model neural network, and these observations may be captured at the same or different times.
An answer (to a natural or computer language question) defining whether a task has been achieved may be used in a variety of ways. For example, it may be used directly or indirectly to train or evaluate an action selection system, to identify training data items, or to control an agent.
In some implementations, a system or method as described herein may involve obtaining observation data, such as video or audio clips, from the environment at a series of stages of execution of a task, e.g., at the beginning and end of the task, or at multiple stages throughout the task (e.g., as defined by a sliding time window). Then, for each of the stages, the method may involve processing an input token string (which may represent a question asking whether the stage has been achieved) and the observation data from the environment for the stage, using the language model neural network, to generate a language model output that includes an answer for the stage. The method may determine from the language model outputs of two or more stages whether the (overall) task has been achieved. The question may ask whether the (overall) task has been completed, or whether one of the stages of the task has been completed. The method may then control the agent, or train or evaluate the action selection system, or select training data items, e.g., in response to the determination of whether the task or stage of the task has been achieved.
As one example, determining whether a task has been achieved may include determining a difference between the answers for two or more of the stages (one toward the beginning of the task and one toward the end of the task), e.g., to reduce the computational load of processing every observation. This may be appropriate where the success of the task is determined by the state of the environment (which may be defined by the state of one or more objects in the environment).
As another example, determining whether a task has been achieved may include determining whether the answer indicates that the task has been achieved for any of the stages. This is appropriate where the success of the task is determined by ongoing behavior of the agent (e.g., moving around a room and sweeping the room). Where success is specified by ongoing behavior, it can be useful for the observation data to include moving image (i.e., video) data.
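The two aggregation rules might be combined as in the following sketch (assuming boolean per-stage answers; illustrative only):

```python
def task_achieved(stage_answers, behavior_task=False):
    """Combine per-stage success answers into an overall verdict."""
    if behavior_task:
        # Behavior-defined tasks (e.g., "move around the room"): success if
        # the behavior is detected in any observation window.
        return any(stage_answers)
    # State-defined tasks: the goal state should be absent at the start of
    # the task and present at the end, i.e., the answers should differ.
    return (not stage_answers[0]) and stage_answers[-1]
```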
FIG. 6 illustrates an example of an agent 600 configured to select actions to perform one or more tasks in an environment. In FIG. 6, an agent 102 as described above includes a system implemented as a computer program on one or more computers in one or more locations, in which the processes described above may be implemented.
The agent 600 includes one or more observation capture subsystems 610 to capture observations of the environment, particularly for the action selection system 110 and the language model neural network 120. The environment 104 may be a real-world environment and the observation capture subsystem 610 may, for example, capture still or moving images of the real-world environment 104.
The agent also has a natural or computer language interface 620 to receive a natural or computer language description 622 of the task to be performed, i.e., including information defining the task.
Agent 600 includes an action selection system 110 and may include or have a language model interface 630 to a language model neural network 120 (which may be local to the agent or remote to the agent, e.g., on a remote server). The action selection system 110 is configured to process the first captured observation of the environment to generate an action selection policy output 112 for selecting an action 114 at a time step to control the agent to perform a task. Agent 600 may include an agent control system (not shown) to interface with the agent and control the agent to perform tasks according to the selected actions. For example, where the agent 600 comprises a robot, the agent 600 may comprise a robot control system.
The agent may interface with the language model neural network 120 using the language model interface 630. The language model neural network 120 may process an input token string, in particular the previously described input token string 122 representing a question asking whether a task has been achieved, and a second captured observation of the environment, to obtain a language model output 124 that includes an answer to the question. The second captured observation of the environment may be, but need not be, the same as the first captured observation of the environment. The input token string 122 representing the question may be generated by the agent 600, e.g., using the language model interface 630, or it may be generated by the language model neural network 120. That is, in some implementations, the language model neural network 120 may receive text input from the agent 600 and generate text output for the agent 600.
The action selection system 110 may be configured to be controlled using answers from the language model neural network 120 (e.g., answers obtained via the language model interface 630 (if present)).
As previously described, in some implementations the agent 102 includes a human user of a digital assistant device such as a smart speaker, smart display, or other device. Information defining the task may then be obtained from the digital assistant device, and the digital assistant device may be used to instruct the user based on the answer. For example, this may include receiving a request for assistance from the user at the digital assistant device, and determining, in response to the request, a series of tasks for the user to perform, e.g., steps or subtasks. Then, for one or more tasks in the series, e.g., for each task in turn until the last task in the series, the digital assistant device may be used to output to the user an indication of the task (e.g., step or subtask) to be performed. This may be done using natural language, e.g., on a display and/or using a speech synthesis subsystem of the digital assistant device. Visual (e.g., video) and/or audio observations of the user performing the task may be captured, e.g., using the digital assistant device. The system as described above may then be used to determine from the answer whether the user has successfully achieved the task, or task step or subtask. If there are further tasks to complete, the digital assistant device may, in response, proceed to the next task (if any) in the series, e.g., by outputting an indication of the next task to perform. In this way, the user may be guided step by step through a series of tasks to perform an overall task.
FIG. 7 illustrates an example of a digital assistant device 700 that includes a system as described above, which may be implemented as a computer program on one or more computers in one or more locations.
The digital assistant device 700 may include a user interface 710 to enable a user to request assistance and output information. In an implementation, this is a natural language user interface and may include a keyboard, a voice input output subsystem, and/or a display. The user interface 710 may receive a request for assistance with a task (e.g., a general task as described above) and may output instructions to the user to perform the action 114 to complete each of a series of subtasks of the general task.
The digital assistant device may include a local or remote assistance subsystem 730 configured to determine a series of tasks to be performed by the user in response to the request. In an implementation, this may include a local or remote generative (large) language model, in particular one configured for dialog, such as a conversational agent, e.g., LaMDA (Thoppilan et al., arXiv:2201.08239). The language model may be prompted appropriately to generate an output identifying the series of tasks, e.g., by asking it how the requested overall task can be broken down into steps. The generative (large) language model may be the language model neural network 120, or it may be another model.
The digital assistant device 700 may include an observation capture subsystem 750 and a language model interface 740 to the language model neural network 120, which may also be implemented locally or remotely. The observation capture subsystem 750 may be similar to the observation capture subsystem 610 described above, and may capture visual and/or audio observations of the human user performing tasks and provide corresponding observation data to the language model interface 740.
The digital assistant device 700 may include an assistance control subsystem 720 configured to assist the user. The assistance control subsystem 720 may receive, from the assistance subsystem 730, an indication of the series of tasks that the user should perform, either serially or in parallel. The assistance control subsystem may be configured to implement the process described above for one or more tasks in the series of tasks, e.g., until the last task in the series. More specifically, the assistance control subsystem 720 may output to the user, e.g., via the user interface 710, an indication of the next task to be performed.
The assistance control subsystem 720 may also use the observation capture subsystem 750 to capture visual and/or audio observations of the task being performed by the user. It may use the language model interface 740 to provide to the language model neural network 120 a question asking whether the task has been achieved (based on the task and the observations), and may determine from the answer whether the user has successfully achieved the task. In response to determining that the user has successfully achieved the task, the digital assistant device 700 (in particular the assistance control subsystem 720) may proceed to the next task in the series of tasks, and/or may control the digital assistant device 700, e.g., to stop capturing observations.
As an illustrative example, a user may interface with the digital assistant device 700 and request assistance in performing an overall task with multiple steps, such as cooking pasta. As the user performs the task, the digital assistant device receives audio and/or video input representing the user's progress on the task, e.g., images or video or sound clips of the user cooking. The captured audio and/or video is provided to the language model neural network 120 along with a question asking whether the user has completed a particular step (e.g., "Has the user finished chopping the peppers?"). If the answer confirms that the user has completed the step successfully, the digital assistant device 700 proceeds to tell the user to perform the next step; or, if at the end of the task, or if the overall task is a single-step task, the digital assistant device may indicate this to the user. The digital assistant device may then cease receiving or processing audio and/or video input, to ensure privacy and/or to reduce power usage.
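A sketch of such a step-by-step assistance loop follows, under assumed device interfaces (`say`, `capture_observation`, `stop_capture`) and a "yes"/"no" `detector`; all names are illustrative assumptions.

```python
def guide_user(assistant, detector, steps):
    for step in steps:
        assistant.say(f"Next step: {step}")
        question = f"Has the user finished the following step: {step}?"
        while True:
            clip = assistant.capture_observation()  # e.g., a short video clip
            if detector(question, clip).lower().startswith("yes"):
                break  # step achieved: announce the next one
    assistant.say("All steps complete.")
    assistant.stop_capture()  # stop processing observations, preserving privacy
```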
The various systems and processes described above generally involve providing an input token string representing a natural or computer language question to the language model neural network 120. The input token string may be obtained from the natural or computer language question using a tokenization system as previously described; many such systems are known. In general, the input token string may comprise a sequence of text tokens representing words in a natural or computer language, and the tokens may be selected from a token vocabulary. The sequence of text tokens specifies to the language model neural network what is required of the language model output, i.e., the text tokens represent the natural or computer language question asking whether a particular task has been achieved. Typically, a text token represents a word or word piece, such as a word segment or morpheme, but it may also represent an individual letter or a whole word. The tokens may include tokens representing punctuation, characters, numbers, or symbols, particularly in the context of a computer language.
In some implementations, the input token string may additionally include text tokens representing words and/or characters intended to guide the language model output. For example, the natural or computer language question may include a prompt or label, e.g., it may be preceded by a label such as "Q:" or "Question:", or a question mark may be appended. The prompt or label may also be converted into text tokens of the input token string.
In some implementations, successful execution of a task may be defined by a target observation (such as a target image or a target sound), or by a target trajectory defining a target movement. For example, a target image may show the environment in a particular state, e.g., an object in the environment, or may show a desired trajectory; a target video may show a target behavior or trajectory. Similarly, a target sound may indicate when the environment (e.g., an object in the environment) has achieved a particular state, such as a click when an Ethernet connector is fully plugged into its socket. In such implementations, obtaining the information defining the task may include obtaining target observation data defining one or more target states of the environment for successful execution of the task.
Alternatively, the target observation may be referenced in the input token string, for example by including a tag that indicates to the language model neural network that the target observation is part of the natural or computer language question. Such a tag may be included, for example, at the beginning of the natural or computer language question. As an example, an "<image>" tag may be used to reference a target image, followed by a question such as "Did the agent successfully achieve this goal state?". The tokens of the tag may be included in the token vocabulary.
In this case, the input token string including the tag referencing the target observation may be provided as a prompt to the multimodal language model neural network before the language model neural network processes the observation data from the environment. That is, the multimodal language model neural network may first process the input token string and the target observation data, and then process the observation data from the environment, prompting the multimodal language model neural network 120 in a manner substantially similar to prompting a large language model.
In some implementations, the language model neural network is configured to jointly model the input token string 122 and the observation data to determine the language model output 124.
As previously described, the multimodal language model neural network 120 may include a pre-trained Visual Language Model (VLM), of which many examples exist. By way of illustration only, FIG. 8 shows details of one manner in which the multimodal language model neural network 120 may be implemented.
More specifically, FIG. 8 illustrates an example multimodal language model neural network 800 that can be implemented as a computer program on one or more computers in one or more locations. The multimodal language model neural network 800 is suitable for use as the multimodal language model neural network 120.
In FIG. 8, the input token string 122 and observation data from the environment are processed using the language model neural network 800 to generate the language model output 124. The observation 106 (more specifically, the observation data) is processed using an observation encoder neural network subsystem 810 to determine a compressed representation of the observation data, e.g., a set of observation tokens 812. The compressed representation and data derived from the input token string 122 may then be processed, by applying a cross-attention mechanism between them, to generate the language model output 124.
The cross-attention mechanism may be implemented by a cross-attention neural network layer 820. Many cross-attention mechanisms are possible. One such mechanism is a query-key-value (QKV) attention operation, in which queries derived from the text tokens attend over the observation data, from which keys and values are obtained. For example, a set of key-value vector pairs may be obtained from the compressed representation, and one or more query vectors may be derived from the input token string 122. The output of the cross-attention mechanism may be determined as a weighted sum of the values, weighted by a similarity function between the query and each respective key.
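By way of illustration only, a minimal sketch of such a QKV cross-attention operation is shown below, written in PyTorch as an assumed framework; the single-head formulation, dimensions, and projection choices are illustrative, not a definitive implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttention(nn.Module):
    # Single-head QKV cross-attention: queries are derived from the text
    # tokens; keys and values are derived from the observation data.
    def __init__(self, text_dim: int, obs_dim: int, attn_dim: int):
        super().__init__()
        self.q_proj = nn.Linear(text_dim, attn_dim)
        self.k_proj = nn.Linear(obs_dim, attn_dim)
        self.v_proj = nn.Linear(obs_dim, attn_dim)
        self.scale = attn_dim ** -0.5

    def forward(self, text: torch.Tensor, obs: torch.Tensor) -> torch.Tensor:
        # text: [batch, n_text, text_dim]; obs: [batch, n_obs, obs_dim]
        q = self.q_proj(text)
        k, v = self.k_proj(obs), self.v_proj(obs)
        scores = torch.einsum("btd,bod->bto", q, k) * self.scale
        weights = F.softmax(scores, dim=-1)  # similarity of each query to each key
        return torch.einsum("bto,bod->btd", weights, v)  # weighted sum of values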
In some implementations, the language model neural network includes a stack of processing layers comprising a plurality of token processing neural network layers 830 and a plurality of cross-attention neural network layers 820. Each cross-attention neural network layer 820 may be arranged to receive the compressed representation (e.g., may be conditioned on the compressed representation), and may implement a cross-attention mechanism as described above. The token processing neural network layers 830 may be interleaved with the cross-attention layers. For example, the layer output of one of the cross-attention neural network layers 820 may provide the layer input to one of the token processing neural network layers 830. Similarly, the layer output of one of the token processing neural network layers 830 may provide the layer input to one of the cross-attention neural network layers 820, and so on.
In some implementations, some or all of the cross-attention neural network layers 820 are gated cross-attention layers. In a gated cross-attention layer, the output of the layer is combined with the input of the layer in a ratio dependent on a gating parameter, e.g., a larger gating parameter gives a larger proportion of the layer output. This allows the gating parameters to increase gradually during training, e.g., as the token processing neural network layers and the cross-attention neural network layers learn to cooperate with each other.
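By way of illustration only, the following sketch shows one possible gated cross-attention layer, assuming tanh gating of a zero-initialized gating parameter (a scheme used, e.g., in Flamingo); it reuses the CrossAttention module sketched above, and the exact gating scheme is an assumption.

class GatedCrossAttention(nn.Module):
    # The layer output is mixed with the layer input in a ratio set by a
    # learned gating parameter; with the gate initialized to zero, the layer
    # initially passes the text features through unchanged, and its influence
    # can grow gradually during training.
    def __init__(self, text_dim: int, obs_dim: int):
        super().__init__()
        self.attn = CrossAttention(text_dim, obs_dim, text_dim)  # from the sketch above
        self.gate = nn.Parameter(torch.zeros(1))  # gating parameter

    def forward(self, text: torch.Tensor, obs: torch.Tensor) -> torch.Tensor:
        return text + torch.tanh(self.gate) * self.attn(text, obs)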
Processing the compressed representation of the observation data and the data derived from the input token string 122 may include processing the data derived from the input token string using the stack of processing neural network layers 820, 830, attending to the compressed representation using each of the cross-attention layers 820.
In some implementations, the token processing neural network layers 830 may include a self-attention neural network layer and, optionally, a feed-forward neural network layer. The self-attention neural network layer has an attention layer input and is configured to apply an attention mechanism to the attention layer input to generate an attention layer output. Many different attention mechanisms may be used, such as the QKV mechanism described above. In a self-attention mechanism, the query, key, and value vectors are all derived from the (same) attention layer input. The self-attention neural network layer and/or the feed-forward neural network layer may have residual connections.
In some implementations, the observation encoder neural network subsystem 810 includes an observation encoder 814 (e.g., a neural network such as a ResNet) and one or more observation encoder cross-attention neural network layers 816.
The compressed representation may include a set of observation tokens 812, which may comprise a fixed number of tokens independent of the observation data, e.g., a fixed number of tokens for each image or video clip. The observation tokens 812 may be determined by processing the observation data using the observation encoder 814 to generate encoded observation data comprising a set of observation features (e.g., visual features). For example, each image of an image or video clip may be encoded in this manner. The set of observation tokens 812 may then be determined from the encoded observation data by processing a set of (learned) latent vectors 818 using the one or more observation encoder cross-attention layers 816, configured to cross-attend to the observation features, to generate the set of observation tokens, e.g., as described in Jaegle et al., "Perceiver IO: A General Architecture for Structured Inputs and Outputs", arXiv:2107.14795. That is, after training, the set of latent vectors comprises a set of predetermined latent vectors. This method may be used, for example, to generate a (fixed-size) set of visual tokens from an image or video clip.
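By way of illustration only, the following sketch shows one way the latent-vector resampling described above could be implemented, reusing the CrossAttention module sketched earlier; the number of latents, the dimensions, and the residual arrangement are assumptions rather than the disclosed architecture.

class ObservationResampler(nn.Module):
    # A fixed set of learned latent vectors cross-attends to the encoded
    # observation features, yielding a fixed number of observation tokens
    # regardless of the size of the input.
    def __init__(self, n_tokens: int, latent_dim: int, feature_dim: int, n_layers: int = 2):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_tokens, latent_dim))
        self.layers = nn.ModuleList(
            [CrossAttention(latent_dim, feature_dim, latent_dim) for _ in range(n_layers)])

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: [batch, n_features, feature_dim], e.g., from a ResNet encoder
        latents = self.latents.unsqueeze(0).expand(features.shape[0], -1, -1)
        for layer in self.layers:
            latents = latents + layer(latents, features)  # residual cross-attention
        return latents  # [batch, n_tokens, latent_dim] observation tokens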
The language model neural network 800 may include a token embedding subsystem that, before the input token string is processed by the stack of processing layers, applies an embedding function to each token of the input token string to convert it into a corresponding token embedding, i.e., a vector of numeric values. Such an embedding function may be fixed or learned. In implementations, a position encoding may be applied to (e.g., added to or concatenated with) each token embedding to indicate the position of the respective token in the input token string 122.
As previously described, a pre-trained model (e.g., a VLM) may be used as the language model neural network 120 of FIG. 1, since such a pre-trained model may already be capable of performing the type of question-answering task, e.g., visual question-answering task, used herein for task success detection. Nevertheless, in some implementations such a pre-trained model may be "fine-tuned" (i.e., further trained) using additional training data items. These additional training data items may be specific to task success detection. Such additional training data items may each include an observation of the task, a natural or computer language question asking whether the task was achieved, and an answer appropriate to the observation.
Additionally or alternatively, if further adaptation of the pre-trained multimodal language model neural network is beneficial, the multimodal language model neural network 120 may be prompted with one or more examples of task success detection (e.g., each including an image, question text, and example answer text).
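By way of illustration only, such a few-shot prompt might be assembled as follows; the "<image>" tag syntax and the wording of the questions are assumptions consistent with the examples above, not the disclosed prompt format.

few_shot_prompt = (
    "<image> Q: Did the robot successfully stack the blocks? A: no\n"
    "<image> Q: Did the robot successfully open the drawer? A: yes\n"
    "<image> Q: Did the robot successfully insert the gear? A:"
)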
In implementations, neither fine-tuning nor prompting of the multimodal language model neural network 120 is necessary.
Where the multimodal language model neural network 120 or the language model neural network 800 is trained from scratch, a two-stage training process may be used. In a first, pre-training stage, the multimodal language model neural network may be trained on a large but noisy corpus of data. In this corpus, the training data items may include images or other observations and associated text, but need not be specific to task success detection. Such training data items may be obtained, for example, from the web; the ALIGN dataset (Jia et al., arXiv:2102.05918) is one example. In this context, "large" may mean more than one billion training data items, e.g., image-text pairs. In a second, fine-tuning stage, the multimodal language model neural network may be trained on a smaller corpus of additional training data items, relatively fewer in number than the training data items used in the first stage but specific to success detection.
In the case of fine-tuning, the (additional) training data items may be generated by manually annotating observations of the task to indicate whether the task has been completed. For observations that extend in time, such as video or audio clips, the manual annotations may also indicate when the success occurs. For video or audio clip observations, for example, this may be used to split the clip into non-overlapping sub-sequences before and after the success point. Two (additional) training data items may then be generated, each comprising a question asking whether the task has been completed successfully, and each comprising an appropriate ground-truth answer, e.g., yes if the sub-sequence ends in one or more success frames, and no otherwise. With multiple human raters, success or failure may be determined by majority vote, and the success point may be determined as the median, across raters, of the first frame annotated as successful.
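By way of illustration only, a minimal sketch of this splitting is shown below, assuming clips are represented as lists of frames and the annotated success point is a frame index; the dictionary layout of the training data items is an assumption.

def make_training_items(frames, success_index, question):
    # Split an annotated clip at the success point into two non-overlapping
    # sub-sequences, pairing each with the question and a ground-truth answer.
    return [
        {"video": frames[:success_index], "question": question, "answer": "no"},
        {"video": frames[success_index:], "question": question, "answer": "yes"},
    ]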
These (additional) training data items may include observations of an environment or task that is similar or identical to the environment or task in which the system will later be used, although this is not required. When a task description "{task}" is available, a template such as "Did the robot/agent/person successfully {task}?" may be used to generate the question. When no task description is available but there is a narration corresponding to an action, e.g., as in the Ego4D dataset (Grauman et al., arXiv:2110.07058), a trained VLM such as Flamingo (arXiv:2204.14198) may be used to convert the narration into a question, e.g., converting a narration such as "The person is scooping the ice cream." into "Did the person successfully scoop the ice cream?".
In one illustrative example only, a training dataset of (additional) training data items was generated by using human operators to provide 101,789 demonstrations of 6 tasks, using a 6DoF control device and a Panda robotic arm (Franka Emika GmbH). Each episode was then annotated with rewards by humans, for each task, giving 6 reward annotations per episode, one per task: all frames with a success state (i.e., where the task is solved) were marked with a positive reward, and otherwise with a zero reward (if the task was accidentally undone within the episode, the reward annotation reverted to zero). The reward annotations and corresponding episode frames were then converted into training data items, with a ground-truth answer obtained from the human annotations (if a video clip contained a single transition from zero to positive reward, or had only positive rewards throughout, the clip was labeled with a "success" answer; otherwise the clip was labeled as unsuccessful). The multimodal language model neural network 120 was then trained to detect the success of multiple different tasks.
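By way of illustration only, the clip-labeling rule described above might be implemented as follows; the representation of the annotations as a per-frame reward list is an assumption.

def clip_label(rewards) -> str:
    # rewards: one human-annotated reward per frame of the clip.
    positive = [r > 0 for r in rewards]
    if all(positive):
        return "yes"  # positive rewards throughout
    rises = sum(1 for a, b in zip(positive, positive[1:]) if not a and b)
    falls = sum(1 for a, b in zip(positive, positive[1:]) if a and not b)
    # a single zero-to-positive transition, never undone, counts as success
    return "yes" if rises == 1 and falls == 0 else "no"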
FIG. 9, taken from this training dataset, shows an example of generating two (additional) training data items 910, 912 from an annotated behavior trajectory video 900 of a task, with a success point 902. Each training data item 910, 912 includes a natural language question, a portion of the video showing, respectively, failure or success of the task, and a corresponding answer to the natural language question, respectively no or yes. Where the answer is yes, the portion of the video in the training data item contains or follows a success frame.
Using the example language model neural network 800 of FIG. 8, one example training process that may be used for pre-training or fine-tuning is now described. Other language model neural networks may be trained in a manner appropriate to their architectures, which may be similar to or different from the example of FIG. 8.
FIG. 10 is a flowchart of an example process for training a multimodal language model neural network, such as language model neural network 800 of FIG. 8. The process of fig. 10 may be performed by a system of one or more computers located in one or more locations, such as by computer system 100 of fig. 1.
As previously described, there are many different types of models (e.g., VLMs) that can be used as the multimodal language model neural network. Where such a model is trained (e.g., fine-tuned) for task success detection, this may be done in a manner generally appropriate for training that model.
As one example, the multimodal language model neural network may include a model, such as a Transformer-based model, configured to process a sequence of tokens comprising tokens from the input token string and observation tokens. The observation tokens may include image tokens (e.g., from encoding an image or a patch of an image) or multimodal tokens. Such a model may process, and be trained on, a mixed sequence of such tokens. For example, such a model may be trained natively on a combination (e.g., a sequence) of text tokens and observation tokens (such as image tokens) provided as input to the model. In general, such training may involve backpropagating gradients of a loss function, e.g., as described below. Optionally, but not necessarily, such a model may also include a cross-attention mechanism as previously described.
In the example of FIG. 10, training the multimodal language model neural network 800 begins with a pre-trained and frozen text-only language model. However, it is not necessary to start with a pre-trained and frozen text-only language model.
Continuing with this particular example, in some implementations the token processing layers 830 described above may be obtained from a pre-trained and frozen text-only language model (e.g., from an autoregressive model that contains such layers). The language model may be a so-called large language model, e.g., having more than 10¹⁰, 10¹¹, or 10¹² trainable (trained) parameters. Many such models exist; as just one example, the "Chinchilla" model (Hoffmann et al., "Training Compute-Optimal Large Language Models", arXiv:2203.15556) may be used. Such text-only language models can be trained on the very large amounts of unlabeled text available, e.g., from books and the Internet. The multimodal language model neural network 800 can then be trained while keeping the neural network parameter values (e.g., weights) of the token processing layers 830 frozen, i.e., constant, during training.
In more detail, this example training process may include obtaining a trained, generative natural or computer language neural network (step 1000). The generative natural or computer language neural network is configured to process an input token string representing words in a natural or computer language to generate output tokens in the natural or computer language, and includes a stack of trained token processing layers.
The process may then form the multimodal language model neural network (step 1002). The multimodal language model neural network is configured to process both i) a sequence of text tokens representing words in a natural or computer language, and ii) observation data comprising observations of an environment, to generate a language model output comprising a string of one or more output tokens, e.g., generated one token at a time.
The multimodal language model neural network may be as described above, i.e., similar to the multimodal language model neural network 800. For example, the multimodal language model neural network may include a stack of processing layers comprising a plurality of the trained token processing layers 830 and a plurality of cross-attention layers 820, each cross-attention layer arranged to receive a compressed representation of the observation data, the trained token processing layers interleaved with the cross-attention layers.
The process may further include obtaining a set of training data items (step 1004). Each training data item may include an observation of the environment, and natural or computer language text associated with the observation.
The process may train the multimodal language model neural network using the training data items by adjusting parameters of the cross-attention layers. In some implementations, but not necessarily, this is done while keeping the parameters of the trained token processing layers frozen, i.e., constant (step 1006). For example, the multimodal language model neural network may be trained to predict the text tokens representing the natural or computer language text in a training data item while conditioned on observation data derived from the observation in the training data item.
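By way of illustration only, the following sketch shows such selective freezing, assuming a PyTorch model with an illustratively named token_processing_layers module; the module name and optimizer choice are assumptions.

# Freeze the pre-trained token processing layers; only the remaining
# parameters (e.g., the cross-attention layers) are updated.
for p in model.token_processing_layers.parameters():  # illustrative module name
    p.requires_grad = False

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-4)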
Training the multimodal language model neural network may involve backpropagating gradients of a loss function (with respect to the parameters of the cross-attention layers) through the trained token processing layers 830. The loss function may be one that encourages the multimodal language model neural network to generate a language model output 124, in particular an output token string, representing natural or computer language text that matches the ground-truth language corresponding to the observation in the training data item. This may involve using the multimodal language model neural network to process observation data derived from the observation in a training data item, together with an input token string representing the observation's natural or computer language text.
The output token string may be generated one token at a time. For example, the loss function may be based on $p(y_\ell \mid y_{<\ell}, x)$, where $x$ represents the observation data (in particular the compressed representation described above), $L$ is the number of tokens in the input token string, $y_\ell$ represents the $\ell$-th token of the input token string, $y_{<\ell}$ represents the first $\ell - 1$ tokens of the input token string, and $p(y_\ell \mid y_{<\ell}, x)$ represents the probability that the multimodal language model neural network, by processing $y_{<\ell}$ and $x$, generates the output token corresponding to $y_\ell$. For example, the loss function may include a negative log-likelihood term $-\sum_{\ell=1}^{L} \log p(y_\ell \mid y_{<\ell}, x)$, e.g., averaged over a minibatch of training data items.
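By way of illustration only, a minimal sketch of this negative log-likelihood computation is shown below; the model call signature is an assumption.

import torch.nn.functional as F

def nll_loss(model, token_ids, obs):
    # token_ids: [batch, L] text tokens; obs: the compressed observation
    # representation. The model is assumed to return next-token logits.
    logits = model(token_ids[:, :-1], obs)  # predicts tokens 2..L
    targets = token_ids[:, 1:]
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # [batch * (L - 1), vocab]
        targets.reshape(-1))  # mean over tokens in the minibatch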
In some implementations, training the multimodal language model neural network also involves training the observation encoder neural network 814 described above, i.e., adjusting parameters of the observation encoder neural network. Alternatively, however, a pre-trained and frozen observation encoder neural network may be used, e.g., one trained using contrastive learning, which does not require labeled data. Where gating parameters are used for the cross-attention layers, these may be gradually increased during training to gradually increase the influence of the cross-attention layers.
For pre-training, the training data items may include, for example, still or moving images (e.g., video clips) or audio clips, together with accompanying natural or computer language text, such as a narration, describing the content of the images or audio clips. The natural or computer language in the training data items need not be in question-answer format. For fine-tuning, manual or automatic annotation may be used to generate some (additional) training data items, e.g., collecting visual data from cameras in the environment or, where needed, collecting data from a robot for proprioceptive observations.
In some implementations, the multimodal language model neural network may be trained partially or fully in simulation, i.e., using simulated training data representing a simulation of a real-world environment, before the multimodal language model neural network is used to process observations from the real-world environment that was simulated. In such cases, the simulation need only be approximate, as one advantage of the described techniques is that they generalize to new tasks, environments, behaviors, and language.
Some examples of the robustness of the described techniques are now described. A task such as "Bring me the banana that is in the pantry" may also be expressed as "Bring some fruits or vegetables from the pantry" or "Bring the yellow colored object near me". The language-model-based approach, and in particular the VLM-based approach, helps the system understand that these different descriptions can relate to the same task. One example implementation of the system exhibits near-human success detection performance (88%) both for previously unseen samples of the tasks in the training dataset and for previously unseen behavior (new, out-of-distribution agent behavior). The described techniques are also able to detect the success of new tasks not seen during training, such as "arrange 4 pointy objects in a square shape in the bedroom" (where only "arc" and "triangle", and no "square", are mentioned in the training dataset) or "hit the candle using the pillow which is left of the airplane in the living room" (where "hit" is not mentioned in the training dataset). It is conjectured that this generalization capability is facilitated by pre-training the multimodal language model neural network on a large (e.g., web-scale) dataset.
FIG. 11 shows the use of the multimodal language model neural network 120 for task success detection, in particular for determining whether a Panda robotic arm task was successfully performed ("Q: Did the robot successfully insert a medium gear?").
FIGS. 12A to 12C illustrate evaluations of success detection performed when changing the camera viewpoint and when including one or more distractor objects. FIG. 12A shows a video clip from the front camera as used in the training data in this particular implementation (the baseline), FIG. 12B shows a video clip from a rear camera (not present in the training data), and FIG. 12C shows a video clip from a test that includes a distractor object (a nail). The baseline success detection performance for "insert" tasks of the type shown is in the range of 90% to 95%; it drops by about 5% with the change to the rear camera viewpoint, and exhibits little change when a distractor object is present.
The robustness of the described techniques to these types of changes in tasks, agents, and environments (e.g., visual changes such as camera viewpoint, lighting conditions, and background changes) facilitates efficient training, evaluation, and use of the action selection system 110 as previously described. Thus, implementations of the system may provide an action selection system 110 with improved capability for acting in a real-world environment.
As previously described, implementations of the described techniques are also capable of detecting the successful performance of a task by a human, and thus may be used to guide a human in performing a task. Such a system may be trained, for example, using the Ego4D dataset of publicly available egocentric ("human-in-the-wild") video. The videos show people performing common tasks (e.g., washing dishes, washing cars, planting flowers and trees), and the Ego4D Forecasting + Hands and Objects (FHO) dataset has corresponding narrations describing the actions of the camera wearer in the video. Other annotations include so-called critical state changes, i.e., "how the camera wearer changes the state of an object by using or manipulating it", as well as "critical frames" PRE, point-of-no-return (PNR), and POST, which indicate when a state change has occurred, together with action verbs, object nouns, and state-change types. The PNR frame annotates the beginning of the state change, the PRE frame indicates a point before the state change, and the POST frame indicates a point after the state change is complete. The PNR frame can be treated as the point where "success" occurs, and to generate negative samples, frames preceding the PRE frame can be used. Questions may be generated by modifying the narrations into questions, for example using a VLM such as Flamingo (supra). FIG. 13 shows the generation of (additional) training data items, for success detection of human task performance, from the Ego4D dataset for a rolling task.
In some implementations of the systems and processes described above, the agent 102 is a mechanical agent, the environment 104 is a real-world environment, and the observations (in particular the observations processed by the action selection system 110 and from which the observation data for the language model neural network 120 is derived) are from one or more sensors sensing the real-world environment. The actions 114 are used to control the mechanical agent acting in the real-world environment to perform the task, and may be used to control the agent 102 in the real-world environment.
In some implementations, the agent 102 is a simulation of a mechanical agent, the environment 104 is a simulation of a particular real-world environment, the observations are related to the particular real-world environment, and the actions are related to actions to be performed by the mechanical agent acting in the particular real-world environment to perform the task. After simulated training or evaluation of the action selection system 110 described above using mechanical agents in the simulation of a particular real-world environment, the process may then perform tasks using mechanical agents in the particular real-world environment. That is, training or evaluation of the action selection system may be performed partially or fully in the simulation before the action selection system is used in the real world.
More specifically, where the environment 104 is a real-world environment, the agent 102 may be a mechanical agent that interacts with the real-world environment, such as a robot or an autonomous or semi-autonomous land, air, or sea vehicle that operates in or navigates through the environment, and the actions are actions taken by the mechanical agent in the real-world environment to perform the task. For example, the agent may be a robot or vehicle that interacts with the environment to accomplish a specific task, e.g., to locate an object of interest in the environment, to manipulate or change the state of a specified object, to move a specified object to a specified location in the environment, or to navigate to a specified destination in the environment. Additionally or alternatively, the task may be characterized by an ongoing behavior (e.g., "sweeping the house with a vacuum cleaner"). The information defining the task, such as an object, state, location, destination, or behavior, may be specified in a natural or computer language or by other instructions, e.g., from a user interface of the system, as previously described.
The observations processed by the action selection system 110 and/or by the language model neural network 120 may be derived from one or more sensors of the environment 104 (e.g., from a camera or other image sensor, or from a microphone). The sensors may be mounted on the agent and/or located separately from the agent in the environment. In addition to or instead of the previously described sensor data, the observations may include various types of sensor data, such as object position data, data from distance or position sensors, data from actuators, or sensed electronic signals (such as motor current or temperature signals). The observations may also include data characterizing the current state of the mechanical agent or robot, e.g., one or more of joint position, joint velocity, joint force, torque, or acceleration (e.g., gravity-compensated torque feedback), and the global or relative pose of an object held by the robot or of one or more parts of the agent. Optionally, in any of the described implementations, the observations at a given time step may include (observation) data from a previous time step that may be beneficial in characterizing the environment.
The actions 114 may include control signals for controlling the robot or other mechanical agent, and the control system may be used to generate the control signals for controlling the mechanical agent. The control signals may include, for example, torques for the joints of the robot, or torques for a control surface or other control element (e.g., a steering control element of a vehicle), or higher-level control commands. More generally, the control signals may include, for example, position, velocity, or force, torque, or acceleration data for one or more joints or other parts of a robot or other mechanical agent. The control signals may also or instead include electronic control data, such as motor control data, or signals controlling navigation (e.g., steering, movement, braking, and/or acceleration of the agent).
The term "configured" is used in this specification in connection with systems and computer program components. By a system of one or more computers to be configured to perform a particular operation or action is meant that the system has installed thereon software, firmware, hardware, or a combination thereof that in operation causes the system to perform the operation or action. By one or more computer programs to be configured to perform a particular operation or action, it is meant that the one or more programs include instructions that, when executed by a data processing apparatus, cause the apparatus to perform the operation or action. Embodiments of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible, non-transitory storage medium, for execution by, or to control the operation of, data processing apparatus. The computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them. Alternatively or additionally, the program instructions may be encoded on an artificially generated propagated signal (e.g., a machine-generated electrical, optical, or electromagnetic signal) that is generated to encode information for transmission to suitable receiver apparatus for execution by data processing apparatus.
The term "data processing apparatus" refers to data processing hardware and encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The device may also be or further comprise dedicated logic circuitry, e.g. an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit). In addition to hardware, the device may optionally include code that creates an execution environment for the computer program, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.
A computer program (which may also be referred to or described as a program, software application, app, module, software module, script, or code) can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data, such as one or more scripts stored in a markup language document, in a single file dedicated to the program in question, or in multiple coordinated files, such as files that store one or more modules, sub-programs, or portions of code. A computer program can be deployed to be executed on one computer or on multiple computers at one site or distributed across multiple sites and interconnected by a data communication network.
In this specification, the term "database" is used broadly to refer to any collection of data that need not be structured in any particular way, or structured at all, and that may be stored on storage in one or more locations. Thus, for example, an index database may include multiple data sets, each of which may be organized and accessed differently.
Similarly, in this specification, the term "engine" is used broadly to refer to a software-based system, subsystem, or process that is programmed to perform one or more particular functions. Typically, an engine will be implemented as one or more software modules or components installed on one or more computers in one or more locations. In some cases, one or more computers will be dedicated to a particular engine; in other cases, multiple engines may be installed and running on the same computer or computers.
The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by, or in combination with, special purpose logic circuitry, e.g., an FPGA or ASIC, and one or more programmed computers.
A computer suitable for executing a computer program may be based on a general-purpose or special-purpose microprocessor or both, or any other kind of central processing unit. Typically, a central processing unit will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a central processing unit for carrying out or executing instructions and one or more memory devices for storing instructions and data. The central processing unit and the memory may be supplemented by, or incorporated in, special purpose logic circuitry. Typically, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical, or optical disks. However, a computer need not have such devices. Furthermore, the computer may be embedded in another device, such as a mobile phone, a Personal Digital Assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device, such as a Universal Serial Bus (USB) flash drive, to name a few.
Computer readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices including, for example, semiconductor memory devices (e.g., EPROM, EEPROM, and flash memory devices), magnetic disks (e.g., internal hard disk or removable magnetic disks), magneto-optical disks, and CD ROM and DVD-ROM disks.
To provide for interaction with a user, embodiments of the subject matter described in this specification can be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user can provide input to the computer. Other types of devices may also be used to provide interaction with the user, for example, feedback provided to the user may be any form of sensory feedback, such as visual feedback, auditory feedback, or tactile feedback, and input from the user may be received in any form, including acoustic, speech, or tactile input. Further, the computer may interact with the user by sending and receiving documents to and from the device used by the user, e.g., by sending web pages to a web browser on the user's device in response to requests received from the web browser. Further, the computer may interact with the user by sending text messages or other forms of messages to a personal device (e.g., a smart phone running a messaging application) and receiving responsive messages from the user in response.
The data processing apparatus for implementing the machine learning model may also include, for example, dedicated hardware accelerator units for processing the common and computationally intensive parts of machine learning training or production, i.e., inference, workloads.
The machine learning model may be implemented and deployed using a machine learning framework (e.g., tensorFlow framework).
Embodiments of the subject matter described in this specification can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a client computer having a graphical user interface, a web browser, or an app through which a user can interact with an implementation of the subject matter described in this specification), or any combination of one or more such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a Local Area Network (LAN) and a Wide Area Network (WAN), such as the internet.
The computing system may include clients and servers. The client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. In some embodiments, the server transmits data (e.g., HTML pages) to the user device, e.g., for the purpose of displaying data to and receiving user input from a user interacting with the device acting as a client. Data generated at the user device, e.g., results of the user interaction, may be received at the server from the device.
While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Furthermore, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
Similarly, although operations are depicted in the drawings and described in a particular order in the claims, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Specific embodiments of the present subject matter have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous.

Claims (29)

1.一种使用语言模型神经网络来训练或评估动作选择系统的计算机实现的方法,所述动作选择系统用于控制在环境中行动的智能体执行任务,1. A computer-implemented method for using a language model neural network to train or evaluate an action selection system for controlling an intelligent agent acting in an environment to perform a task, 其中,所述动作选择系统被配置为处理表征所述环境的当前状态的观察,以生成用于选择要由所述智能体实现以执行所述任务的动作的策略输出,并且wherein the action selection system is configured to process observations representing the current state of the environment to generate a policy output for selecting an action to be implemented by the agent to perform the task, and 其中,所述语言模型神经网络包括多模态模型,并且被配置为处理以下两者以生成语言模型输出:i)表示呈自然或计算机语言的单词的文本词元序列,以及ii)包括对所述环境的观察的观察数据;wherein the language model neural network comprises a multimodal model and is configured to process both: i) a sequence of text tokens representing words in a natural or computer language, and ii) observation data comprising observations of the environment to generate a language model output; 所述方法包括:The method comprises: 获得定义所述任务的信息;obtaining information defining said task; 从定义所述任务的所述信息生成询问所述任务或所述任务的阶段是否已经被实现的自然或计算机语言问题;generating, from the information defining the task, a natural or computer language question asking whether the task or a stage of the task has been achieved; 为所述语言模型神经网络生成表示所述自然或计算机语言问题的输入词元字符串;generating an input word-meta string representing the natural or computer language question for the language model neural network; 在所述智能体使用所述动作选择系统来尝试执行所述任务的同时从所述环境获得观察数据,所述观察数据表示所述智能体在所述环境中行动以执行所述任务;以及obtaining observation data from the environment while the agent attempts to perform the task using the action selection system, the observation data representing the agent acting in the environment to perform the task; and 使用所述语言模型神经网络来处理所述输入词元字符串和来自所述环境的所述观察数据,以生成所述语言模型输出,processing the input word-gram string and the observation data from the environment using the language model neural network to generate the language model output, 其中,所述语言模型输出包括对所述自然或计算机语言问题的回答,所述回答定义所述任务或所述任务的所述阶段是否已经被实现;以及wherein the language model output comprises an answer to the natural or computer language question, the answer defining whether the task or the stage of the task has been achieved; and 训练所述动作选择系统以使用所述回答来执行所述任务,或者评估是否使用所述动作选择系统以使用所述回答来执行所述任务。The action selection system is trained to perform the task using the answer, or the action selection system is evaluated for use to perform the task using the answer. 2.如权利要求1所述的方法,进一步包括在所述训练之后或者取决于所述评估的结果,使用所述动作选择系统来控制所述智能体执行所述任务。2. The method of claim 1, further comprising using the action selection system to control the agent to perform the task after the training or depending on the results of the evaluation. 3.如权利要求1或2所述的方法,其中,训练所述动作选择系统以使用所述回答来执行所述任务包括:3. The method of claim 1 or 2, wherein training the action selection system to use the answer to perform the task comprises: 使用基于从所述回答确定的奖励的强化学习技术来训练所述动作选择系统。The action selection system is trained using a reinforcement learning technique based on rewards determined from the responses. 4.一种使用语言模型神经网络来训练动作选择系统的计算机实现的方法,所述动作选择系统用于控制在环境中行动的智能体执行特定任务,4. 
A computer-implemented method for using a language model neural network to train an action selection system for controlling an intelligent agent acting in an environment to perform a specific task, 其中,所述动作选择系统被配置为处理表征所述环境的当前状态的观察,以生成用于选择要由所述智能体实现以执行所述特定任务的动作的策略输出,并且wherein the action selection system is configured to process observations representing the current state of the environment to generate a policy output for selecting an action to be implemented by the agent to perform the specific task, and 其中,所述语言模型神经网络包括多模态模型,并且被配置为处理以下两者以生成语言模型输出:i)表示呈自然或计算机语言的单词的文本词元序列,以及ii)包括对所述环境的观察的观察数据;wherein the language model neural network comprises a multimodal model and is configured to process both: i) a sequence of text tokens representing words in a natural or computer language, and ii) observation data comprising observations of the environment to generate a language model output; 所述方法包括:The method comprises: 获得一组训练数据项,每个训练数据项包括对实体在环境中行动同时尝试执行任务的观察;obtaining a set of training data items, each training data item comprising an observation of an entity acting in an environment while attempting to perform a task; 获得定义所述特定任务的信息;obtaining information defining the specific task; 从定义所述特定任务的所述信息生成询问所述特定任务或所述特定任务的阶段是否已经被实现的自然或计算机语言问题;以及generating, from the information defining the specific task, a natural or computer language question asking whether the specific task or a stage of the specific task has been achieved; and 对于所述训练数据项中的每一个:For each of the training data items: 为所述语言模型神经网络生成表示所述自然或计算机语言问题的输入词元字符串;generating an input word-meta string representing the natural or computer language question for the language model neural network; 使用所述语言模型神经网络来处理所述输入词元字符串和来自所述训练数据项中的所述观察的观察数据,以生成所述语言模型输出,processing the input word-gram string and the observed data from the observations in the training data item using the language model neural network to generate the language model output, 其中,所述语言模型输出包括对所述自然或计算机语言问题的回答,所述回答定义所述特定任务或所述特定任务的所述阶段在所述训练数据项中是否已经被实现;以及wherein the language model output comprises an answer to the natural or computer language question, the answer defining whether the specific task or the stage of the specific task has been achieved in the training data item; and 对于所述训练数据项,标识所述特定任务在所述训练数据项中是否已经被实现;以及For the training data item, identifying whether the specific task has been implemented in the training data item; and 使用经标识的训练数据项来训练所述动作选择系统以执行所述特定任务。The action selection system is trained using the identified training data items to perform the particular task. 5.一种使用语言模型神经网络来指示在环境中行动的智能体执行任务的计算机实现的方法,其中,所述语言模型神经网络包括多模态模型,并且被配置为处理以下两者以生成语言模型输出:i)表示呈自然或计算机语言的单词的文本词元序列,以及ii)包括对所述环境的观察的观察数据,所述方法包括:5. 
A computer-implemented method for instructing an agent acting in an environment to perform a task using a language model neural network, wherein the language model neural network comprises a multimodal model and is configured to process both i) a sequence of text tokens representing words in a natural or computer language, and ii) observation data comprising observations of the environment to generate a language model output, the method comprising: 获得定义所述任务的信息;obtaining information defining said task; 从定义所述任务的所述信息生成询问所述任务或所述任务的阶段是否已经被实现的自然或计算机语言问题;generating, from the information defining the task, a natural or computer language question asking whether the task or a stage of the task has been achieved; 向所述语言模型神经网络生成表示所述自然或计算机语言问题的输入词元字符串;generating an input word-meta string representing the natural or computer language question to the language model neural network; 从所述环境获得观察数据,所述观察数据表示所述智能体在所述环境中行动以执行所述任务;obtaining observation data from the environment, the observation data representing actions of the agent in the environment to perform the task; 使用所述语言模型神经网络来处理所述输入词元字符串和来自所述环境的所述观察数据,以生成所述语言模型输出,processing the input word-gram string and the observation data from the environment using the language model neural network to generate the language model output, 其中,所述语言模型输出包括对所述自然或计算机语言问题的回答,所述回答定义所述任务或所述任务的所述阶段是否已经被实现;以及wherein the language model output comprises an answer to the natural or computer language question, the answer defining whether the task or the stage of the task has been achieved; and 使用所述回答来指示所述智能体。The response is used to instruct the agent. 6.如权利要求5所述的方法,其中,使用所述回答来指示所述智能体包括使用所述回答来控制所述智能体执行所述任务所采取的动作。6. The method of claim 5, wherein using the answer to instruct the agent comprises using the answer to control actions taken by the agent to perform the task. 7.如权利要求1至6中任一项所述的方法,其中,所述语言模型神经网络被配置为对所述输入词元字符串和所述观察数据进行联合建模以确定所述语言模型输出。7. The method of any one of claims 1 to 6, wherein the language model neural network is configured to jointly model the input word-gram string and the observation data to determine the language model output. 8.如权利要求1至7中任一项所述的方法,其中,使用所述语言模型神经网络来处理所述输入词元字符串和来自所述环境的所述观察数据包括:8. The method of any one of claims 1 to 7, wherein processing the input word-gram string and the observation data from the environment using the language model neural network comprises: 使用观察编码器神经网络子系统来处理所述观察数据以确定所述观察数据的压缩表示;以及processing the observation data using an observation encoder neural network subsystem to determine a compressed representation of the observation data; and 通过在所述压缩表示和从所述输入词元字符串导出的数据之间应用交叉注意力机制来处理所述压缩表示和从所述输入词元字符串导出的所述数据,以生成所述语言模型输出。The compressed representation and the data derived from the input word-gram string are processed by applying a cross-attention mechanism between the compressed representation and the data derived from the input word-gram string to generate the language model output. 9.如权利要求8所述的方法,其中,所述语言模型神经网络包括处理层的堆叠,所述处理层的堆叠包括多个词元处理层和多个交叉注意力层,每个交叉注意力层被布置为接收所述压缩表示,所述词元处理层与所述交叉注意力层交错;并且9. 
The method of claim 8, wherein the language model neural network comprises a stack of processing layers, the stack of processing layers comprising a plurality of word-gram processing layers and a plurality of criss-cross attention layers, each criss-cross attention layer being arranged to receive the compressed representation, the word-gram processing layers being interleaved with the criss-cross attention layers; and 其中,处理所述压缩表示和从所述输入词元字符串导出的所述数据包括:Wherein, processing the compressed representation and the data derived from the input word-meta string comprises: 使用所述处理层的堆叠来处理从所述输入词元字符串导出的所述数据,processing the data derived from the input word-meta string using the stack of processing layers, 其中,使用所述处理层的堆叠来处理从所述输入词元字符串导出的所述数据进一步包括使用所述交叉注意力层中的每一个来关注所述压缩表示。Wherein processing the data derived from the input word-gram string using the stack of processing layers further comprises attending to the compressed representation using each of the criss-cross attention layers. 10.如权利要求8或9所述的方法,其中,所述压缩表示包括一组观察词元,并且其中,使用所述观察编码器神经网络子系统来处理所述观察数据以确定所述观察数据的所述压缩表示进一步包括:10. The method of claim 8 or 9, wherein the compressed representation comprises a set of observation tokens, and wherein processing the observation data using the observation encoder neural network subsystem to determine the compressed representation of the observation data further comprises: 使用观察编码器来处理所述观察数据以生成包括一组观察特征的经编码的观察数据;以及processing the observation data using an observation encoder to generate encoded observation data comprising a set of observation features; and 通过使用被配置为交叉关注所述观察特征的一个或多个观察编码器交叉注意力层处理一组潜在向量,来从所述经编码的观察数据确定所述一组观察词元,以生成所述一组观察词元。The set of observation word-grams is determined from the encoded observation data by processing a set of latent vectors using one or more observation encoder cross-attention layers configured to cross-attend to the observation features to generate the set of observation word-grams. 11.如权利要求1至10中任一项所述的方法,其中,所述语言模型神经网络是语言生成神经网络,其中,所述输入词元字符串包括从表示所述自然或计算机语言问题的自然或计算机语言的单词的词元词表中选择的文本词元,并且其中,所述语言模型输出被配置为生成包括从所述词元词表中选择的一个或多个文本词元的输出词元字符串,所述输出词元字符串定义对所述自然或计算机语言问题的自然或计算机语言回答。11. The method of any one of claims 1 to 10, wherein the language model neural network is a language generation neural network, wherein the input word-gram string comprises text words selected from a word-gram vocabulary of words in a natural or computer language representing the natural or computer language question, and wherein the language model output is configured to generate an output word-gram string comprising one or more text words selected from the word-gram vocabulary, the output word-gram string defining a natural or computer language answer to the natural or computer language question. 12.如权利要求1至10中任一项所述的方法,其中,所述语言模型神经网络是语言生成神经网络,其中,所述输入词元字符串包括从表示所述自然或计算机语言问题的自然或计算机语言的单词的词元词表中选择的文本词元,并且其中,所述语言模型输出定义词元分数在所述词元词表上的分布,所述方法进一步包括处理所述词元词表的两个相应词元的所述词元分数以确定所述回答。12. The method of any one of claims 1 to 10, wherein the language model neural network is a language generation neural network, wherein the input word-gram string comprises text words selected from a word-gram vocabulary of words in a natural or computer language representing the natural or computer language question, and wherein the language model output defines a distribution of word-gram scores over the word-gram vocabulary, the method further comprising processing the word-gram scores of two corresponding words of the word-gram vocabulary to determine the answer. 13.如权利要求11或12所述的方法,其中,所述语言生成神经网络是被配置为处理所述输入词元字符串的词元以依序生成所述输出字符串的词元的自回归神经网络。13. 
The method of claim 11 or 12, wherein the language generation neural network is an autoregressive neural network configured to process the word-grams of the input word-gram string to sequentially generate the word-grams of the output string. 14.如权利要求11或权利要求13所述的方法,当从属于权利要求11时,其中,所述自然或计算机语言回答包括呈所述自然或计算机语言的定义所述任务的成功或失败的单个单词。14. A method as claimed in claim 11 or claim 13 when dependent on claim 11, wherein the natural or computer language answer comprises a single word in the natural or computer language defining success or failure of the task. 15.如权利要求1至14中任一项所述的方法,其中,从所述环境获得观察数据包括:15. The method of any one of claims 1 to 14, wherein obtaining observation data from the environment comprises: 在所述任务的执行的阶段序列处从所述环境中获得观察数据,并且对于所述阶段中的每个阶段:Observation data is obtained from the environment at a sequence of phases of performance of the task, and for each of the phases: 使用所述语言模型神经网络来处理所述输入词元字符串和所述阶段的来自所述环境的所述观察数据,以生成包括所述阶段的所述回答的所述语言模型输出;以及processing the input word-gram string and the observation data from the environment for the stage using the language model neural network to generate the language model output comprising the answer for the stage; and 从所述阶段中的两个或更多个阶段的所述语言模型输出确定所述任务或所述任务的所述阶段是否已经被实现。Whether the task or the stage of the task has been achieved is determined from the language model outputs of two or more of the stages. 16.如权利要求15所述的方法,其中,确定所述任务或所述任务的所述阶段是否已经被实现包括确定所述阶段中的两个或更多个阶段的所述回答之间的差异。16. The method of claim 15, wherein determining whether the task or the phase of the task has been achieved comprises determining a difference between the answers for two or more of the phases. 17.如权利要求15所述的方法,其中,确定所述任务或所述任务的所述阶段是否已经被实现包括确定所述回答是否指示所述任务对于所述阶段中的任何阶段均已经被实现。17. The method of claim 15, wherein determining whether the task or the phase of the task has been achieved comprises determining whether the answer indicates that the task has been achieved for any of the phases. 18.如权利要求1至17中任一项所述的方法,其中18. The method of any one of claims 1 to 17, wherein 获得定义所述任务的所述信息包括获得目标观察数据,所述目标观察数据定义用于所述任务或所述任务的所述阶段的成功执行的所述环境的目标状态;并且进一步包括:Obtaining the information defining the task includes obtaining target observation data defining a target state of the environment for successful execution of the task or the phase of the task; and further comprising: 从所述自然或计算机语言问题和所述目标观察数据的组合为所述语言模型神经网络生成提示;以及generating a prompt for the language model neural network from a combination of the natural or computer language question and the target observation data; and 使用所述语言模型神经网络来处理包括所述自然或计算机语言问题和所述目标观察数据的所述提示,并且然后处理来自所述环境的所述观察数据。The prompt comprising the natural or computer language question and the target observation is processed using the language model neural network, and the observation from the environment is then processed. 19.如权利要求1至18中任一项所述的方法,其中,所述观察数据包括包含图像或图像序列的视觉数据。19. The method of any one of claims 1 to 18, wherein the observation data comprises visual data comprising an image or a sequence of images. 20.如权利要求1至19中任一项所述的方法,其中,所述任务或所述任务的所述阶段是否已经被实现是由所述智能体是否显示正在进行的行为来定义的。20. A method as claimed in any one of claims 1 to 19, wherein whether the task or the phase of the task has been achieved is defined by whether the agent displays ongoing behaviour. 21.一种训练多模态语言模型神经网络在如权利要求1至20中任一项所述的方法中使用的计算机实现的方法,其中,所述方法包括:21. 
21. A computer-implemented method of training a multimodal language model neural network for use in the method of any one of claims 1 to 20, the method comprising:
obtaining a trained generative natural or computer language neural network, wherein the generative natural or computer language neural network is configured to process a string of input tokens representing words in a natural or computer language to generate output tokens in the natural or computer language, and comprises a stack of trained token processing layers;
forming the multimodal language model neural network, wherein the multimodal language model neural network is configured to process both i) a sequence of text tokens representing words in a natural or computer language, and ii) observation data comprising observations of the environment, to generate a language model output comprising a string of one or more output tokens,
the multimodal language model neural network comprising a stack of processing layers, the stack of processing layers comprising a plurality of the trained token processing layers and a plurality of cross-attention layers, each cross-attention layer being arranged to receive a compressed representation of the observation data, the trained token processing layers being interleaved with the cross-attention layers;
obtaining a set of training data items, each training data item comprising an observation of an environment and natural or computer language text relating to the observation; and
training the multimodal language model neural network using the training data items by adjusting parameters of the cross-attention layers while keeping parameters of the trained token processing layers frozen.

22. The method of any one of claims 1 to 21, wherein the agent is a mechanical agent, the environment is a real-world environment, the observations come from one or more sensors sensing the real-world environment, and the actions are used to control the mechanical agent acting in the real-world environment to perform the task.
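[Editor's note: the training recipe of claim 21 — adjust only the cross-attention parameters while the pretrained token processing layers stay frozen — might look like the sketch below, assuming the MultimodalLM-style module sketched earlier and a wrapper that yields vocabulary logits; the loss, optimizer, and data format are assumptions, not the patented method.]

```python
import torch
import torch.nn.functional as F

def freeze_pretrained(model):
    # Keep the trained token processing layers frozen (claim 21)...
    for p in model.lm_layers.parameters():
        p.requires_grad_(False)

def train_step(model, optimizer, obs_features, input_ids, target_ids):
    # ...and update only the cross-attention (and resampler) parameters.
    logits = model(input_ids, obs_features)            # (B, T, vocab)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), target_ids.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage sketch:
# freeze_pretrained(model)
# optimizer = torch.optim.AdamW(
#     (p for p in model.parameters() if p.requires_grad), lr=1e-4)
```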
23. The method of any one of claims 1 to 21, when dependent on any one of claims 1 to 4, wherein the agent is a simulation of a mechanical agent, the environment is a simulation of a particular real-world environment, the observations relate to the particular real-world environment, and the actions relate to actions to be performed by the mechanical agent acting in the particular real-world environment to perform the task, the method further comprising, after using the simulation of the mechanical agent in the simulation of the particular real-world environment to train or evaluate the action selection system, using the mechanical agent in the particular real-world environment to perform the task.

24. The method of any one of claims 5 to 21, when dependent on claim 5, wherein the agent comprises a user of a digital assistant, the method comprising:
obtaining the information defining the task from the digital assistant; and
using the digital assistant to instruct the user based on the answer.

25. The method of claim 24, further comprising:
receiving, at the digital assistant, a request for assistance from the user;
determining, in response to the request, a series of tasks to be performed by the user; and
for one or more tasks in the series of tasks:
outputting, from the digital assistant to the user, an indication of the task to be performed;
using the digital assistant to capture a visual or audio observation of the user performing the task; and
determining from the answer whether the user has successfully achieved the task and, in response, proceeding to the next task in the series of tasks.

26. One or more computer storage media storing instructions that, when executed by one or more computers, cause the one or more computers to perform the operations of the respective method of any one of claims 1 to 25.

27. A system comprising one or more computers and one or more storage devices storing instructions that, when executed by the one or more computers, cause the one or more computers to perform the operations of the respective method of any one of claims 1 to 25.
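[Editor's note: the assistance flow of claims 24 and 25 can be pictured as a simple control loop. In this sketch every interface (speak, capture_observation, is_success) is a hypothetical stand-in for the digital assistant's subsystems, and the retry bound is an added assumption not present in the claims.]

```python
def assist(tasks, speak, capture_observation, is_success, max_attempts=3):
    for task in tasks:
        for _ in range(max_attempts):
            # Output an indication of the task to be performed.
            speak(task.instruction)
            # Capture a visual or audio observation of the user performing it.
            observation = capture_observation()
            # Read the language model's answer as success or failure.
            if is_success(task.question, observation):
                break  # proceed to the next task in the series
        else:
            speak(f"Let's come back to: {task.instruction}")
```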
28. An agent configured to select actions to perform one or more tasks in an environment, the agent comprising the system of claim 27 and further comprising:
one or more observation capture subsystems for capturing observations of the environment;
a natural or computer language interface for receiving a natural or computer language description of a task to be performed, wherein the natural or computer language description of the task to be performed comprises the information defining the task;
an action selection system, wherein the action selection system is configured to process a first captured observation of the environment to generate an action selection policy output for selecting an action, at a time step, to control the agent to perform the task; and
an interface to the language model neural network, wherein the language model neural network is configured to process the input token string and a second captured observation of the environment to generate the language model output comprising the answer; and
wherein the action selection system is configured to be controlled using the answer from the language model neural network.

29. A digital assistant device comprising the system of claim 27, and further comprising:
a user interface to enable a user to request assistance and to output information;
an assistance subsystem configured to determine, in response to the request, a series of tasks to be performed by the user;
an observation capture subsystem for capturing visual or audio observations of the user performing a task;
an interface to the language model neural network; and
an assistance control subsystem configured to assist the user, wherein the assistance control subsystem is configured to, for one or more tasks in the series of tasks:
output, from the digital assistant to the user, an indication of the task to be performed;
use the observation capture subsystem to capture a visual or audio observation of the user performing the task; and
determine from the answer whether the user has successfully achieved the task and, in response, proceed to the next task in the series of tasks.
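[Editor's note: claim 28's agent couples an action selection policy with the language-model success detector. The sketch below shows one plausible wiring, in which the detector ends the episode and doubles as a sparse reward for training or evaluating the policy; the environment and policy interfaces are hypothetical and the reward scheme is an assumption.]

```python
def run_episode(env, policy, success_detector, task_question, max_steps=500):
    """Roll out the action selection policy; the language-model success
    detector controls when the episode ends and supplies a sparse reward."""
    observation = env.reset()
    trajectory = []
    for _ in range(max_steps):
        action = policy(observation)         # action selection policy output
        next_observation = env.step(action)  # hypothetical env returns the next observation
        done = success_detector(task_question, next_observation)
        reward = 1.0 if done else 0.0        # LM answer as a sparse task reward
        trajectory.append((observation, action, reward))
        observation = next_observation
        if done:
            break
    return trajectory
```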
CN202480011683.2A 2023-01-05 2024-01-05 Controlling agents using language-based success detectors Pending CN120712567A (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US202363437344P 2023-01-05 2023-01-05
US63/437,344 2023-01-05
PCT/EP2024/050243 WO2024146961A1 (en) 2023-01-05 2024-01-05 Controlling agents using language-based success detectors

Publications (1)

Publication Number Publication Date
CN120712567A (en) 2025-09-26

Family

ID=89619094

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202480011683.2A Pending CN120712567A (en) 2023-01-05 2024-01-05 Controlling agents using language-based success detectors

Country Status (3)

Country Link
EP (1) EP4627480A1 (en)
CN (1) CN120712567A (en)
WO (1) WO2024146961A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US12437238B1 (en) * 2024-03-20 2025-10-07 Anthropic, Pbc Generation of agentic trajectories for training artificial intelligence agents to automate multimodal interface task workflows
CN119150863B (en) * 2024-11-05 2025-03-18 中国计量大学 A dynamic reasoning method and system under a large language model
CN119377679B (en) * 2024-12-27 2025-04-01 鹏城实验室 Model training method, device, storage medium and computer equipment

Also Published As

Publication number Publication date
EP4627480A1 (en) 2025-10-08
WO2024146961A1 (en) 2024-07-11

Similar Documents

Publication Publication Date Title
US11663441B2 (en) Action selection neural network training using imitation learning in latent space
US10766136B1 (en) Artificial intelligence system for modeling and evaluating robotic success at task performance
US10766137B1 (en) Artificial intelligence system for modeling and evaluating robotic success at task performance
CN120712567A (en) Controlling agents using language-based success detectors
US10872294B2 (en) Imitation learning using a generative predecessor neural network
JP7354425B2 (en) Data-driven robot control
JP2019049604A (en) Instruction statement estimation system and instruction statement estimation method
JP7674599B2 (en) Controlling interactive agents using multimodal input
CN120283238A (en) Guided dialog using language to generate neural networks and searches
Yu et al. Human motion based intent recognition using a deep dynamic neural model
JP2024506580A (en) Neural network with adaptive gradient clipping
US20230019745A1 (en) Multi-modal sensor based process tracking and guidance
US20250209340A1 (en) Intra-agent speech to facilitate task learning
CN110192205A (en) mirror loss neural network
CN120476406A (en) Improved training of large neural networks
Sutanto et al. Supervised learning and reinforcement learning of feedback models for reactive behaviors: Tactile feedback testbed
EP3788554B1 (en) Imitation learning using a generative predecessor neural network
US11731279B2 (en) Systems and methods for automated tuning of robotics systems
CN116868203A (en) Neural network using adaptive gradient clipping
Wang et al. Multi-feature fusion for deep reinforcement learning: sequential control of mobile robots
KR101058471B1 (en) Intermediate goal generation method based on behavior trigger model, task learning method and system based on it
US20250284971A1 (en) Training neural networks through reinforcement learning using multi-objective reward neural networks
US20250131254A1 (en) Composable function-preserving expansions of neural networks
US20250245502A1 (en) Training neural networks using weight norm regularizations
US20220400905A1 (en) Systems and Methods for a Robot-adapted Cutting Board and Knife

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination