CN118103116A - Basic multi-modal agent interactions - Google Patents

Basic multi-modal agent interactions

Info

Publication number
CN118103116A
CN118103116A (application CN202280069306.5A)
Authority
CN
China
Prior art keywords
model
video game
machine learning
game application
user
Prior art date
Legal status
Pending
Application number
CN202280069306.5A
Other languages
Chinese (zh)
Inventor
W. B. Dolan
R. Volum
C. J. Brockett
G. A. DesGarennes
S. Rao
Current Assignee
Microsoft Technology Licensing LLC
Original Assignee
Microsoft Technology Licensing LLC
Priority date
Filing date
Publication date
Priority claimed from US 17/517,329 (US 2023/0122202 A1)
Application filed by Microsoft Technology Licensing LLC
Priority claimed from PCT/US2022/044051 (WO 2023/064067 A1)
Publication of CN118103116A (legal status: pending)

Landscapes

  • User Interface Of Digital Computer (AREA)

Abstract

Aspects of the present disclosure relate to basic multi-modal agent interactions, in which user input is processed using a multimodal machine learning model to generate model output. The model output may then be processed to affect the behavior of an application, for example to enable a user to control the application and/or to support user interaction with a conversation agent, among other examples. In some cases, at least a portion of the model output may be executed or parsed, for example to invoke an application programming interface (API) or a function of the application. Thus, using a multimodal machine learning model in accordance with aspects described herein enables the behavior of an application to be affected accordingly using natural language input provided by a user.

Description

Basic multi-modal agent interactions
Background
A user may provide natural language input for processing by a conversation agent. Similarly, the conversation agent may generate natural language output that is provided in response to the user, thereby enabling the user and the conversation agent to communicate. However, interactions between the user and the conversation agent may thus be limited to natural language, which may reduce the usefulness of the conversation agent to the user and/or limit the richness of such interactions.
It is with respect to these and other general considerations that the embodiments have been described. Furthermore, while relatively specific problems have been discussed, it should be understood that embodiments should not be limited to addressing the specific problems identified in the background.
Disclosure of Invention
Aspects of the present disclosure relate to basic multi-modal agent interactions, in which user input is processed using a multimodal machine learning model to generate model output. The model output may then be processed to affect the behavior of an application, for example to enable a user to control the application and/or to support user interaction with a conversation agent, among other examples. In some cases, at least a portion of the model output may be executed or parsed, for example to invoke an application programming interface (API) or a function of the application. Thus, using a multimodal machine learning model in accordance with aspects described herein enables the behavior of an application to be affected accordingly using natural language input provided by a user.
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
Drawings
Non-limiting and non-exhaustive examples are described with reference to the following figures.
FIG. 1 illustrates an overview of an example system for basic multi-modal agent interactions in accordance with aspects described herein.
FIG. 2A illustrates an overview of an example method for influencing an application based on model output from a multimodal machine learning model in accordance with aspects described herein.
FIG. 2B illustrates an overview of an example method for generating a multimodal response in accordance with aspects described herein.
FIG. 3 illustrates an overview of an example method for controlling a conversation agent using a multimodal generation platform in accordance with aspects described herein.
FIG. 4 illustrates an overview of an example method for controlling a video game application using a multimodal generation platform in accordance with aspects described herein.
FIG. 5 is a block diagram illustrating example physical components of a computing device that may be used to practice aspects of the disclosure.
Fig. 6A and 6B are simplified block diagrams of mobile computing devices that may be used to practice aspects of the present disclosure.
FIG. 7 is a simplified block diagram of a distributed computing system in which aspects of the present disclosure may be practiced.
Fig. 8 illustrates a tablet computing device for performing one or more aspects of the present disclosure.
Detailed Description
In the following detailed description, reference is made to the accompanying drawings, which form a part hereof, and in which is shown by way of illustration specific embodiments or examples. These aspects may be combined, other aspects may be utilized, and structural changes may be made without departing from the present disclosure. Embodiments may be practiced as methods, systems, or devices. Thus, embodiments may take the form of a hardware implementation, an entirely software implementation, or an implementation combining software and hardware aspects. The following detailed description is, therefore, not to be taken in a limiting sense, and the scope of the present disclosure is defined by the appended claims and their equivalents.
In an example, a user and a conversation agent interact using natural language, where user input is processed using a generative machine learning model to generate natural language output. The natural language output may then be provided in response to the user input. However, using a generative machine learning model in this manner may limit the usefulness of the conversation agent, for example to contexts in which natural language interactions alone can achieve the user's goals. Furthermore, because such interactions are limited to natural language communication, they may lack richness or depth, which may also diminish the associated user experience. For example, a conversation agent that interacts using only natural language may be unable to affect application state to facilitate user interaction with an application.
Accordingly, aspects of the present disclosure relate to basic multi-modal agent interactions. In an example, a generative multimodal machine learning model processes user input and generates multimodal output. For example, a conversation agent in accordance with aspects described herein may receive user input, such that the user input is processed using a generative multimodal machine learning model to generate multimodal output. The multimodal output may include natural language output and/or program output, among other examples. The multimodal output may be processed and used to affect the state of an associated application. For example, at least a portion of the multimodal output may be executed, or may be used to invoke an application programming interface (API) of the application. In some examples, a generative multimodal machine learning model (also referred to herein more generally as a multimodal machine learning model) used in accordance with aspects described herein may be a generative transformer model. In some cases, explicit and/or implicit feedback may be processed to improve the performance of the multimodal machine learning model.
In an example, the user input and/or the model output is multimodal, which, as used herein, means it may include one or more types of content. Example content includes, but is not limited to, written language (which may also be referred to herein as "natural language output"), code (which may also be referred to herein as "program output"), images, video, audio, gestures, visual features, intonation, outline features, poses, style, fonts, and/or transitions, among other examples. Thus, in contrast to machine learning models that process natural language input and generate natural language output, aspects of the present disclosure may process input and generate output having any of a variety of content types.
It should be appreciated that not all inputs and outputs of a conversation agent and its associated machine learning model need be multimodal in nature. Rather, an input or output may have a single content type. For example, a user may provide natural-language-only input to a conversation agent, such that the conversation agent provides program output as a response, among other examples. Thus, a machine learning model according to aspects described herein may be referred to as multimodal because of its ability to handle multiple content types. For example, in the example above, the machine learning model interoperates between the natural language and program content types.
Due to the multimodal nature of the machine learning model used by the conversation agents described herein, a user may interact with a conversation agent to affect application state. Examples include, but are not limited to, manipulating one or more parameters of an application; causing interactions similar to those provided by a control surface of the application (e.g., with user interface elements, gestures, keystrokes, mouse input, or menu items); causing execution of an API call; or manipulating or otherwise controlling the generation of one or more types of content (e.g., textual, visual, and/or auditory content). Additional examples of such aspects are discussed in more detail below.
Returning to the above example of multimodal interaction involving natural language and program content, natural language input received from a user may be processed to generate program output, such that at least a portion of the program output is executed. For example, the user may provide instructions as natural language input, such that the machine learning model generates program output that causes the application to behave in accordance with the user's instructions. For instance, the program output may be a series of program steps that are executed or otherwise performed by the application to accomplish one or more complex tasks associated with the application. Similarly, in addition to or as an alternative to such program output, the machine learning model may generate natural language output. In some examples, such natural language output may itself take the form of program output, for example as code resembling a comment or a "print" or "return" function.
While a machine learning model may be fine-tuned for one or more particular scenarios (examples of which are discussed in more detail below), aspects of the present disclosure may be performed using a machine learning model that has not been specifically tuned for such scenarios. As an example, the multimodal machine learning model may be primed using a prompt or other context that biases the machine learning model toward a particular behavior. For example, the prompt may provide one or more multimodal examples that illustrate relationships between multiple content types. Thus, a prompt may be designed or generated to increase the likelihood that the model exhibits a particular behavior. In some examples, one or more such prompts may be distributed and/or (re)used in particular contexts, such that the same machine learning model may be used to exhibit different behaviors in different scenarios (e.g., where each scenario may have an associated behavior).
Returning to the above example of natural language and program content, a prompt in such an example may include text comments and associated code segments to prime the machine learning model to process similar natural language input and generate similar program output. While an example prompt is described, it should be appreciated that any of a variety of techniques may be used to prime a machine learning model to perform the multimodal processing aspects described herein. For example, other techniques may be used to associate two types of content (e.g., other than comments and associated code). Similarly, it should be appreciated that in some examples, such as when the machine learning model has been fine-tuned for a given scenario, the machine learning model need not be primed as described above.
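For illustration only, the following Python sketch shows one way such a prompt might pair natural language comments with code segments so that the model continues the pattern for new user input; the game-scripting calls quoted inside the prompt (scene.get_npc, npc.play_animation, and so on) are hypothetical placeholders rather than an API defined by this disclosure.

```python
# A minimal sketch of a priming prompt; the scripting calls quoted inside the
# prompt string are hypothetical and are never executed by this snippet.
PRIMING_PROMPT = '''\
# Make the shopkeeper wave at the player.
npc = scene.get_npc("shopkeeper")
npc.play_animation("wave")

# Turn off the lights in the tavern.
scene.get_area("tavern").set_lighting(0.0)
'''

def build_model_input(prompt: str, user_input: str) -> str:
    """Prepend the priming prompt so the model continues the comment/code pattern."""
    return f"{prompt}\n# {user_input}\n"

if __name__ == "__main__":
    print(build_model_input(PRIMING_PROMPT, "Make the guard follow me."))
```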
In an example, a context may be maintained for interactions with a conversation agent. For example, the context may include the prompt and one or more instances of user input and associated model output. Such a context may thus enable the conversation agent to learn from previous interactions with the user, for example to correct machine learning model behavior, define new behavior, and/or introspect on previous behavior, among other examples. Any of a variety of techniques may be used to maintain a context associated with a conversation agent, for example associated with a specific user, a specific user session, a specific group of users, or, more generally, a larger population of users. As another example, the context may include a predetermined number of previous interactions, or the machine learning model may have limited attention, with such context affecting the model output generated by the machine learning model.
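One way such a context might be maintained is sketched below, assuming a simple bounded window of prior turns that is appended to the prompt; the class and field names are illustrative assumptions rather than elements recited by this disclosure.

```python
from collections import deque

class ConversationContext:
    """Keeps the priming prompt plus a bounded window of prior interactions."""

    def __init__(self, prompt: str, max_turns: int = 8):
        self.prompt = prompt
        # Older turns fall out of the window, approximating limited model attention.
        self.turns = deque(maxlen=max_turns)

    def record(self, user_input: str, model_output: str) -> None:
        self.turns.append((user_input, model_output))

    def to_model_input(self, new_user_input: str) -> str:
        history = "\n".join(f"# {u}\n{o}" for u, o in self.turns)
        return f"{self.prompt}\n{history}\n# {new_user_input}\n"
```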
As used herein, a conversation agent performs processing based on model output, for example model output generated based on user input associated with a user. For example, the conversation agent may be a virtual assistant or a non-player character (NPC). In some examples, a conversation agent has a visual representation (e.g., an avatar or graphical depiction), aspects of which may be controlled based on model output according to aspects described herein. In other examples, the conversation agent's functionality may be integrated into an application such that it has no visual manifestation. For example, a user may provide natural language input into a text box or as spoken dialogue to an application, such that the input is processed (e.g., by a conversation agent of the application) to affect the behavior of the application accordingly.
In an example interaction between a user and a conversation agent in accordance with aspects described herein, user input is received, and an indication of the user input is provided to a multimodal machine learning model in association with a prompt, such that the machine learning model generates model output accordingly. As noted above, the user input and the resulting model output need not have the same content type. Indeed, in some examples, at least a portion of the model output has a content type different from that of the user input. The model output is processed to affect the behavior of an application, thereby enabling the user to control aspects of the application using natural language input (or, in other examples, any of a variety of alternative or additional input types).
For example, model output obtained from the multimodal machine learning model may include program output that is executed to control the behavior of the application. As an example, natural language user input may be received indicating a user request to cause an NPC to jump, move, or follow the user's avatar, among other examples. The user input is thus processed by the multimodal machine learning model in accordance with aspects described herein to generate program output comprising a set of program steps that, when executed, cause the NPC to exhibit the behavior requested by the user. Similar techniques may be applied to affect the behavior of the application itself, for example executing program output to control various functions of the application accordingly.
In an example where the user requests that an NPC jump, the generated program output may include a first program step that invokes a "jump" function associated with the NPC and a second program step that causes the NPC to stop jumping after a predetermined amount of time. The program output may include any number of program steps. For example, where the user requests that the NPC follow the user's avatar, the generated program output may include a first program step that determines the position of the user's avatar, a second program step that causes the NPC to move toward the determined position, and a third program step that causes the first and second program steps to be performed again after a predetermined period of time. The generated program output thus forms a loop that periodically updates the position of the NPC based on the position of the user's avatar. Accordingly, aspects described herein enable a user to invoke, using natural language input, complex application functionality that may be cumbersome or difficult to achieve through a user interface, or that may not otherwise be available to the user at all.
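A sketch of what such generated program output might look like is shown below; the NPC and player methods (start_jumping, move_toward, get_position) are hypothetical stand-ins for whatever functions a particular video game application actually exposes.

```python
import threading

# Hypothetical program output for "make the NPC jump": start jumping, then stop
# after a predetermined amount of time.
def generated_jump_behavior(npc, duration: float = 2.0) -> None:
    npc.start_jumping()
    threading.Timer(duration, npc.stop_jumping).start()

# Hypothetical program output for "follow my avatar": determine the avatar's
# position, move toward it, and reschedule the same steps after a predetermined
# period, forming a loop that periodically updates the NPC's position.
def generated_follow_behavior(npc, player, interval: float = 0.5) -> None:
    target = player.get_position()
    npc.move_toward(target)
    threading.Timer(interval, generated_follow_behavior,
                    args=(npc, player, interval)).start()
```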
As described above, the user input and/or the model output may form a context for use in subsequent interactions with the conversation agent. As an example, when subsequent user input is received, a context associated with the subsequent user input is provided, the context including the prompt in addition to previous user input and/or model output. A subsequent model output is thus received, which may be processed similarly to the model output discussed above. Subsequent user interactions with the conversation agent may therefore be iterative in nature, with the context maintained so as to enable the user to correct machine learning model behavior, define new behavior and associated output of the machine learning model, and/or request that the machine learning model introspect on previous behavior, among other examples. As noted above, the context may not be maintained in other examples.
In some cases, the model may fail to generate sufficient model output. For example, it may be determined that a confidence level associated with the model output is below a predetermined threshold. As another example, while the multimodal machine learning model may attempt to generate content having a particular content type, the generated content may be invalid. Returning to the example of program output, the resulting model output may be syntactically and/or semantically invalid, such that it cannot be used to affect application behavior. For example, the model output may not conform to a particular API, or may reference nonexistent user interface elements or other functionality. In such cases, an indication may be provided that the user input was not properly understood, or the user may be prompted to reformulate the provided input, among other examples. In some examples, the user input and associated model output may be stored for subsequent use as training data to improve model performance in the future.
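The sketch below illustrates one possible check of program output before it is used, combining a confidence threshold with a syntactic parse and a semantic check against the functions the application actually exposes; the threshold value and message strings are assumptions made for illustration.

```python
import ast

CONFIDENCE_THRESHOLD = 0.5  # assumed value; only "a predetermined threshold" is required

def validate_program_output(model_output: str, confidence: float, known_functions: set):
    """Return None if the output appears usable, otherwise a message for the user."""
    if confidence < CONFIDENCE_THRESHOLD:
        return "Sorry, that input was not understood. Could you rephrase it?"
    try:
        tree = ast.parse(model_output)  # syntactic validity
    except SyntaxError:
        return "Sorry, that request could not be carried out."
    # Semantic validity: only allow calls to functions the application exposes.
    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id not in known_functions:
                return f"The requested function '{node.func.id}' is not available."
    return None
```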
As described above, in some examples the multimodal machine learning model may be fine-tuned, for example to incorporate additional and/or more specific knowledge for a given scenario. As an example, the multimodal machine learning model may be trained using content associated with a particular scenario or domain in which the model will be used. In the example of program output, the machine learning model may be trained using related documentation, libraries, and code examples. Thus, in addition to or as an alternative to priming the machine learning model with a prompt, such fine-tuning may be used to support interoperability between user inputs and model outputs having various content types. Additionally, training data generated as a result of users interacting with the conversation agent may be used to improve model performance, for example based on recorded inputs and outputs in instances where the machine learning model failed to generate sufficient model output.
Thus, aspects of the present disclosure enable a user to interact with a conversation agent more naturally, beyond a simple natural language conversation. Furthermore, the generated model output may be more dynamic than slot filling or other natural language processing techniques that map natural language input to specific application behaviors, because the machine learning model can effectively learn from user interactions and then generate corrected model output accordingly. Additionally, by virtue of the associated prompt, context, and/or fine-tuning, the generated model output may be adapted to any of a variety of scenarios, rather than filling slots strictly according to tokens identified within the user input. In general, aspects of the present disclosure enable a user experience in which tangible technical results within an application stem from multimodal machine learning model output, thereby affecting the behavior of the application.
Aspects of the present disclosure may also enable interactive development and debugging. For example, user input may be processed to generate program model output, which may be executed or otherwise used to affect application behavior in accordance with aspects described herein. Such processing enables the user to observe the effect of the generated program model output, so that the user may provide subsequent user input to correct the behavior of the machine learning model. Subsequent output from the machine learning model may thus be improved relative to previous model output (e.g., by virtue of the context maintained in association with the user's interactions with the conversation agent). Similarly, user input may cause the model output to exhibit new behavior that the machine learning model would not previously have generated. For example, the user input may correct the grammar, API calls, or other content of the model output. As another example, the resulting model output and associated processing may enable a user to access functionality of the application that would not otherwise be (easily) accessible to the user, for example functionality resulting from a combination of functions associated with multiple controls and/or control surfaces of the application.
Aspects of the present disclosure may be used to control video game applications. As an example, functions and/or other parameters of a video game application may be controlled based on user input (e.g., which may be received as natural language input). For example, the user input may be processed to generate model output associated with one or more API calls, user interface elements, or other commands of the video game. Similar to the debugging aspects discussed above, where the model output and resulting behavior are not what the user intended, the user may iteratively refine the behavior exhibited by the conversation agent. In some cases, the model output may ultimately yield a macro in which functionality of the video game application is combined in response to user input, so that the user may invoke the macro to affect the behavior of the video game application accordingly, as sketched below.
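The following sketch shows one way an iteratively refined model output could be registered as a reusable macro; the registry class and the placeholder steps are illustrative assumptions, not an API of any particular game.

```python
class MacroRegistry:
    """Stores named sequences of steps derived from refined model output."""

    def __init__(self):
        self._macros = {}

    def define(self, name, steps):
        # steps: zero-argument callables, e.g. wrappers around game API calls.
        self._macros[name] = list(steps)

    def invoke(self, name):
        for step in self._macros.get(name, []):
            step()

registry = MacroRegistry()
registry.define("prepare_for_battle", [
    lambda: print("equip_best_weapon()"),      # placeholder for a real API call
    lambda: print("drink_strength_potion()"),  # placeholder
    lambda: print("rally_party_members()"),    # placeholder
])
registry.invoke("prepare_for_battle")
```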
As another example, a prompt may define a persona associated with a conversation agent of a video game, such as an NPC, a narrator, or a virtual assistant. As used herein, a persona of a conversation agent may include a speaking volume, a role, a purpose or goal, a personality, one or more behaviors or animations, and/or a physical appearance, among other examples. For example, an NPC may have the role of a shop owner or a wizard, while also having an associated goal (e.g., to sell as many items as possible, or to help the user complete a task). The role of the NPC may thus remain substantially unchanged or evolve gradually over the course of the game, while the goals of the NPC may be comparatively more dynamic (e.g., varying according to a given context).
Furthermore, two or more NPCs may each have the same or similar roles but different goals, and vice versa. As another example, an NPC may have multiple goals, which may be consistent with one another or may conflict with one another. In such examples, one goal may therefore receive a higher weight or more attention than another (e.g., depending on context). A prompt may thus include an indication of the attitude, mood, role, or goal of an NPC, among other such persona attributes.
As used herein, a "user" may be a developer of a video game application (e.g., utilizing interactive debugging and/or defining personas of NPCs) and/or a player of a video game application (e.g., interacting with a video game and/or one or more NPCs therein), or other examples. For example, according to aspects described herein, a player may use natural language input to manipulate the application state or environment of a video game application.
It should be appreciated that a prompt need not be limited to linguistic priming, and may alternatively or additionally include examples of mannerisms, actions, or states (e.g., as may be defined linguistically and/or programmatically), among other examples. In another example, the prompt may include code or other programmatic definitions associated with dynamically accessing one or more variables or other aspects of the video game application.
In some cases, a first portion of a prompt may define a general persona for the conversation agent, while a second, scene-specific portion may define more specific aspects (e.g., associated with a scene, venue, map, or specific storyline). In such cases, the first portion may be used to generate model output for the conversation agent across multiple scenes, while the scene-specific portion used as the second portion may vary from scene to scene. For example, when a character transitions from one scene to another in a video game application, the prompt may be updated to replace a first scene-specific portion with a second scene-specific portion (e.g., while maintaining the same general persona portion). In some cases, such a transition may similarly cause the associated context to be truncated or reset.
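A sketch of such a layered prompt is given below, with a reusable general-persona portion and interchangeable scene-specific portions; the persona text and scene keys are invented for illustration.

```python
GENERAL_PERSONA = (
    "You are Mira, a cheerful shopkeeper. You want to sell as many items as "
    "possible, but you never lie about what an item does."
)

SCENE_SPECIFIC = {
    "market_square": "It is a busy festival day and you are short on healing potions.",
    "night_shop": "It is late at night; you speak quietly and want to close soon.",
}

def build_prompt(scene: str) -> str:
    # The general persona portion is reused across scenes; only the
    # scene-specific portion is swapped when the character changes scenes.
    return f"{GENERAL_PERSONA}\n{SCENE_SPECIFIC[scene]}"
```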
Similarly, the context associated with a conversation agent may be maintained across multiple such scenes, thereby enabling user interactions with the conversation agent to benefit from knowledge of past interactions. Such aspects may be used, for example, when an NPC appears in multiple scenes of a story. In such an example, a portion of the past context may be used as part of a future prompt.
The behavior of an NPC may thus be controlled based on model output generated by the machine learning model. The model output may include natural language output, program output (e.g., to affect how the conversation agent moves), audio and/or intonation output (e.g., to affect the way the conversation agent speaks), image output (e.g., to affect an associated texture), and/or video output (e.g., to affect an associated animation), among other examples. It should be appreciated that a single multimodal machine learning model need not be used to generate all of these and other content types. Rather, a set of such multimodal machine learning models may be used, where each machine learning model may have an associated set of content types that at least partially overlap.
For example, in an instance where the model outputs various visual and/or auditory aspects for controlling an NPC (e.g., animation, facial expression, intonation, etc.), the associated machine learning model may have been trained on video training data in which the motion of video subjects is annotated according to a skeletal model and dialogue data is annotated with associated intonation, both of which may further be associated with the natural language meaning of the dialogue.
As a result of using prompts (and, in some examples, associated contexts) to define conversation agents in this way, multiple conversation agents may be generated efficiently, e.g., each having a different attitude, role, goal, and/or feasibility, even though each conversation agent may use the same multimodal machine learning model.
It should be appreciated that aspects of the present disclosure are not necessarily limited to user interactions with an application of a computing device. For example, similar techniques may be applied to user interactions with any of a variety of other devices, such as virtual assistants, smart home devices, robotic and/or animatronic devices, among other examples.
FIG. 1 illustrates an overview of an example system 100 for basic multi-modal agent interactions in accordance with aspects described herein. As shown, the system 100 includes a multimodal generation platform 102, a computing device 104, a computing device 106, and a network 108. In an example, the multimodal generation platform 102, the computing device 104, and/or the computing device 106 communicate via the network 108, which may include a local area network, a wireless network, or the Internet, or any combination thereof, among other examples.
While system 100 is described in the example of performing processing using a machine learning model that is remote from computing devices 104 and 106 (e.g., at multimodal generation platform 102), it should be understood that in other examples, at least some aspects described herein with respect to multimodal generation platforms may be performed locally at computing device 104 and/or computing device 106.
The multimodal generation platform 102 is illustrated as including a request processor 110, a machine learning engine 112, a prompt store 114, and a training data store 116. In an example, the request processor 110 receives a request from the computing device 104 and/or the computing device 106 (e.g., from the model interaction manager 120 and the model interaction manager 126, respectively) to generate model output (e.g., as may be generated by the machine learning engine 112). For example, the request may include an indication of user input and/or a prompt or context, among other examples. In some cases, the request includes an indication of a prompt stored by the prompt store 114; as another example, the model interaction manager 120 may request a prompt from the prompt store 114, which may then be provided in the request to generate the model output.
The machine learning engine 112 may include a multimodal machine learning model in accordance with aspects described herein. For example, the machine learning engine 112 may include a multimodal machine learning model trained using a training data set having multiple content types (e.g., associated with one or more types of user input that may be received from the computing devices 104 and/or 106, and one or more types of model output generated in response). Thus, given content of a first type, the machine learning engine 112 may generate content of the first type and/or of a second type.
The multimodal generation platform 102 is further illustrated as including the prompt store 114, which may support the distribution of prompts with which model output may be generated in accordance with aspects described herein. For example, the prompts stored by the prompt store 114 may have associated versions, such that prompts used by the computing device 104 and/or the computing device 106 may be updated accordingly. In other cases, the prompt store 114 stores prompts (and, in some examples, associated contexts) associated with particular users or groups of users (e.g., of the computing device 104 and/or the computing device 106), or with given applications (e.g., the video game application 118 and/or the video game application 124), among other examples.
The training data store 116 may store training data associated with the machine learning engine 112. In an example, training data store 116 is updated based on instances when model output of machine learning engine 112 is determined to be inadequate (e.g., based on an associated confidence level or an indication received from model interaction manager 120 and/or model interaction manager 126, or other examples) such that machine learning engine 112 may then be retrained to improve its performance.
As shown, the computing device 104 includes a video game application 118, a model interaction manager 120, and a context store 122. Similarly, the computing device 106 includes a video game application 124, a model interaction manager 126, and a context store 128. Aspects are described below with reference to the computing device 104. The computing device 106 may be similar to the computing device 104, and aspects of the computing device 106 are therefore not re-described in detail below.
In an example, the video game application 118 may be a native application or a web-based application program. As another example, the video game application 118 may operate substantially locally on the computing device 104 or may operate in accordance with a server/client paradigm in conjunction with one or more game servers (not shown).
The video game application 118 may implement one or more conversation agents in accordance with aspects described herein. For example, such conversation agents may be implemented as NPCs, as virtual assistants, or as functionality of the video game application 118 that enables model-output-based control of the video game application 118 in response to received user input. It should be appreciated that in other examples, aspects of the video game application 118 (and, in some examples, the model interaction manager 120 and/or the context store 122) need not be local to the computing device 104 and may instead be implemented remotely, for example by one or more game servers (not shown).
The model interaction manager 120 may thus process received user input to support generation of model output in accordance with aspects described herein. For example, the model interaction manager 120 may determine a prompt with which the received user input is to be processed (e.g., as may be associated with a conversation agent of the video game application 118), such that the determined prompt and an indication of the user input are provided to the multimodal generation platform 102 (e.g., where they are received by the request processor 110 and processed by the machine learning engine 112 to generate model output). In response, the model interaction manager 120 receives the generated model output. In an example, at least a portion of the user input and/or the generated model output may be stored as context in the context store 122 (e.g., associated with the conversation agent for which the user input was received), such that the context may be used in subsequent requests for model output from the multimodal generation platform 102.
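A simplified client-side sketch of this flow is shown below; the endpoint URL, JSON payload shape, and response field name are assumptions made for illustration, since no wire format is specified here.

```python
import json
from urllib import request as urlrequest

class SimpleModelInteractionManager:
    """Determine a prompt, send it with the user input, and keep the exchange as context."""

    def __init__(self, endpoint: str, prompt: str):
        self.endpoint = endpoint   # assumed URL of a multimodal generation platform
        self.prompt = prompt
        self.context = []          # list of (user_input, model_output) pairs

    def handle_user_input(self, user_input: str) -> str:
        payload = {"prompt": self.prompt, "context": self.context, "input": user_input}
        req = urlrequest.Request(
            self.endpoint,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        with urlrequest.urlopen(req) as resp:
            model_output = json.load(resp)["output"]   # assumed response field
        self.context.append((user_input, model_output))
        return model_output
```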
According to aspects described herein, the model interaction manager 120 may process model output to affect the behavior of the video game application 118. For example, the model output may include any of a variety of content types, each of which may affect certain aspects of the video game application 118. As an example, the model output may include program output that may be executed, parsed, or otherwise processed (e.g., as one or more API calls or function calls) by the model interaction manager 120. As another example, the model output may include natural language output that may be presented to a user of the computing device 104 (e.g., as dialogue of an NPC, as an alert or message, or as other content of the video game application 118).
As described above, the model output may control any of a variety of other aspects of the video game application 118, such as textures, animations, and/or the intonation of spoken dialogue, as well as behavior patterns, actions, poses, and/or states of an NPC. Thus, while example processing and associated content types are described, it should be appreciated that the model interaction manager 120 may use any of a variety of techniques to process multimodal model output in accordance with aspects of the present disclosure.
In some cases, the model interaction manager 120 may determine that the model output is insufficient, for example when the model output has an associated confidence below a predetermined threshold, when an indication of an error or other issue is received from the multimodal generation platform 102 in addition to or instead of the model output, or when processing of at least a portion of the model output fails (e.g., when the model output includes code or other output that is syntactically or semantically incorrect), among other examples. In such cases, the model interaction manager 120 may provide a failure indication to the user, for example indicating that the user may retry or reformulate the user input, that the user input was not properly understood, or that the requested functionality is unavailable. While example issues and associated handling techniques are described, it should be appreciated that any of a variety of other issues and/or handling techniques may be encountered or used in other examples.
As described above, the same machine learning model (e.g., the machine learning engine 112) may be used to control or otherwise provide multiple NPCs or virtual assistants, each of which may have a different associated persona (via an associated prompt, as discussed). In such examples, the model interaction manager 120 may determine a different prompt for each NPC or virtual assistant. Further, at least a portion of a prompt may change according to the state of the video game application 118 (e.g., as the user progresses through a story, according to a map, as the user increases a level associated with the user's account, etc.).
The system 100 is illustrated with the computing device 104 and the computing device 106 to show that the machine learning engine 112 may be used by multiple computing devices to generate any of a variety of model outputs, associated with the video game application 118 and the video game application 124, respectively. In an example, the prompts and/or contexts used by the model interaction managers 120 and 126 may differ or may be similar (e.g., as may be the case when a context is shared among a group of users, or when the computing devices 104 and 106 share the same user). In other examples, the multimodal generation platform 102 may have a context store in which conversation agent contexts are stored.
In some cases, the multimodal generation platform 102 may include multiple machine learning engines, for example associated with different video game applications (e.g., as may be the case when a machine learning engine has been fine-tuned for a given scenario), such that the request processor 110 determines which machine learning engine is to process a request received from the model interaction manager 120 and/or the model interaction manager 126. Further, while aspects are described with respect to generating model output using a single generative multimodal machine learning model, it should be understood that in other examples multiple such models may be used. For example, a first generative multimodal machine learning model may have an associated first set of content types, and a second generative multimodal machine learning model may have an associated second set of content types, at least some of which may differ from the first set of content types.
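One simple way a request processor might route between such models is sketched below, selecting a model whose supported content types cover those needed for a request; the model names and type sets are purely illustrative assumptions.

```python
MODEL_CONTENT_TYPES = {
    "dialogue_model": {"natural_language", "program"},
    "presentation_model": {"natural_language", "image", "audio"},
}

def select_model(required_types: set) -> str:
    """Return the first model whose supported content types cover the request."""
    for name, supported in MODEL_CONTENT_TYPES.items():
        if required_types <= supported:
            return name
    raise LookupError("No model supports the requested content types")
```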
FIG. 2A illustrates an overview of an example method 200 for influencing an application based on model output from a multi-modal machine learning model in accordance with aspects described herein. In an example, aspects of the method 200 are performed by a model interaction manager, such as the model interaction manager 120 or the model interaction manager 126 discussed above with respect to fig. 1.
The method 200 begins at operation 202, where a prompt for multimodal generation is obtained. For example, the prompt may be obtained from a prompt store of a multimodal generation platform (e.g., the prompt store 114 of the multimodal generation platform 102). As another example, an application may be distributed with a set of prompts from which the prompt is obtained. In another example, at least a portion of the prompt may be user-provided, for example where a user creates or modifies a prompt for use with a given application. Thus, it should be appreciated that the prompt may be obtained according to any of a variety of techniques.
At operation 204, a user input is received. Example user inputs include, but are not limited to, natural language inputs (e.g., text inputs or voice inputs), program inputs and/or gesture inputs, and any of a variety of other interactions with an application. In some cases, the received user input may be multimodal, including, for example, multiple content types.
Flow proceeds to operation 206, where an indication of the user input and the prompt are provided to the multimodal generation platform (e.g., the multimodal generation platform 102). In an example, the received user input may be processed before being provided to the multimodal generation platform, for example to perform automatic speech recognition or gesture recognition. In other examples, such processing may instead be performed by the multimodal generation platform.
Moving to operation 208, a response is received from the multimodal generation platform. In an example, the response includes a model output generated by a machine learning engine (e.g., machine learning engine 112). In some cases, the model output itself may be multimodal or at least a portion thereof may have a different associated content type than the user input provided at operation 206. In other cases, at least a portion of the model output may have the same content type as the provided user input.
At operation 210, the response is processed to affect application behavior accordingly. In an example, the processing performed at operation 210 depends on the type of content output by the model. For example, if the model output includes a natural language output, the natural language output may be presented to the user (e.g., as text or as spoken dialog). As another example, if the model output includes a program output, the program output may be parsed or otherwise executed. In some cases, the model output includes a plurality of content types such that operation 210 includes identifying a plurality of sub-portions therein and processing each sub-portion accordingly. In other cases, the model output may be a program output in which the natural language output is packaged such that executing, parsing, or otherwise processing the program output will result in the natural language output being presented to the user. Although the method 200 is described using examples of natural language output and/or program output, it should be understood that similar techniques may be applied to any of a variety of content types.
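A sketch of such per-type processing is given below; it assumes the model output has already been split into (content_type, payload) sub-portions, which is an assumed representation rather than one prescribed here.

```python
def process_model_output(parts, present_text, execute_program):
    """Dispatch each sub-portion of a multimodal model output by content type.

    parts: iterable of (content_type, payload) pairs.
    present_text / execute_program: application-supplied callables.
    """
    for content_type, payload in parts:
        if content_type == "natural_language":
            present_text(payload)        # e.g., show as text or render as spoken dialogue
        elif content_type == "program":
            execute_program(payload)     # e.g., parse and invoke application APIs
        else:
            raise ValueError(f"Unsupported content type: {content_type}")
```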
In some cases, operation 210 may include determining that the model output is insufficient and processing such identified problems accordingly. As described above, a failure indication may be presented to the user, for example, indicating that the user may retry or reformulate the user input, that the user input is not properly understood, or that the requested function may not be available. While example problems and associated problem-handling techniques are described, it should be appreciated that any of a variety of other problems and/or problem-handling techniques may be encountered/used in other examples.
Flow proceeds to operation 212, where an updated context is generated. For example, the context may be updated in a context store, such as the context store 122 or the context store 128 discussed above with reference to FIG. 1. The updated context may include the prompt obtained at operation 202, the user input received at operation 204, and/or the response received at operation 208, among other content. As described above, the context may maintain only a portion of previous conversation agent interactions, such that older interactions (e.g., beyond a predetermined number of interactions or after a predetermined time) are omitted. Operation 212 is illustrated with a dashed box to indicate that it may be omitted in some examples. For example, the context may not be maintained, in which case subsequent conversation agent interactions similarly use the prompt obtained at operation 202.
Finally, flow proceeds to operation 214, where subsequent user input is received. The received user input and an indication of the context (or, where no context is maintained, the prompt obtained at operation 202) are then provided to the multimodal generation platform at operation 216. Aspects of operations 214 and 216 are similar to those discussed above with respect to operations 204 and 206 and therefore need not be re-described in detail. In the described example, the machine learning engine may not maintain state associated with the performance of the method 200. Rather, any such state may be maintained by virtue of the context generated at operation 212 and/or the prompt obtained at operation 202. Thus, in some examples, the use of the machine learning engine to generate associated model output may be referred to as "zero-shot."
Flow then returns to operation 208, where a response is received from the multimodal generation platform based on the provided user input, such that it may be processed accordingly at operations 210 and 212. The flow may thus loop between operations 208-216 to process user input and implement conversation agent interactions using a multimodal machine learning model in accordance with aspects described herein.
FIG. 2B illustrates an overview of an example method 250 for generating a multimodal response in accordance with aspects described herein. In an example, aspects of the method 250 are performed by a multimodal generation platform, such as the multimodal generation platform 102 discussed above with respect to fig. 1.
As shown, the method 250 begins at operation 252, where a generation request is received. For example, a request may be received from a model interaction manager, such as model interaction manager 120 or model interaction manager 126 discussed above with reference to FIG. 1. In an example, the request is received as a result of performing aspects of operation 206 or operation 216 discussed above with respect to method 200 in fig. 2A. The request may include an indication of user input, a prompt, and/or a context.
At operation 254, the request is processed to generate model output. For example, the request may be processed using a multimodal machine learning model of a machine learning engine, such as the machine learning engine 112. In an example, processing the request includes determining a machine learning model from a set of machine learning models, as may be the case when the application from which the request was received at operation 252 is associated with a particular machine learning model.
Flow proceeds to operation 256, where a response is provided to the request that was received at operation 252. For example, the response may include the model output generated at operation 254. In some examples, the response may additionally or alternatively include a confidence level associated with the model output, or an indication that the model output is insufficient.
At decision 258, a determination is made as to whether to update training data associated with the machine learning engine. For example, the determination may include evaluating a confidence level associated with the model output generated at operation 254. In other examples, the indication may be received from a model interaction manager. Thus, it should be appreciated that it may be determined to update training data in any of a variety of scenarios.
Thus, if it is determined to update training data, flow branches yes to operation 260 where at least a portion of the request received at operation 252 is stored as training data associated with the generated model output. For example, the training data may be stored in a training data store, such as training data store 116 discussed above with respect to fig. 1. Thus, the machine learning engine may be retrained using the training data, thereby improving future model performance. The flow then terminates at operation 262. Similarly, if instead it is determined that the training data is not to be updated, flow branches no and also terminates at operation 262.
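Decision 258 might be implemented along the lines of the sketch below, storing the request and its output when the confidence is low or when the client flagged the output as insufficient; the threshold value and record fields are assumptions made for illustration.

```python
def maybe_store_training_example(training_store, request, model_output,
                                 confidence, client_flagged_insufficient=False,
                                 threshold=0.5):
    """Append a training record when the model output appears insufficient."""
    if confidence < threshold or client_flagged_insufficient:
        training_store.append({"request": request, "output": model_output})
        return True
    return False
```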
FIG. 3 illustrates an overview of an example method 300 for controlling a conversation agent using a multimodal generation platform in accordance with aspects described herein. In an example, aspects of the method 300 are performed by a model interaction manager (e.g., the model interaction manager 120 or the model interaction manager 126 in FIG. 1) to affect the behavior of a video game application (e.g., the video game application 118 or the video game application 124).
In the context of a video game application, the conversation agent may be an NPC, for example having a visual representation that is presented to the user. As another example, the conversation agent may be a virtual assistant or a narrator that need not have an associated visual representation. While example conversation agents are described, it should be understood that any of a variety of other conversation agents may be used in other examples.
The method 300 begins at operation 302, where a prompt for multimodal generation is determined. For example, the prompt may be obtained from a prompt store of a multimodal generation platform (e.g., the prompt store 114 of the multimodal generation platform 102) or from a game server, among other examples. As another example, a video game application may be distributed with a set of prompts from which the prompt is obtained. In another example, at least a portion of the prompt may be user-provided, for example where a user creates or modifies a prompt for use with the video game application.
As described above, the prompt may have multiple portions, with a first portion of the prompt defining a general persona for the conversation agent and a second portion defining more specific aspects (e.g., associated with a scene, venue, map, or particular storyline). In such cases, operation 302 may include generating a prompt having a general portion and one or more scene-specific portions, where the scene-specific portions are selected from a set of scene-specific portions. In some cases, a set of scene-specific portions may be associated with the general portion of the prompt, such that each different prompt may have a different set of associated scene-specific portions. In other cases, a set of scene-specific portions may be used in association with multiple prompts, for example when multiple NPCs each have different general personas but are intended to have similar scene-specific parameters for the same scene. Thus, it should be appreciated that the prompt may be determined according to any of a variety of techniques.
At operation 304, a user interaction associated with the conversation agent is identified. Example user interactions include, but are not limited to, natural language input (e.g., text input or voice input, as may be provided specifically to the conversation agent or more generally by the user), input associated with a player avatar, gesture input, mouse input, and/or keyboard input, as well as any of a variety of other interactions. In some cases, the identified user interaction may be multimodal, including, for example, both natural language input and player avatar input.
Operation 304 is illustrated with a dashed box to indicate that it may be omitted in some examples. For example, the behavior of the conversation agent may be controlled according to the method 300 even when no user interaction has been received. In some examples, aspects of the method 300 may be performed as a result of the conversation agent becoming visible to the user, in response to the occurrence of an event within the video game application, and/or after a predetermined amount of time has elapsed. Thus, it should be appreciated that any of a variety of triggers may be used and, further, that such triggers need not relate directly to user interaction with the video game application.
Flow proceeds to operation 306, where a request is provided to the multimodal generation platform (e.g., the multimodal generation platform 102). In an example, the request includes the prompt determined at operation 302 and, where a user interaction was identified, an indication of the identified user interaction. In an example, the user interaction may be processed before being provided to the multimodal generation platform, for example to perform automatic speech recognition or gesture recognition. In other examples, such processing may instead be performed by the multimodal generation platform.
Moving to operation 308, a response is received from the multimodal generation platform. In an example, the response includes model output generated by a machine learning engine (e.g., the machine learning engine 112). In some cases, the model output itself may be multimodal, or at least a portion thereof may have a content type different from that of the prompt and/or the indication of user interaction provided at operation 306. In other cases, at least a portion of the model output may have the same content type as that provided at operation 306.
At operation 310, the response is processed to control the behavior of the conversation agent accordingly. In an example, the processing performed at operation 310 depends on the content type of the received model output. For example, if the model output includes natural language output, the natural language output may be presented to the user (e.g., as text or as spoken dialogue). As another example, if the model output includes program output, the program output may be parsed or otherwise executed to affect the appearance and/or behavior of the conversation agent within the video game application. In some cases, the model output includes multiple content types, such that operation 310 includes identifying multiple sub-portions therein and processing each sub-portion accordingly. In other cases, the model output may be program output in which natural language output is packaged, such that executing, parsing, or otherwise processing the program output results in the natural language output being presented to the user.
Although the method 300 is described using the examples of natural language output and/or program output, it should be understood that similar techniques may be applied to any of a variety of content types. For example, the model output received at operation 308 may include textures, animations (e.g., avatar animations or facial animations), intonation information for spoken dialogue, sound effects, or other content that may be used to affect the behavior of a conversation agent of the video game application (e.g., textually, audibly, or visually).
In some cases, operation 310 may include determining that the model output is insufficient and handling such identified issues accordingly. As described above, a failure indication may be presented to the user, for example indicating that the user may retry or reformulate the user interaction, that the user interaction was not properly understood, or that the requested functionality is unavailable (e.g., where the program output is syntactically or semantically incorrect for the given video game application). While example issues and associated handling techniques are described, it should be appreciated that any of a variety of other issues and/or handling techniques may be encountered or used in other examples.
Flow proceeds to operation 312, where an updated context is generated. For example, the context may be updated in a context store, such as the context store 122 or the context store 128 discussed above with reference to FIG. 1. The updated context may include the prompt determined at operation 302, the user interaction identified at operation 304, and/or the response received at operation 308, among other content. As described above, the context may maintain only a portion of previous conversation agent interactions, such that older interactions (e.g., beyond a predetermined number of interactions or after a predetermined time) are omitted. Operation 312 is illustrated with a dashed box to indicate that it may be omitted in some examples. For example, the context may not be maintained, in which case subsequent conversation agent interactions similarly use the prompt determined at operation 302.
In the context of session agents for video game applications, some session agents may build a history with the user, such that the session agent is aware of previous interactions with the user. In contrast, other session agents may not retain such a history, as may be the case for a one-time session agent or for a recurring session agent that appears in different scenes (and thus may have different scene-specific hint portions in different scenes) but need not be aware of past interactions.
Finally, flow proceeds to operation 314 where a subsequent user interaction is identified. Similar to operation 304, operation 314 is shown using a dashed box to indicate that a user interaction need not be identified in other examples; for example, flow may instead return to operation 306 as a result of any of a variety of other triggers. Aspects of operation 314 are similar to those discussed above with respect to operation 304 and thus need not be re-described in detail. As shown, flow returns to operation 306 where a request is provided to the multimodal generation platform.
Accordingly, the flow may loop between operations 306-314 to process user interactions and implement session proxy interactions (e.g., using hints and/or contexts, as may be maintained by operation 312) using a multimodal machine learning model in accordance with aspects described herein. As described above, aspects of the method 300 may be provided for multiple session agents of a video game application, such that each session agent may exhibit at least slightly different behavior even if the same machine learning engine is used to generate model outputs.
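As a rough sketch of this loop, the control flow might look like the following; the game and platform objects and their methods (agent_active, generate, apply_model_output, next_user_interaction) are hypothetical placeholders for whatever interfaces a particular video game application and multimodal generation platform expose:

    def agent_interaction_loop(hint, platform, game, context_store=None):
        """Sketch of the loop over operations 306-314: request, response, processing, and context."""
        interaction = None
        while game.agent_active():
            response = platform.generate(                       # operations 306/308: request and response
                hint=hint,
                interaction=interaction,
                context=context_store.context() if context_store else None)
            game.apply_model_output(response)                   # operation 310: process the model output
            if context_store is not None:
                context_store.add(hint, interaction, response)  # operation 312: update the context
            interaction = game.next_user_interaction()          # operation 314: identify the next interaction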
FIG. 4 illustrates an overview of an example method 400 for controlling a video game application using a multimodal generation platform in accordance with aspects described herein. In an example, aspects of the method 400 are performed by a model interaction manager (e.g., the model interaction manager 120 or the model interaction manager 126 in fig. 1) to affect behavior of a video game application (e.g., the video game application 118 or the video game application 124).
It should be appreciated that the video game application may provide any of a variety of controls including, but not limited to, user interface elements, keyboard inputs, mouse inputs, controller inputs, gesture inputs, and/or voice inputs. Thus, in addition to such control, aspects of the method 400 may enable a user to access similar functionality via a multimodal machine learning model (e.g., using natural language input), wherein the resulting model output is processed in accordance with aspects described herein to control video game applications as directed by the user.
The method 400 begins at operation 402, where a hint for multimodal generation is determined. For example, the hint may be obtained from a hint store of the multimodal generation platform (e.g., hint store 114 of multimodal generation platform 102) or from a game server, among other examples. As another example, a video game application may be distributed with a set of hints from which the hint may be obtained. In another example, at least a portion of the hint may be user-provided, such as when a user creates or modifies a hint for use with the video game application. In some cases, different hints may be associated with different aspects of the video game application, as may be the case when different controls and/or functions are available in different scenes.
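As an illustration of how such a hint might be assembled from a generic portion, a scene-specific portion, and an optional user-provided portion, consider the following sketch; the wording, scene names, and command lists are invented for the example and are not taken from the disclosure:

    GENERAL_HINT = (
        "You translate player requests into commands for the game. "
        "Respond with natural language output, program output, or both."
    )

    SCENE_HINTS = {
        "village": "Available commands: travel_to(place), talk_to(npc), open_shop().",
        "dungeon": "Available commands: light_torch(), attack(target), flee().",
    }

    def determine_hint(scene: str, user_hint: str = "") -> str:
        """Combine the generic portion, the scene-specific portion, and any user-provided portion."""
        parts = (GENERAL_HINT, SCENE_HINTS.get(scene, ""), user_hint)
        return "\n".join(part for part in parts if part)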
At operation 404, a user interaction with a video game application is received. Example user interactions include, but are not limited to, natural language input (e.g., text input or voice input), input associated with a player avatar, gesture input, mouse input, and/or keyboard input, as well as any of a variety of other interactions. In some cases, the received user input may be multimodal, including both natural language input and gesture input, for example.
Flow proceeds to operation 406 where an indication of the user input and the hint are provided to the multimodal generation platform (e.g., multimodal generation platform 102). In an example, the received user input may be processed prior to being provided to the multimodal generation platform, for example, to perform automatic speech recognition or gesture recognition. In other examples, such processing may instead be performed by the multimodal generation platform.
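A request of the kind provided at operation 406 might be packaged as in the following sketch, where speech_to_text and classify_gesture stand in for hypothetical client-side preprocessing; in other examples these steps would instead be left to the multimodal generation platform:

    from typing import Callable, Optional

    def build_generation_request(hint: str,
                                 text: Optional[str] = None,
                                 audio: Optional[bytes] = None,
                                 gesture: Optional[object] = None,
                                 speech_to_text: Optional[Callable[[bytes], str]] = None,
                                 classify_gesture: Optional[Callable[[object], str]] = None) -> dict:
        """Package the hint together with an indication of the (possibly multimodal) user input."""
        indication = {}
        if text:
            indication["text"] = text
        if audio is not None and speech_to_text is not None:
            indication["text"] = speech_to_text(audio)          # automatic speech recognition
        if gesture is not None and classify_gesture is not None:
            indication["gesture"] = classify_gesture(gesture)   # gesture recognition
        return {"hint": hint, "user_input": indication}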
Moving to operation 408, a response is received from the multimodal generation platform. In an example, the response includes a model output generated by a machine learning engine (e.g., machine learning engine 112). In some cases, the model output itself may be multimodal or at least a portion thereof may have a different associated content type than the prompt and/or indication of the user input provided at operation 406. In other cases, at least a portion of the model output may have the same content type as that provided at operation 406.
At operation 410, the response is processed to control behavior of the video game application accordingly. In an example, the processing performed at operation 410 depends on the type of content of the received model output. For example, if the model output includes a natural language output, the natural language output may be presented to the user (e.g., as text or as spoken dialog). As another example, if the model output includes a program output, the program output may be parsed or otherwise executed to control functions of the video game application. For example, the program output may include one or more API calls, may identify one or more user interface elements for actuation, or may include executable code to implement functionality similar to that of other controls provided by the video game application. In some cases, the model output includes a plurality of content types, such that operation 410 includes identifying a plurality of sub-portions therein and processing each sub-portion accordingly. In other cases, the model output may be a program output in which the natural language output is packaged such that executing, parsing, or otherwise processing the program output will result in the natural language output being presented to the user.
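To make the program-output case concrete, one possible convention, assumed purely for illustration, is for the program output to encode a list of calls into an allow-listed set of application functions; the function names and the JSON shape below are invented for the example and are not part of any particular video game application:

    import json
    from typing import Callable, Dict

    GAME_API: Dict[str, Callable[..., None]] = {                 # illustrative application functions
        "travel_to":  lambda place: print(f"Traveling to {place}"),
        "open_menu":  lambda name: print(f"Opening the {name} menu"),
        "spawn_item": lambda item: print(f"Spawning {item}"),
    }

    def execute_program_output(program: str) -> None:
        """Treat program output as a JSON list of {"call": ..., "args": [...]} steps."""
        for step in json.loads(program):
            fn = GAME_API.get(step["call"])
            if fn is None:
                raise ValueError(f"Unknown function: {step['call']}")   # such failures may be surfaced to the user
            fn(*step.get("args", []))

    # Example: a model output requesting travel to a named location.
    execute_program_output('[{"call": "travel_to", "args": ["market square"]}]')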
Although the method 400 is described using examples of natural language output and/or program output, it should be understood that similar techniques may be applied to any of a variety of content types. For example, the model output received at operation 408 may include macros, game objects, or other content that may be used to affect the operation of the video game application in response to the user interaction received at operation 404, among other examples discussed above. For example, a game object received at operation 408 may be processed and imported into the application state of the video game application, thereby enabling users to create or otherwise manipulate game objects using natural language input (whereas achieving similar results using other, more traditional functionality may involve a high degree of manual interaction with the various controls provided by the video game application).
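For instance, a game object produced as model output might be described in a structured form and imported into the application state roughly as in the following sketch, where the GameObject fields and the JSON shape are assumptions made for illustration:

    import json
    from dataclasses import dataclass, field
    from typing import List, Tuple

    @dataclass
    class GameObject:
        name: str
        position: Tuple[float, float, float]
        properties: dict = field(default_factory=dict)

    def import_game_object(model_output: str, scene_objects: List[GameObject]) -> GameObject:
        """Parse a model-produced object description and add it to the current scene state."""
        spec = json.loads(model_output)
        obj = GameObject(name=spec["name"],
                         position=tuple(spec.get("position", (0.0, 0.0, 0.0))),
                         properties=spec.get("properties", {}))
        scene_objects.append(obj)
        return obj

    # Example: a user asks for "a tall oak tree near the gate" and the model returns:
    scene: List[GameObject] = []
    import_game_object('{"name": "oak_tree", "position": [12, 0, 4], "properties": {"height": 8}}', scene)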
In some cases, operation 410 may include determining that the model output is insufficient and processing such identified problems accordingly. As described above, a failure indication may be presented to the user, for example, indicating that the user may retry or reformulate the user interaction, that the user interaction was not properly understood, or that the requested function may not be available (e.g., where the program output may be grammatically or semantically incorrect for a given video game application). While example problems and associated problem-handling techniques are described, it should be appreciated that any of a variety of other problems and/or problem-handling techniques may be encountered/used in other examples.
Flow proceeds to operation 412 where an updated context is generated. For example, the context may be updated in a context store, such as context store 122 or context store 128 discussed above with reference to FIG. 1. The updated context may include the hint determined at operation 402, the user interaction received at operation 404, and/or the response received at operation 408, among other examples. As described above, the context may maintain only a portion of the previous session proxy interactions, such that older interactions (e.g., those beyond a predetermined number of interactions or older than a predetermined amount of time) may be omitted. Operation 412 is shown using a dashed box to indicate that operation 412 may be omitted in some examples. For example, the context may not be maintained, such that subsequent session proxy interactions similarly utilize the hint obtained at operation 402.
Finally, flow proceeds to operation 414 where subsequent user interactions are received. Aspects of operation 414 are similar to those discussed above with respect to operation 404 and thus need not be re-described in detail. As shown, flow returns to operation 406 where the request is provided to the multimodal generation platform. Accordingly, the flow may loop between operations 406-414 to process user interactions to control a gaming application (e.g., using hints and/or context, as may be maintained by operation 412) using a multimodal machine learning model in accordance with aspects described herein.
Fig. 5-8 and the associated descriptions provide a discussion of various operating environments in which aspects of the present disclosure may be practiced. However, the devices and systems shown and discussed with respect to fig. 5-8 are for purposes of illustration and description, and are not limiting of the wide variety of computing device configurations that may be used to practice aspects of the present disclosure described herein.
Fig. 5 is a block diagram illustrating physical components (e.g., hardware) of a computing device 500 that may be used to practice aspects of the disclosure. The computing device components described below may be applicable to the computing devices described above, including devices 104 and/or 106, and one or more devices associated with the multimodal generation platform 102 discussed above with respect to fig. 1. In a basic configuration, computing device 500 may include at least one processing unit 502 and system memory 504. Depending on the configuration and type of computing device, system memory 504 may include, but is not limited to, volatile storage (e.g., random access memory (RAM)), non-volatile storage (e.g., read-only memory (ROM)), flash memory, or any combination of such memories.
The system memory 504 may include an operating system 505 and one or more program modules 506 suitable for running software applications 520, such as one or more components supported by the systems described herein. As an example, system memory 504 may store a hint store 524 and a model interaction manager 526. The operating system 505, for example, may be suitable for controlling the operation of the computing device 500.
Further, embodiments of the present disclosure may be practiced in conjunction with a graphics library, other operating systems, or any other application program, and are not limited to any particular application or system. This basic configuration is illustrated in fig. 5 by those components within dashed line 508. Computing device 500 may have additional features or functionality. For example, computing device 500 may also include additional data storage devices (removable and/or non-removable) such as, for example, magnetic disks, optical disks, or tape. Such additional storage is illustrated in FIG. 5 by removable storage 509 and non-removable storage 510.
As described above, a number of program modules and data files may be stored in system memory 504. When executed on processing unit 502, program modules 506 (e.g., applications 520) may perform processes including, but not limited to, aspects described herein. Other program modules that may be used in accordance with aspects of the present disclosure may include email and contacts applications, word processing applications, spreadsheet applications, database applications, slide presentation applications, drawing or computer-aided programs and the like.
Furthermore, embodiments of the present disclosure may be practiced in an electrical circuit comprising discrete electronic elements, packaged or integrated electronic chips containing logic gates, a circuit utilizing a microprocessor, or on a single chip containing electronic elements or microprocessors. For example, embodiments of the present disclosure may be practiced via a system on a chip (SOC) in which each or many of the components shown in fig. 5 may be integrated onto a single integrated circuit. Such SOC devices may include one or more processing units, graphics units, communication units, system virtualization units, and various application functions, all of which are integrated (or "burned") onto a chip substrate as a single integrated circuit. When operating via an SOC, the functionality described herein with respect to the capabilities of the client switching protocol may operate via dedicated logic integrated with other components of computing device 500 on a single integrated circuit (chip). Embodiments of the present disclosure may also be practiced using other techniques (including but NOT limited to mechanical, optical, fluidic, AND quantum techniques) capable of performing logical operations (e.g., AND, OR, AND NOT). Furthermore, embodiments of the disclosure may be practiced within a general purpose computer or in any other circuit or system.
Computing device 500 may also have one or more input devices 512, such as a keyboard, mouse, pen, voice or sound input device, touch or slide input device, and so forth. Output device(s) 514 such as a display, speakers, printer, etc. may also be included. The above devices are examples and other devices may be used. Computing device 500 may include one or more communication connections 516 that allow communication with other computing devices 550. Examples of suitable communication connections 516 include, but are not limited to, radio Frequency (RF) transmitters, receivers, and/or transceiver circuitry; universal Serial Bus (USB), parallel and/or serial ports.
The term computer readable media as used herein may include computer storage media. Computer storage media may include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, or program modules. System memory 504, removable storage 509 and non-removable storage 510 are all examples of computer storage media (e.g., memory storage). Computer storage media may include RAM, ROM, electrically erasable read-only memory (EEPROM), flash memory or other memory technology, CD-ROM, digital Versatile Disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other article of manufacture that can be used to store information and that can be accessed by computing device 500. Any such computer storage media may be part of computing device 500. Computer storage media does not include a carrier wave or other propagated or modulated data signal.
Communication media may be embodied by computer readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave or other transport mechanism, and includes any information delivery media. The term "modulated data signal" may describe a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media may include wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, radio Frequency (RF), infrared and other wireless media.
Fig. 6A-6B illustrate a mobile computing device 600, e.g., a mobile phone, a smart phone, a wearable computer (such as a smart watch), a tablet, a laptop, etc., that may be used to practice embodiments of the present disclosure. In some aspects, the client may be a mobile computing device. Referring to FIG. 6A, one aspect of a mobile computing device 600 for implementing these aspects is shown. In a basic configuration, the mobile computing device 600 is a handheld computer having both input elements and output elements. The mobile computing device 600 typically includes a display 605 and one or more input buttons 610, the input buttons 610 allowing a user to input information into the mobile computing device 600. The display 605 of the mobile computing device 600 may also be used as an input device (e.g., a touch screen display).
If included, optional side input element 615 allows additional user input. The side input element 615 may be a rotary switch, a button, or any other type of manual input element. In alternative aspects, mobile computing device 600 may incorporate more or fewer input elements. For example, in some embodiments, the display 605 may not be a touch screen.
In yet another alternative embodiment, mobile computing device 600 is a portable telephone system, such as a cellular telephone. The mobile computing device 600 may also include an optional keypad 635. The optional keypad 635 may be a physical keyboard or a "soft" keyboard generated on a touch screen display.
In various embodiments, the output elements include a display 605 for showing a Graphical User Interface (GUI), a visual indicator 620 (e.g., a light emitting diode), and/or an audio transducer 625 (e.g., a speaker). In some aspects, the mobile computing device 600 incorporates a vibration transducer for providing haptic feedback to a user. In yet another aspect, the mobile computing device 600 incorporates input and/or output ports for sending signals to or receiving signals from an external device, such as an audio input (e.g., a microphone jack), an audio output (e.g., a headphone jack), and a video output (e.g., an HDMI port).
Fig. 6B is a block diagram illustrating an architecture of one aspect of a mobile computing device. That is, the mobile computing device 600 may incorporate a system (e.g., architecture) 602 to implement some aspects. In one embodiment, system 602 is implemented as a "smart phone" capable of running one or more applications (e.g., browser, email, calendar, contact manager, messaging client, game and media client/player). In some aspects, system 602 is integrated as a computing device, such as an integrated Personal Digital Assistant (PDA) and wireless telephone.
One or more applications 666 may be loaded into memory 662 and run on or in association with operating system 664. Examples of application programs include telephone dialer programs, email programs, personal Information Management (PIM) programs, word processing programs, spreadsheet programs, internet browser programs, messaging programs, and the like. The system 602 also includes a non-volatile storage area 668 within the memory 662. The non-volatile storage area 668 may be used to store persistent information that should not be lost when the system 602 is powered down. The application program 666 may use and store information in the non-volatile storage area 668, such as email or other messages used by an email application, or the like. A synchronization application (not shown) also resides on the system 602 and is programmed to interact with a corresponding synchronization application resident on the host computer to keep the information stored in the non-volatile storage area 668 synchronized with the corresponding information stored at the host computer. It should be appreciated that other applications may be loaded into memory 662 and run on the mobile computing device 600 described herein (e.g., search engine, extractor module, relevance ranking module, answer scoring module, etc.).
The system 602 has a power supply 670, which may be implemented as one or more batteries. The power supply 670 may also include an external power source such as an AC adapter or powered docking cradle that supplements or recharges the batteries.
The system 602 can also include a radio interface layer 672 that performs the function of transmitting and receiving radio frequency communications. The radio interface layer 672 supports wireless connectivity between the system 602 and the "outside world" via a communications carrier or service provider. Transmissions to and from the radio interface layer 672 are conducted under control of the operating system 664. In other words, communications received by radio interface layer 672 may be propagated to application 666 via operating system 664, and vice versa.
The visual indicator 620 may be used to provide visual notifications and/or the audio interface 674 may be used to generate audible notifications via the audio transducer 625. In the illustrated embodiment, the visual indicator 620 is a Light Emitting Diode (LED) and the audio transducer 625 is a speaker. These devices may be directly coupled to the power supply 670 so that when activated they may remain on for a duration dictated by the notification mechanism even though the processor 660 and other components might shut down for conserving battery power. The LED may be programmed to remain on indefinitely until the user takes action to indicate the powered-on status of the device. The audio interface 674 is used to provide audible signals to and receive audible signals from the user. For example, in addition to being coupled to the audio transducer 625, the audio interface 674 may also be coupled to a microphone to receive audible input to support a telephone conversation. According to embodiments of the present disclosure, the microphone may also be used as an audio sensor to support control of notifications, as described below. The system 602 may also include a video interface 676, the video interface 676 enabling operation of the onboard camera 630 to record still images, video streams, and the like.
The mobile computing device 600 implementing the system 602 may have additional features or functionality. For example, the mobile computing device 600 may also include additional data storage devices (removable and/or non-removable) such as, for example, magnetic disks, optical disks, or tape. Such additional storage is illustrated in fig. 6B by nonvolatile storage 668.
The data/information generated or captured by the mobile computing device 600 and stored via the system 602 may be stored locally on the mobile computing device 600 as described above, or the data may be stored on any number of storage media that are accessible by the device via the radio interface layer 672 or via a wired connection between the mobile computing device 600 and a separate computing device associated with the mobile computing device 600 (e.g., a server computer in a distributed computing network such as the internet). It should be appreciated that such data/information can be accessed via the mobile computing device 600 via the radio interface layer 672 or via a distributed computing network. Similarly, such data/information may be readily transferred between computing devices for storage and use in accordance with well known data/information transfer and storage means, including email and collaborative data/information sharing systems.
FIG. 7 illustrates one aspect of an architecture of a system for processing data received at a computing system from a remote source, such as a personal computer 704, a tablet computing device 706, or a mobile computing device 708 as described above. Content at the server device 702 may be stored in different communication channels or other storage types. For example, various documents may be stored using directory service 722, web portal 724, mailbox service 726, instant messaging store 728, or social networking service 730.
Model interaction manager 720 may be used by clients in communication with server device 702, and/or multimodal machine learning engine 721 may be used by server device 702. The server device 702 can provide data to and from client computing devices such as personal computer 704, tablet computing device 706, and/or mobile computing device 708 (e.g., a smart phone) over network 715. For example, the computer systems described above may be embodied in a personal computer 704, a tablet computing device 706, and/or a mobile computing device 708 (e.g., a smart phone). In addition to receiving graphics data that may be used for preprocessing at a graphics-initiating system or post-processing at a receiving computing system, any of these embodiments of the computing device may also obtain content from storage 716.
Fig. 8 illustrates an exemplary tablet computing device 800 that may execute one or more aspects disclosed herein. Additionally, aspects and functions described herein may operate on a distributed system (e.g., a cloud-based computing system), where application functions, memory, data storage and retrieval, and various processing functions may operate remotely from one another over a distributed computing network (e.g., the internet or an intranet). Various types of user interfaces and information may be displayed via an on-board computing device display or via a remote display unit associated with one or more computing devices. For example, various types of user interfaces and information may be displayed and interacted with on a wall surface onto which such user interfaces and information are projected. Interaction with the many computing systems with which aspects of the present disclosure may be practiced includes keystroke entry, touch screen entry, voice or other audio entry, and gesture entry in which an associated computing device is equipped with detection (e.g., camera) functionality for capturing and interpreting user gestures for controlling the functionality of the computing device, and the like.
As can be appreciated from the foregoing disclosure, one aspect of the present technology relates to a system comprising: at least one processor; and a memory storing instructions that, when executed by the at least one processor, cause the system to perform a set of operations. The set of operations includes: receiving user input at a video game application, the user input comprising natural language; determining a model output associated with a multimodal machine learning model based on the received user input; and executing at least a portion of the model output to control a function of the video game application. In an example, determining the model output includes: providing an indication of the user input to a multimodal generation platform; and receiving the model output from the multimodal generation platform. In another example, the set of operations further includes determining a hint for launching the multimodal machine learning model, and the indication of the user input further includes the determined hint. In yet another example, the set of operations further includes storing the received user input and the determined model output as part of a context. In yet another example, the set of operations further includes: receiving a second user input; determining a second model output associated with the multimodal machine learning model based on the second user input and the context; and executing at least a portion of the second model output to control the function of the video game application. In yet another example, the received user input is the same as the second user input, and the first model output is different from the second model output. In another example, the portion of the model output is program content that includes a set of program steps that are executed to control the function of the video game application. In yet another example, the multimodal machine learning model is associated with a set of content types including natural language content and program content.
In another aspect, the technology relates to a method for controlling a session proxy for a video game application. The method comprises the following steps: determining a hint associated with the session proxy; using the hint to initiate a multimodal machine learning model to determine a model output associated with the multimodal machine learning model; and processing the model output to affect behavior of the session proxy of the video game application. In an example, the model output is determined in response to a trigger associated with the session proxy. In another example, the method further includes identifying a user interaction with the session proxy, and the model output is further determined based in part on the identified user interaction. In a further example, the method further comprises: identifying a second user interaction with the session proxy; determining a second model output associated with the multimodal machine learning model based on the second user interaction and context; and processing the second model output to further influence the behavior of the session proxy of the video game application. In yet another example, processing the model output includes executing at least a portion of the model output to control the session proxy. In yet another example, the hint is for launching the multimodal machine learning model, and determining the hint associated with the session proxy includes: identifying a generic portion of the hint associated with the session proxy; and identifying a scene-specific portion of the hint associated with a scene of the video game application.
In another aspect, the present technology relates to a method for using model outputs of a multi-modal machine learning model to control a video game application. The method comprises the following steps: receiving, at a video game application, user input having a first content type; determining a model output associated with a multimodal machine learning model based on the received user input and a prompt associated with the video game application, wherein the multimodal machine learning model is associated with the first content type and a second content type; and executing at least a portion of the model output to control a function of the video game application. In an example, the method further comprises: storing the received user input and the determined model output as part of a context; receiving a second user input; determining a second model output associated with the multimodal machine learning model based on the second user input and the context; and executing at least a portion of the second model output to control the function of the video game application. In another example, the received user input is the same as the second user input, and the first model output is different from the second model output. In a further example, the method further comprises: identifying a general portion of the prompt associated with the video game application; and identifying a scene-specific portion of the prompt associated with a current scene of the video game application. In yet another example, determining the model output includes: providing an indication of the received user input and the prompt to a multimodal generation platform; and receiving the model output from the multimodal generation platform. In yet another example, the function of the video game application that is controlled by executing the portion of the model output is a function that is also accessible using a video game controller input.
Another aspect of the present technology relates to a system comprising: at least one processor; and a memory storing instructions that, when executed by the at least one processor, cause the system to perform a set of operations. The set of operations includes: determining a hint defining a persona associated with a session proxy of a video game application, wherein at least a portion of the hint is associated with a scene of the video game application; determining, based on the hint, a model output of a multimodal machine learning model for the session proxy; and controlling the session proxy of the video game application according to the determined model output. In an example, the persona associated with the session proxy specifies one or more of the following: a language for natural language output of the model output; a role of the session proxy for natural language output; a goal of the session proxy for natural language output; a personality of the session proxy for natural language output; a behavior pattern of the session proxy for visual presentation by the video game application; or a physical appearance of the session proxy for visual presentation by the video game application. In another example, the behavior pattern of the session proxy is defined programmatically. In a further example, the persona includes a program definition of variables that may be used to dynamically access the video game application. In yet another example, the determined hint further includes at least a portion of a context associated with a user of the video game application. In yet another example, controlling the session proxy of the video game application includes executing at least a portion of the model output to affect behavior of the session proxy in accordance with the determined model output. In another example, the scene is a first scene; the hint is a first hint; and the set of operations further includes: identifying a transition from the first scene to a second scene of the video game application; and, in response to identifying the transition: determining an updated hint comprising a generic hint portion from the first hint and another scene-specific portion associated with the second scene of the video game application; determining, based on the updated hint, a second model output of the multimodal machine learning model for the session proxy; and controlling the session proxy of the video game application according to the second model output. In yet another example, determining the updated hint further includes identifying a sub-portion of a context associated with the user's interaction with the session proxy.
In another aspect, the technology relates to a method for managing a state associated with a session proxy for a video game application. The method comprises the following steps: determining a hint defining a persona associated with the session proxy of the video game application, wherein the hint includes a generic portion and a first scene-specific portion associated with a first scene of the video game application; identifying a transition from the first scene to a second scene of the video game application; and, in response to identifying the transition to the second scene: determining an updated hint for the session proxy that includes the generic portion of the first hint and a second scene-specific portion associated with the second scene of the video game application; determining, based on the updated hint, a second model output of the multimodal machine learning model for the session proxy; and controlling the session proxy of the video game application according to the second model output. In an example, the second model output is determined and the session proxy is controlled in response to an identification of the user's interaction with the session proxy. In another example, the second model output is further determined based on the identified user interaction with the session proxy. In yet another example, determining the updated hint further includes identifying a sub-portion of a context associated with the user's interaction with the session proxy. In yet another example, the updated hint changes at least one of: a language for natural language output of the model output in the second scene; a role of the session proxy for natural language output in the second scene; a goal of the session proxy for natural language output in the second scene; a personality of the session proxy for natural language output in the second scene; a behavior pattern of the session proxy for visual presentation by the video game application in the second scene; or a physical appearance of the session proxy for visual presentation by the video game application in the second scene.
In another aspect, the present technology relates to a method for controlling a session proxy for a video game application. The method comprises the following steps: determining a hint defining a persona associated with the session proxy of the video game application, wherein at least a portion of the hint is associated with a scene of the video game application; determining, based on the hint, a model output of a multimodal machine learning model for the session proxy; and controlling the session proxy of the video game application according to the determined model output. In an example, the persona associated with the session proxy specifies one or more of the following: a language for natural language output of the model output; a role of the session proxy for natural language output; a goal of the session proxy for natural language output; a personality of the session proxy for natural language output; a behavior pattern of the session proxy for visual presentation by the video game application; or a physical appearance of the session proxy for visual presentation by the video game application. In another example, the behavior pattern of the session proxy is defined programmatically. In a further example, the persona includes a program definition of variables that may be used to dynamically access the video game application. In yet another example, the determined hint further includes at least a portion of a context associated with a user of the video game application. In yet another example, controlling the session proxy of the video game application includes executing at least a portion of the model output to affect behavior of the session proxy in accordance with the determined model output. In another example, the scene is a first scene, and the method further comprises: identifying a transition from the first scene to a second scene of the video game application; and, in response to identifying the transition: determining an updated hint comprising a generic hint portion from the first hint and another scene-specific portion associated with the second scene of the video game application; determining, based on the updated hint, a second model output of the multimodal machine learning model for the session proxy; and controlling the session proxy of the video game application according to the second model output.
For example, aspects of the present disclosure have been described above with reference to block diagrams and/or operational illustrations of methods, systems, and computer program products according to aspects of the disclosure. The functions/acts noted in the blocks may occur out of the order noted in the flowcharts. For example, two blocks shown in succession may in fact be executed substantially concurrently or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved.
The description and illustration of one or more aspects provided in the present application is not intended to limit or restrict the scope of the claimed disclosure in any way. The aspects, examples, and details provided in this disclosure are believed to be sufficient to convey possession and enable others to make and use the best mode of the claimed disclosure. The claimed disclosure should not be construed as limited to any aspect, example, or detail provided in the present disclosure. Whether shown and described in combination or separately, the various features (of the structures and methods) are intended to be selectively included or omitted to produce embodiments having a particular set of features. Having provided the description and illustration of the present application, those skilled in the art may contemplate variations, modifications, and alternative aspects that fall within the spirit of the broader aspects of the general inventive concepts embodied in the present application, without departing from the broader scope of the disclosure as set forth.

Claims (15)

1. A system, comprising:
At least one processor; and
A memory storing instructions that, when executed by the at least one processor, cause the system to perform a set of operations comprising:
receiving user input at a video game application, the user input comprising natural language;
determining a model output associated with a multimodal machine learning model based on the received user input; and
executing at least a portion of the model output to control a function of the video game application.
2. The system of claim 1, wherein determining the model output comprises:
providing an indication of the user input to a multimodal generation platform; and
receiving the model output from the multimodal generation platform.
3. The system of claim 2, wherein:
the set of operations further comprises determining a hint for launching the multimodal machine learning model; and
the indication of the user input further includes the determined hint.
4. The system of claim 1, wherein the portion of the model output is program content comprising a set of program steps that are executed to control the function of the video game application.
5. A method for controlling a session proxy for a video game application, the method comprising:
determining a hint associated with the session proxy;
using the hint to initiate a multimodal machine learning model to determine a model output associated with the multimodal machine learning model; and
processing the model output to affect behavior of the session proxy of the video game application.
6. The method of claim 5, wherein the model output is determined in response to a trigger associated with the session proxy.
7. The method of claim 5, wherein the hint is for launching the multimodal machine learning model, and wherein determining the hint associated with the session proxy comprises:
identifying a generic portion of the hint associated with the session proxy; and
identifying a scene-specific portion of the hint associated with a scene of the video game application.
8. A method for using model outputs of a multi-modal machine learning model to control a video game application, the method comprising:
receiving, at a video game application, user input having a first content type;
determining a model output associated with a multimodal machine learning model based on the received user input and a prompt associated with the video game application, wherein the multimodal machine learning model is associated with the first content type and a second content type; and
executing at least a portion of the model output to control a function of the video game application.
9. The method of claim 8, further comprising:
storing the received user input and the determined model output as part of a context;
receiving a second user input;
determining a second model output associated with the multimodal machine learning model based on the second user input and the context; and
executing at least a portion of the second model output to control the function of the video game application.
10. The method of claim 8, further comprising:
identifying a general portion of the prompt associated with the video game application; and
identifying a scene-specific portion of the prompt associated with a current scene of the video game application.
11. The system of claim 1, wherein the multimodal machine learning model is associated with a set of content types including natural language content and program content.
12. The method of claim 5, further comprising:
identifying a user interaction with the session proxy,
wherein the model output is further determined based in part on the identified user interaction.
13. The method of claim 12, further comprising:
identifying a second user interaction with the session proxy;
determining a second model output associated with the multimodal machine learning model based on the second user interaction and context; and
processing the second model output to further influence the behavior of the session proxy of the video game application.
14. The method of claim 9, wherein:
the received user input is the same as the second user input; and
the first model output is different from the second model output.
15. The method of claim 8, wherein the function of the video game application that is controlled by executing the portion of the model output is a function that is also accessible using a video game controller input.
CN202280069306.5A 2021-10-14 2022-09-20 Basic multi-modal agent interactions Pending CN118103116A (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US63/255,796 2021-10-14
US17/517,329 2021-11-02
US17/517,329 US20230122202A1 (en) 2021-10-14 2021-11-02 Grounded multimodal agent interactions
PCT/US2022/044051 WO2023064067A1 (en) 2021-10-14 2022-09-20 Grounded multimodal agent interactions

Publications (1)

Publication Number Publication Date
CN118103116A true CN118103116A (en) 2024-05-28

Family

ID=91153537

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202280069306.5A Pending CN118103116A (en) 2021-10-14 2022-09-20 Basic multi-modal agent interactions

Country Status (1)

Country Link
CN (1) CN118103116A (en)

Similar Documents

Publication Publication Date Title
US20210383720A1 (en) Systems and methods for programming instruction
US10586369B1 (en) Using dialog and contextual data of a virtual reality environment to create metadata to drive avatar animation
KR102106193B1 (en) Methods and systems for managing dialogs of a robot
US9978361B2 (en) Systems and methods for building state specific multi-turn contextual language understanding systems
JP6655552B2 (en) Methods and systems for handling dialogue with robots
US9747279B2 (en) Context carryover in language understanding systems or methods
TWI519968B (en) Input method editor user profiles
US20070288404A1 (en) Dynamic interaction menus from natural language representations
US20230123535A1 (en) Online machine learning-based dialogue authoring environment
US20180061393A1 (en) Systems and methods for artifical intelligence voice evolution
US20230123430A1 (en) Grounded multimodal agent interactions
US20230125036A1 (en) Natural language interface for virtual environment generation
US20200257954A1 (en) Techniques for generating digital personas
CN114860995B (en) Video script generation method and device, electronic equipment and medium
US20230122202A1 (en) Grounded multimodal agent interactions
CN118103116A (en) Basic multi-modal agent interactions
WO2023064067A1 (en) Grounded multimodal agent interactions
US20230381665A1 (en) Importing agent personalization data to possess in-game non-player characters
WO2023064091A1 (en) Natural language interface for virtual environment generation
WO2023229755A1 (en) Importing agent personalization data to possess in-game non-player characters
WO2023064514A1 (en) Online machine learning-based dialogue authoring environment
WO2023064074A1 (en) Grounded multimodal agent interactions
US20230359458A1 (en) Machine learning model management and software development integration
Giunchi et al. DreamCodeVR: Towards Democratizing Behavior Design in Virtual Reality with Speech-Driven Programming
US20230405468A1 (en) Leveraging machine learning models to implement accessibility features during gameplay

Legal Events

Date Code Title Description
PB01 Publication