CN114270337A - System and method for personalized and multi-modal context-aware human-machine dialog


Info

Publication number
CN114270337A
Authority
CN
China
Prior art keywords
user
conversation
context
dialog
aog
Prior art date
Legal status
Pending
Application number
CN202080054000.3A
Other languages
Chinese (zh)
Inventor
W. Zhang
Current Assignee
DMAI Guangzhou Co Ltd
Original Assignee
De Mai Co., Ltd.
Priority date
Filing date
Publication date
Application filed by De Mai Co., Ltd.
Publication of CN114270337A

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/24 Speech recognition using non-acoustical features
    • G10L15/25 Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/226 Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics
    • G10L2015/228 Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics of application context

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)

Abstract

The present teachings relate to methods, systems, media, and embodiments for human-machine dialog. Utterances are received from users participating in a human-machine conversation about a topic in a conversation scene. Multimodal surrounding information related to a human-machine conversation is obtained and analyzed to track a multimodal context of the human-machine conversation. Based on the tracked multimodal context, operations for spoken language understanding of the utterance are conducted in a context-aware manner to determine semantics of the utterance.

Description

System and method for personalized and multi-modal context-aware human-machine dialog
Cross Reference to Related Applications
This application claims priority from the following patent applications: U.S. provisional patent application 62/862296 (attorney docket No. 047437-) and U.S. provisional patent application 62/862275 (attorney docket No. 047437-).
Technical Field
The present teachings relate generally to computers. More particularly, the present teachings relate to human-machine dialog management.
Background
Computer-assisted dialog systems are becoming increasingly popular with advances in artificial intelligence and, thanks to the ubiquity of internet connectivity, the explosive growth of internet-based communications. For example, more and more call centers deploy automated dialog robots to handle customer calls. Hotels have begun to install various kiosks that can answer questions from guests or visitors. In recent years, automated man-machine communication in other fields has also become increasingly popular.
Conventional computer-assisted dialog systems are typically preprogrammed with certain dialog content, such as questions and answers based on dialog patterns well known in the relevant arts. Unfortunately, a dialog pattern that suits some human users may not suit others. Furthermore, a human user may run into problems during a conversation, and continuing a fixed dialog pattern regardless of what the user says may cause irritation or loss of interest.
When planning a conversation, human designers often need to manually author the dialog content based on known knowledge, which is time-consuming and tedious. Even more labor is required given the need to create different dialog patterns. While authoring the dialog content, any possible deviation from the designed dialog pattern may need to be anticipated and used to determine how to continue the conversation. Previous dialog systems have not effectively addressed such problems.
With recent developments in the AI field, the observed dynamic information can be adaptively incorporated into learning and used to guide the progress of human-computer interaction sessions. How to develop a knowledge representation (knowledge representation) that can incorporate dynamic information in different dimensions and sometimes in different modalities is a challenging problem. Since this knowledge representation is the basis for a dynamic conversational process between human and machine, it is necessary to configure the representation sufficiently to support adaptive conversations in a relevant manner.
To converse with a human being, an automated dialog system may need to achieve different levels of understanding: what the human said linguistically, what the semantics of the spoken words are, sometimes the emotional state of the person, and the causal relationships between the spoken content and the environment of the conversation. Conventional computer-assisted dialog systems do not adequately address such problems.
Accordingly, there is a need for methods and systems that address such limitations.
Disclosure of Invention
The teachings disclosed herein relate to methods, systems, and programming for human-machine dialog. More particularly, the present teachings relate to methods, systems, and programming for personalized and multi-modal context-aware human-machine dialog.
In one example, a method, implemented on a machine having at least one processor, memory, and a communication platform capable of connecting to a network, is provided for human-machine dialog. An utterance is received from a user participating in a human-machine dialog about a topic in a dialog scene. Multimodal surrounding information related to the human-machine dialog is obtained and analyzed to track the multimodal context of the human-machine dialog. Based on the tracked multimodal context, spoken language understanding of the utterance is performed in a context-aware manner to determine the semantics of the utterance.
In a different example, a system for human-machine dialog includes various functional modules for performing the method of human-machine dialog. Such a human-machine dialog system comprises a surrounding knowledge tracker and a Spoken Language Understanding (SLU) engine. The surrounding knowledge tracker is configured to obtain multimodal surrounding information related to the human-machine dialog and to analyze it to track the multimodal context of the human-machine dialog. The SLU engine is configured to receive an utterance from a user participating in a human-machine dialog about a topic in a dialog scene and to personalize the spoken language understanding of the utterance in a context-aware manner, based on the tracked multimodal context, to determine the semantics of the utterance.
Other concepts relate to software for implementing the present teachings. A software product according to this concept includes at least one machine-readable non-transitory medium and information carried by the medium. The information carried by the medium may be executable program code data, parameters associated with the executable program code, and/or information related to the user, the request, the content, or other additional information.
In one example, a machine-readable, non-transitory, and tangible medium has data recorded thereon for human-machine dialog, wherein the medium, when read by a machine, causes the machine to perform a series of steps of a method for human-machine dialog. An utterance is received from a user participating in a human-machine dialog about a topic in a dialog scene. Multimodal surrounding information related to the human-machine dialog is obtained and analyzed to track the multimodal context of the human-machine dialog. The method further includes performing spoken language understanding of the utterance in a context-aware manner, based on the tracked multimodal context, to determine the semantics of the utterance.
Additional advantages and novel features will be set forth in part in the description which follows and in part will become apparent to those skilled in the art upon examination of the following and the accompanying drawings or may be learned by production or operation of the examples. The advantages of the present teachings may be realized and attained by practice or use of various aspects of the methodologies, instrumentalities and combinations set forth in the detailed examples discussed below.
Drawings
The methods, systems, and/or programs described herein are further described in accordance with the exemplary embodiments. These exemplary embodiments are described in detail with reference to the accompanying drawings. These embodiments are non-limiting exemplary embodiments, in which like reference numerals represent similar structures throughout the several views of the drawings, and in which:
FIG. 1A depicts an exemplary configuration of a dialog system centered on information states that capture dynamic information observed during a dialog, according to embodiments of the present teachings;
FIG. 1B is a flow diagram of an exemplary process for using a dialog system that captures the information state of dynamic information observed during a dialog, according to an embodiment of the present teachings;
FIG. 2A depicts an exemplary construction of an information state according to embodiments of the present teachings;
FIG. 2B illustrates how the representations of the different estimated mindsets are connected in a dialog with a robot tutor teaching a user fraction addition, according to an embodiment of the present teachings;
FIG. 2C illustrates an exemplary relationship between the representations, in an information state, of the estimated mindsets of an agent, the shared mindset, and the mindset of a user, according to embodiments of the present teachings;
FIG. 3A illustrates exemplary relationships between different types of AND-OR graphs (AOG) used to represent estimated mind states of parties involved in a conversation, according to embodiments of the present teachings;
FIG. 3B depicts an exemplary association between a spatial AOG (S-AOG) and a temporal AOG (T-AOG) in an information state according to an embodiment of the present teachings;
FIG. 3C illustrates an exemplary S-AOG and its associated T-AOG, according to embodiments of the present teachings;
FIG. 3D illustrates an exemplary relationship between S-AOG, T-AOG and C-AOG according to an embodiment of the present teachings;
FIG. 4A illustrates an exemplary S-AOG that represents, in part, the mindset of an agent for teaching different mathematical concepts, in accordance with an embodiment of the present teachings;
FIG. 4B illustrates an exemplary T-AOG that represents a dialog strategy associated, in part, with the mindset of an agent teaching the concept of fractions, according to an embodiment of the present teachings;
FIG. 4C illustrates exemplary dialog content for teaching concepts associated with fractions according to embodiments of the present teachings;
FIG. 5A illustrates an exemplary temporal parse graph (T-PG) within a T-AOG that represents a shared mindset between a user and a machine in accordance with embodiments of the present teachings;
FIG. 5B illustrates a portion of a conversation between a machine and a human along a conversation path that represents the current shared mindset, in accordance with an embodiment of the present teachings;
FIG. 5C depicts an exemplary S-AOG having nodes parameterized with measurements related to the mastery levels of different underlying concepts to represent a user's mindset, according to embodiments of the present teachings;
FIG. 5D illustrates exemplary personality feature types of a user that may be estimated based on observations from the conversation, according to embodiments of the present teachings;
FIG. 6A depicts a generic S-AOG for tutoring dialogs according to an embodiment of the present teachings;
FIG. 6B depicts a particular T-AOG of a dialog relating to a greeting, according to an embodiment of the present teachings;
FIG. 6C illustrates different types of parameterized alternatives for different types of AOGs in accordance with an embodiment of the present teachings;
FIG. 6D illustrates an S-AOG having different nodes parameterized with rewards that are updated based on dynamic observations from conversations in accordance with an embodiment of the present teachings;
FIG. 6E illustrates an exemplary T-AOG generated by merging different graphs via graph matching with parameterized content, according to embodiments of the present teachings;
FIG. 6F illustrates an exemplary T-AOG with parameterized content associated with nodes, in accordance with embodiments of the present teachings;
FIG. 6G illustrates an exemplary T-AOG in which different paths traverse different nodes, the different paths parameterized with rewards that are updated based on dynamic observations from dialogs, according to embodiments of the present teachings;
FIG. 7A depicts a high level system diagram of a knowledge tracking unit in accordance with an embodiment of the present teachings;
FIG. 7B illustrates how knowledge tracking enables adaptive dialog management according to an embodiment of the present teachings;
FIG. 7C is a flow chart of an exemplary process of a knowledge tracking unit according to an embodiment of the present teachings;
FIG. 8A illustrates an example of utility-driven tutoring (node) planning with respect to an S-AOG, according to an embodiment of the present teachings;
FIG. 8B illustrates an example of utility-driven path planning with respect to a T-AOG, according to an embodiment of the present teachings;
FIG. 8C illustrates dynamic states in utility-driven adaptive dialog management based on parameterized AOG derivation according to embodiments of the present teachings;
FIGS. 9A-9B illustrate a scheme to enhance spoken language understanding in a human-machine conversation through automatic enrichment of content parameterized for AOG, according to an embodiment of the present teachings;
FIG. 9C illustrates an exemplary manner of generating enriched training data for parameterized AOG content in accordance with embodiments of the present teachings;
FIG. 10A depicts an exemplary high-level system diagram for enhancing ASR/NLU by training a model based on automatically generated enriched AOG content, according to an embodiment of the present teachings;
FIG. 10B is a flowchart of an exemplary process of enhancing an ASR/NLU by training a model based on automatically generated enriched AOG content, according to an embodiment of the present teachings;
FIG. 11A depicts an exemplary high-level system diagram for context-aware spoken language understanding based on surrounding knowledge tracked during a conversation, according to an embodiment of the present teachings;
FIG. 11B illustrates an exemplary type of ambient knowledge to be tracked to facilitate context-aware spoken language understanding in accordance with embodiments of the present teachings;
FIG. 12A illustrates tracking a personal profile based on speech occurring in a conversation, according to embodiments of the present teachings;
FIG. 12B illustrates tracking a personal profile based on visual observations during a conversation according to an embodiment of the present teachings;
FIG. 12C illustrates an exemplary partial personal profile that is tracked during a conversation based on multimodal input information obtained from a conversation scenario, in accordance with embodiments of the present teachings;
FIG. 12D illustrates an exemplary event knowledge representation constructed based on conversations in accordance with embodiments of the present teachings;
FIG. 13A illustrates characteristic groups for classifying users for adaptive conversation planning, in accordance with an embodiment of the present teachings;
FIG. 13B illustrates exemplary content/construction of a personalized user profile according to embodiments of the present teachings;
FIG. 13C illustrates an example of establishing group characteristics of users and their use to facilitate adaptive conversation planning, according to an embodiment of the present teachings;
FIG. 14A depicts an exemplary high-level system diagram for tracking individual speech-related characteristics and their user profiles to facilitate adaptive conversation planning, according to an embodiment of the present teachings;
FIG. 14B is a flowchart of an exemplary process for tracking individual speech-related characteristics and their user profiles to facilitate adaptive conversation planning, according to an embodiment of the present teachings;
FIG. 15A provides an exemplary structure for building a representation of events observed during a conversation, according to an embodiment of the present teachings;
FIGS. 15B-15C illustrate examples of tracking event-centric knowledge as a dialog context based on observations from a dialog scenario, according to embodiments of the present teachings;
FIGS. 15D-15E illustrate another example of tracking event-centric knowledge as a dialog context based on observations from a dialog scenario in accordance with an embodiment of the present teachings;
FIG. 16A depicts an exemplary high-level system diagram for personalized context-aware dialog management according to embodiments of the present teachings;
FIG. 16B is a flowchart of an exemplary process for personalized context-aware dialog management, according to an embodiment of the present teachings;
FIG. 16C depicts an exemplary high-level system diagram of an NLG engine and a TTS engine for producing a contextually-aware and personalized audio response, according to embodiments of the present teachings;
FIG. 16D is a flowchart of an exemplary process of an NLG engine according to an embodiment of the present teachings;
FIG. 16E is a flowchart of an exemplary process of a TTS engine according to an embodiment of the present teachings;
FIG. 17A depicts an exemplary high-level system diagram of adaptive personalized coaching based on dynamic tracking and feedback according to an embodiment of the present teachings;
FIG. 17B illustrates an exemplary method that a robotic agent tutor may use to tutor a student according to embodiments of the present teachings;
FIG. 17C provides exemplary aspects of a student user that a robot tutor may dynamically observe according to embodiments of the present teachings;
FIG. 17D provides an example of standard acoustic/visual (viseme) features of spoken language;
FIG. 17E illustrates an example of an adaptive coaching plan designed based on the acoustic/visual characteristics of the user relative to the acoustic/visual characteristics of the underlying spoken language, in accordance with an embodiment of the present teachings;
FIG. 17F illustrates an example of tutoring content for a user based on deviation of the user's viseme features from the viseme features of the underlying spoken language, in accordance with an embodiment of the present teachings;
FIG. 17G is a flowchart of an exemplary process of adaptive personalized coaching based on dynamic tracking and feedback, according to an embodiment of the present teachings;
FIG. 18 is an illustrative diagram of an exemplary mobile device architecture that can be used to implement a dedicated system embodying the present teachings in accordance with various embodiments; and
FIG. 19 is an illustrative diagram of an exemplary computing device architecture that can be used to implement a special purpose system embodying the present teachings in accordance with various embodiments.
Detailed Description
In the following detailed description, by way of example, numerous specific details are set forth in order to provide a thorough understanding of the relevant teachings. It will be apparent, however, to one skilled in the art that the present teachings may be practiced without these specific details. In other instances, well-known methods, procedures, components, and/or circuitry have been described at a relatively high-level without detail so as to avoid unnecessarily obscuring aspects of the present teachings.
The present teachings are directed to addressing the deficiencies of conventional human-machine dialog systems and providing methods and systems that enable rich presentation of multimodal information from the conversation environment to allow machines to have an improved perception of the surroundings and environment of the dialog content and to better adapt to the dialog by enhancing the interaction with the user. Based on such representations, the present teachings further disclose different modes for creating such representations and authoring content of the dialog in such representations. Further, to allow these representations to be adjusted based on the dynamics that occur during the conversation, the present teachings also disclose mechanisms that track the dynamics of the conversation and update the representations accordingly, which the machine then uses to conduct the conversation in a utility-driven manner to achieve maximized results.
FIG. 1A depicts an exemplary configuration of a dialog system 100, the dialog system 100 being centered on an information state 110, the information state 110 capturing dynamic information observed during a dialog, according to embodiments of the present teachings. The dialog system 100 includes a multimodal information processor 120, an Automatic Speech Recognition (ASR) engine 130, a Natural Language Understanding (NLU) engine 140, a Dialog Manager (DM) 150, a Natural Language Generation (NLG) engine 160, and a text-to-speech (TTS) engine 170. The system 100 interacts with a user 180 to conduct a conversation.
During the conversation, multimodal information is collected from the environment, including from the user 180; it captures ambient information of the conversation environment, the user's voice and expressions (of the face or body), and so on. The multimodal information thus collected is analyzed by the multimodal information processor 120 to extract relevant features of the different modalities in order to estimate different characteristics of the user and the environment. For example, the speech signal may be analyzed to determine speech-related features such as speaking rate, pitch, or even accent. Visual signals associated with the user may also be analyzed to extract, for example, facial features or body gestures in order to determine the user's expression. By combining the acoustic and visual features, the multimodal information processor 120 may also be able to infer an emotional state of the user. For example, a high pitch and fast speech combined with an angry facial expression may indicate that the user is not happy. In some embodiments, observed user activity may also be analyzed to better understand the user. For example, if a user points or walks to a particular object, it may reveal what the user is referring to in his/her speech. Such multimodal information can provide useful context for understanding the intent of the user. The multimodal information processor 120 can continuously analyze the multimodal information and store the analysis results in the information state 110, which can then be used by various components in the system 100 to facilitate decisions related to dialog management.
In operation, speech information from the user 180 is sent to the ASR engine 130 to perform speech recognition. Speech recognition may include determining the language the user 180 speaks and the words spoken. The results from the ASR engine 130 are further processed by the NLU engine 140 in order to understand the semantics of what the user said. This understanding may depend not only on the words spoken but also on other information (such as the expression and gestures of the user 180) and/or other contextual information (such as what was said before). Based on the understanding of the user's utterance, the dialog manager 150 determines how to respond to the user. The determined response may then be generated in text form via the NLG engine 160 and further transformed from text into a speech signal via the TTS engine 170. The output of the TTS engine 170 may then be delivered to the user 180 as a response to the user's utterance. Via this back-and-forth interaction, the dialog system continues the conversation with the user 180.
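The turn-by-turn flow just described can be sketched in code. The following is a minimal illustration only, not an implementation from the present disclosure; all class, method, and parameter names are assumptions:

```python
# Minimal sketch of one dialog turn through the components of Fig. 1A.
# All names are illustrative; reference numerals refer to the figure.

class DialogSystem:
    def __init__(self, info_state, mm, asr, nlu, dm, nlg, tts):
        self.info_state = info_state   # information state (110)
        self.mm = mm                   # multimodal information processor (120)
        self.asr, self.nlu = asr, nlu  # ASR engine (130), NLU engine (140)
        self.dm, self.nlg, self.tts = dm, nlg, tts  # 150, 160, 170

    def turn(self, audio, video):
        # Analyze surround information and fold it into the information state.
        self.info_state.update(self.mm.analyze(audio, video))
        # Recognize and understand the utterance, conditioned on the state.
        words = self.asr.recognize(audio, self.info_state)
        semantics = self.nlu.understand(words, self.info_state)
        # Decide on, phrase, and voice a response.
        response = self.dm.respond(semantics, self.info_state)
        text = self.nlg.generate(response, self.info_state)
        return self.tts.render(text, self.info_state)
```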
As seen in FIG. 1A, the components in the system 100 connect to the information state 110, which, as discussed herein, captures the dynamics surrounding a conversation and provides relevant and rich contextual information that can be used to facilitate speech recognition (ASR), language understanding (NLU), and various conversation-related determinations, including what the appropriate response is (DM), what language features to apply to a text response (NLG), and how to convert the text response into speech form, e.g., with what accent (TTS). As discussed herein, the information state 110 can represent dialog-related dynamics, obtained based on multimodal information, that are relevant to the user 180 or to the surroundings of the dialog.
Upon receiving multimodal information from a dialog scene (about a user or about the dialog environment), the multimodal information processor 120 analyzes the information and characterizes the dialog surroundings along different dimensions, for example, acoustic features (e.g., the user's pitch, speed, accent), visual features (e.g., the user's facial expressions, objects in the environment), physical features (e.g., the user waving or pointing at an object in the environment), the estimated emotion and/or mood of the user, and/or the preferences or intentions of the user. This information may then be stored in the information state 110.
The rich media context information stored in the information state 110 may facilitate the different components in playing their respective roles so that the conversation may proceed in an adaptive, more engaging, and more efficient manner with respect to the intended goal. For example, rich contextual information may improve understanding of the utterance of the user 180 based on what is observed in the conversation scene, support evaluating the performance of the user 180 and/or estimating the utility (or preference) of the user based on the intended goal of the conversation, determine how to respond to the utterance of the user 180 based on the user's estimated emotional state, and deliver the response in whatever way is considered most appropriate given the knowledge of the user, and so on. For example, if accent information represented in both acoustic form (e.g., a particular way of pronouncing certain phonemes) and visual form (e.g., a particular appearance of the user) is captured in the information state about the user, the ASR engine 130 may utilize this information to determine the words spoken by the user. Similarly, the NLU engine 140 can also utilize the rich context information to determine the semantics the user intends. For example, if the user points to a computer placed on a desk (visual information) and says "I like this," the NLU engine 140 may combine the output of the ASR engine 130 (i.e., "I like this") with the visual information that the user is pointing to a computer in the room to understand that the user's "this" refers to the computer. As another example, if the user 180 repeatedly makes mistakes in a tutoring session while the user is estimated, based on the tone of speech and/or the user's facial expressions (e.g., determined based on multimodal information), to be quite frustrated, the DM 150 may decide, instead of continuing to push the tutoring content, to temporarily change the topic based on the user's known interests (e.g., talking about a LEGO game) in order to keep the user engaged. The decision to distract the user may be made based on, for example, previously observed utilities regarding what worked (e.g., intermittently distracting the user worked in the past) and what did not (e.g., continued pressure did not make the user perform better).
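The deictic example above ("I like this" while pointing at a computer) suggests how tracked visual context can resolve an otherwise ambiguous reference. A hypothetical sketch, with all field names assumed for illustration:

```python
# Hypothetical deixis resolution using tracked visual context; the
# dictionary fields are assumptions, not an interface from the patent.

def resolve_deixis(semantics: dict, info_state: dict) -> dict:
    """Replace an unresolved demonstrative with the object pointed at."""
    if semantics.get("object") == "this":
        pointed = info_state.get("pointed_object")  # tracked by the
        if pointed:                                 # multimodal processor
            semantics = {**semantics, "object": pointed}
    return semantics

# "I like this" + pointing gesture at a computer:
print(resolve_deixis({"intent": "like", "object": "this"},
                     {"pointed_object": "computer"}))
# -> {'intent': 'like', 'object': 'computer'}
```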
FIG. 1B is a flow diagram of an exemplary process of the dialog system 100, in which the information state 110 captures dynamic information observed during a dialog, according to an embodiment of the present teachings. As seen in FIG. 1B, the process is iterative. At 105, multimodal information is received, which is then analyzed by the multimodal information processor 120. As discussed herein, the multimodal information includes information related to the user 180 and/or information related to the environment surrounding the conversation. Multimodal information related to the user may include the user's utterances and/or visual observations of the user, such as body gestures and/or facial expressions. The information related to the environment surrounding the conversation may include information about the scene, such as objects that are present, spatial/temporal relationships between the user and such observed objects (e.g., the user standing in front of a table), and/or dynamic relationships between the user's activities and the observed objects (e.g., the user walking to a table and pointing to a computer on the table). An understanding of the multimodal information captured from the dialog scene can then be used to facilitate other tasks in the dialog system 100.
Based on the information stored in the information state 110 (representing the past state) and the analysis results from the multimodal information processor 120 (regarding the current state), the ASR engine 130 and the NLU engine 140 perform, at 125, speech recognition to determine the words spoken by the user and language understanding based on the recognized words, respectively.
Based on the results of the multimodal information analysis and language understanding (i.e., what the user said or meant), changes to the dialog state are tracked at 135, and such changes are used to update the information state 110 accordingly at 145 to facilitate subsequent processing. To carry on the dialog, the DM 150 determines a response at 155 based on the dialog tree designed for the underlying dialog, the output of the NLU engine 140 (the understanding of the utterance), and the information stored in the information state 110. Once the response is determined, the NLG engine 160 generates the response, e.g., in textual form, based on the information state 110. When a response is determined, there may be different ways to say it. The NLG engine 160 may generate the response at 165 based on the user's preferred style or a style known to be more suitable for the particular user in the current conversation. For example, if a user answers a question incorrectly, there may be different ways to indicate that the answer is incorrect. If the particular user in the current conversation is known to be sensitive and easily frustrated, a gentler way of telling the user that his/her answer is incorrect may be used to generate the response. For example, rather than saying "this is wrong," the NLG engine 160 may generate a textual response such as "it is not exactly correct."
The text response generated by the NLG engine 160 may then be rendered by the TTS engine 170 at 175 into speech form, e.g., as an audio signal. Although TTS may be performed using standard or commonly used TTS techniques, the present teachings disclose that the response generated by the NLG engine 160 may be further personalized based on the information stored in the information state 110. For example, if a slower speaking speed or a softer speaking style is known to be more effective for the user, the generated response may be correspondingly rendered by the TTS engine 170 into speech form, e.g., with a lower speed and pitch, at 175. Another example is to render the response with an accent consistent with the student's known accent, based on the user's personalized information in the information state 110. The rendered response may then be delivered to the user at 185 as a response to the user's utterance. After responding to the user, the dialog system 100 tracks further changes to the dialog and updates the information state 110 accordingly at 195.
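This personalization of the rendered speech can likewise be sketched: the rendering parameters are looked up from the user's profile in the information state. The following is an illustrative assumption about how such a lookup might work, not a defined interface:

```python
# Illustrative only: deriving TTS rendering parameters from a user
# profile stored in the information state (field names are assumptions).

def tts_style(profile: dict) -> dict:
    style = {"rate": 1.0, "pitch": 1.0, "accent": "neutral"}
    if profile.get("prefers_slow_soft_speech"):
        style["rate"], style["pitch"] = 0.8, 0.9  # slower and softer
    if "accent" in profile:
        style["accent"] = profile["accent"]       # match the user's accent
    return style
```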
FIG. 2A depicts an exemplary construction of the information state representation 110 according to embodiments of the present teachings. Without limitation, the information state 110 includes representations of estimated mindsets. As shown, different representations may be estimated to represent, for example, the agent's mindset 200, the user's mindset 220, and the shared mindset 210, along with other information recorded therein. The agent's mindset 200 may refer to the intended goal(s) that the dialog agent (machine) is to achieve in a particular dialog. The shared mindset 210 may refer to a representation of the current conversation situation, which is a combination of the agent pursuing the intended agenda according to the agent's mindset 200 and the user's actual performance. The user's mindset 220 may refer to an estimated representation of what state the user is in with respect to the intended purpose of the conversation, based on the shared mindset or the user's performance. For example, if the agent's current task is to teach a student user the mathematical concept of fractions (which may include sub-concepts that build up an understanding of fractions), the user's mindset may include an estimate of the user's mastery of the various related concepts. Such an estimate may be derived based on evaluations of the student's performance at different stages of tutoring the related concepts.
FIG. 2B illustrates how such representations of the different mindsets are connected in an example where a robot tutor 205 teaches a student user 180 concepts 215 related to fraction addition, according to an embodiment of the present teachings. As can be seen, the robot agent 205 interacts with the student user 180 via multi-modal interactions. The robot agent 205 may begin tutoring based on an initial representation of the agent's mindset 200 (e.g., a lesson on fraction addition, which may be represented as an AOG). During tutoring, the student user 180 may answer questions from the robot agent 205, and the answers to those questions form a particular dialog path, enabling estimation of a representation of the shared mindset 210. Based on the user's answers, the user's performance is evaluated and a representation of the user's mindset 220 is estimated with respect to different aspects, e.g., whether the student has mastered the taught concept and what style of conversation works with this particular student.
As seen in FIG. 2A, the representations of the estimated mindsets are based on graph-related forms, including but not limited to the spatial-temporal-causal and-or graph (STC-AOG) 230 and the STC parse graph (STC-PG) 240, and may be used in conjunction with other types of information stored in the information state, such as the dialog history 250, the dialog context 260, event-centric knowledge 270, common sense models 280, ..., and user profiles 290. These different types of information may belong to multiple modalities and capture different aspects of the dynamics of each dialog with respect to each user. As such, the information state 110 captures general information for various conversations as well as personalized information about each user and each conversation, interconnected to facilitate the different components in the dialog system 100 performing their corresponding tasks in a more adaptive, personalized, and engaging manner.
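As a data structure, the information state of FIG. 2A could be held in a single container. The sketch below only mirrors the components enumerated above; the Python representation itself is an assumption:

```python
from dataclasses import dataclass, field

# Sketch of the information state (110) of Fig. 2A as one container;
# fields mirror the listed components, types are assumptions.

@dataclass
class InformationState:
    agent_mindset: object = None    # 200: intended agenda (AOGs)
    shared_mindset: object = None   # 210: dialog path traversed (PGs)
    user_mindset: object = None     # 220: estimated user state
    dialog_history: list = field(default_factory=list)   # 250
    dialog_context: dict = field(default_factory=dict)   # 260
    event_knowledge: list = field(default_factory=list)  # 270
    common_sense: dict = field(default_factory=dict)     # 280
    user_profile: dict = field(default_factory=dict)     # 290
```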
FIG. 2C illustrates an exemplary relationship between the agent's mindset 200, the shared mindset 210, and the user's mindset 220 represented in the information state 110, according to an embodiment of the present teachings. As discussed herein, the shared mindset 210 represents the state of a conversation effected via interaction between the agent and the user, and is a combination of the agent's intended agenda (in terms of the agent's mindset) and the user's performance in following that agenda. Based on the shared mindset 210, the dynamics of the conversation may be tracked in terms of what the agent has been able to achieve and what the user has achieved up to that point.
Tracking this dynamic knowledge enables the system to estimate what the user has achieved up to that point, i.e., which concepts or sub-concepts the student user has mastered so far and in what way (which dialog path(s) worked and which did not). Based on the achievements the student user has made to date, the user's mindset 220 can be inferred or estimated, and it is then used to determine how the agent may adjust or update the dialog strategy in order to achieve the desired goal, i.e., to adapt the agent's mindset to the user. This adjustment process yields an updated agent mindset 200. Based on the dialog history, the dialog system 100 learns the user's preferences or what is more effective (has higher utility) for the user. This information, once incorporated into the information state, is used to adjust the conversation strategy via utility-driven (or preference-driven) conversation planning. The updated dialog policy drives the next step of the dialog, which in turn may result in a response from the user and subsequent updates to the shared mindset, the user's mindset, and the agent's mindset. The process iterates so that the agent can continually adjust the conversation policy based on the dynamic information state.
According to the present teachings, the different mindsets are represented based on, for example, STC-AOGs and STC-PGs. FIG. 3A illustrates exemplary relationships between the different types of AND-OR graphs (AOGs) used to represent the estimated mindsets of the parties involved in a conversation, according to embodiments of the present teachings. An AOG is a graph with AND branches and OR branches. Branches from a node in the AOG that are associated through an AND relationship represent tasks that must all be traversed. Branches from a node in the AOG that are associated through an OR relationship represent tasks that can be traversed selectively. As discussed herein, the STC-AOG includes the S-AOG corresponding to the spatial AOG, the T-AOG corresponding to the temporal AOG, and the C-AOG corresponding to the causal AOG. In accordance with the present teachings, an S-AOG is a graph whose nodes may each correspond to a topic to be covered in a conversation. A T-AOG is a graph whose nodes may each correspond to a temporal action to be taken. Each T-AOG may be associated with a topic or node in the S-AOG, i.e., it represents the steps to be performed during a conversation about the topic corresponding to that S-AOG node. A C-AOG is a graph whose nodes can each be linked to a node in a T-AOG and a node in the corresponding S-AOG, representing an action occurring in the T-AOG and the causal impact of that action on the node in the corresponding S-AOG.
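The AND/OR semantics just described can be captured by a small node type. The following is a minimal sketch under the stated semantics (all children of an AND node are traversed; exactly one child of an OR node is chosen); it is not code from the present disclosure:

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Minimal AND-OR graph node: AND children must all be traversed,
# OR children are alternatives of which one is chosen.

@dataclass
class AOGNode:
    label: str
    kind: str = "AND"  # "AND" or "OR"
    children: List["AOGNode"] = field(default_factory=list)

    def expand(self, choose: Callable) -> List[str]:
        """Expand into one concrete traversal (a parse-graph fragment)."""
        if not self.children:
            return [self.label]
        picked = self.children if self.kind == "AND" else [choose(self.children)]
        out: List[str] = []
        for child in picked:
            out.extend(child.expand(choose))
        return out

greet = AOGNode("greet", kind="OR",
                children=[AOGNode("good morning"), AOGNode("good evening")])
print(greet.expand(choose=lambda branches: branches[0]))  # ['good morning']
```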
FIG. 3B depicts an exemplary relationship between a node in an S-AOG and the nodes of an associated T-AOG represented in the information state 110, according to embodiments of the present teachings. In this illustration, each K node corresponds to a node in the S-AOG, representing a skill or topic to be taught in the conversation. The state of each K node may be evaluated as "mastered" or "not mastered," with respective probabilities P(T) and 1-P(T), where P(T) represents the probability of transitioning from the not-mastered to the mastered state. P(L0) represents the probability of prior learning of the skill or prior knowledge of the topic, i.e., the likelihood that the student had already mastered the concept before the tutoring session started. To teach the skill/concept associated with each K node, the robot tutor may ask multiple questions based on the T-AOG associated with the K node, and the student then answers each question. Each question is shown as a Q node, and the student's answer is represented as an A node in FIG. 3B.
During the conversation between the agent and the user, the student's answer may be a correct answer A(C) or an incorrect answer A(W), as seen in FIG. 3B. Based on each answer received from the user, additional probabilities are determined based on various knowledge or observations collected, for example, during the conversation. For example, if the user provides the correct answer (A(C)), the probability P(G) that the answer is a guess may be determined, representing the likelihood that the student does not know the correct answer but guessed it. Conversely, 1-P(G) is the probability that the user knows the correct answer and answered correctly. For an incorrect or wrong answer A(W), a probability P(S) may be determined, representing the likelihood that the student gave a wrong answer even though the student did know the concept (a slip). Based on P(S), the probability 1-P(S) can be estimated, representing the probability that the student gave a wrong answer because the concept is unknown. Such probabilities may be computed for each node along the path traversed in the actual conversation and may be used to estimate when the student has mastered a concept, and what might work and what might not in teaching this particular student each particular topic.
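The parameters P(L0), P(T), P(G), and P(S) correspond to the prior-mastery, transition, guess, and slip parameters of the standard Bayesian Knowledge Tracing model. Reading FIG. 3B that way, the mastery estimate for a K node could be updated after each answer as below; this specific update rule is the standard one and is an assumption, not spelled out in the present description:

```python
def bkt_update(p_mastery, correct, p_guess, p_slip, p_transit):
    """One Bayesian-Knowledge-Tracing step for a K node.

    p_mastery plays the role of P(L), p_guess of P(G), p_slip of P(S),
    and p_transit of P(T) in Fig. 3B.
    """
    if correct:
        # Correct answer: mastered-and-did-not-slip vs. guessed correctly.
        num = p_mastery * (1.0 - p_slip)
        den = num + (1.0 - p_mastery) * p_guess
    else:
        # Wrong answer: mastered-but-slipped vs. unmastered-and-no-guess.
        num = p_mastery * p_slip
        den = num + (1.0 - p_mastery) * (1.0 - p_guess)
    posterior = num / den
    # The student may also transition from not mastered to mastered.
    return posterior + (1.0 - posterior) * p_transit

p = 0.3  # P(L0): prior mastery before the tutoring session
p = bkt_update(p, correct=True, p_guess=0.2, p_slip=0.1, p_transit=0.15)
print(round(p, 2))  # ~0.71 after one correct answer
```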
FIG. 3C illustrates an exemplary S-AOG and its associated T-AOGs, according to embodiments of the present teachings. In the example S-AOG 310 for tutoring the concept of fractions, each node corresponds to a topic or concept to be taught to the student user during a conversation. For example, the S-AOG 310 includes node P0 (310-1) representing the concept of fractions, node P1 (310-2) representing the concept of division, node P2 (310-3) representing the concept of multiplication, node P3 (310-4) representing the concept of addition, and node P4 (310-5) representing the concept of subtraction. In this example, the different nodes in the S-AOG 310 are related. For example, to master the concept of fractions, at least some of the other concepts regarding addition, subtraction, multiplication, and division need to be mastered first. To teach the concept (e.g., fractions) represented by an S-AOG node, the agent may need to perform a series of steps or a process in a dialog session with the student user. This process or series of steps corresponds to a T-AOG. In some embodiments, there may be multiple T-AOGs for each node in an S-AOG, each of which may represent a different way to teach the student and may be invoked in a personalized manner. As shown, S-AOG node 310-1 has a plurality of T-AOGs 320, one of which is shown as 320-1, corresponding to a series of question/answer time steps 330, 340, 350, 360, 370, 380, and so on. The choice of which T-AOG to use in each tutoring session on the concept of fractions may vary and may be determined based on various considerations, such as the user in the session (personalization), the current degree of mastery of the concept (e.g., P(L0)), etc.
The STC-AOG-based representation of a dialog captures the entities/objects/concepts related to the dialog (S-AOG), the possible actions observed during the dialog (T-AOG), and the impact of each of these actions on those entities/objects/concepts (C-AOG). The actual dialog activity (speech) occurring during the dialog results in traversing the corresponding graph representation, or STC-AOG, yielding a parse graph (PG) corresponding to the traversed portion of the STC-AOG. In some embodiments, the S-AOG may model a spatial decomposition of the objects and scenes of the dialog. In some embodiments, the S-AOG may model a decomposition of concepts and sub-concepts as discussed herein. In some embodiments, a T-AOG may model a temporal decomposition of events/sub-events/actions that may be performed or have occurred in a dialog related to certain entities/objects/concepts represented in the corresponding S-AOG. A C-AOG can model the decomposition of events represented in the T-AOG and their causal relationships to the corresponding entities/objects/concepts represented in the S-AOG. That is, the C-AOG describes changes to nodes in the S-AOG caused by events/actions taken in the dialog and represented in the T-AOG. Such information covers different aspects of the conversation and is captured in the information state 110. That is, the information state 110 represents the dynamics of the dialog between the user and the dialog agent. This is illustrated in FIG. 3D.
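Under this reading, a parse graph is simply the record of the branch actually taken at every OR node, while all AND branches are kept. Reusing the AOGNode sketch above (again an illustration, not the patent's implementation):

```python
# Illustrative: extracting a parse graph (PG) from an AOG traversal,
# reusing the AOGNode sketch above.

def parse_graph(node, taken_branch):
    """Collect the PG for one conversation: every step under an AND node,
    plus the single branch the dialog actually took at each OR node."""
    if not node.children:
        return [node.label]
    if node.kind == "AND":                 # all sub-steps, in order
        steps = []
        for child in node.children:
            steps.extend(parse_graph(child, taken_branch))
        return steps
    return parse_graph(taken_branch(node), taken_branch)  # OR: observed branch
```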
As discussed herein, the particular path traversed in an actual conversation session may result in different types of corresponding parse graphs (PGs). For example, traversal can be applied to an S-AOG to produce an S-PG, to a T-AOG to produce a T-PG, and to a C-AOG to produce a C-PG. That is, based on the STC-AOG, the actual conversation results in a dynamic STC-PG, which at least partially represents the different mindsets of the parties participating in the conversation session. To illustrate this, FIGS. 4A-4C show exemplary S-AOGs/T-AOGs associated with an agent's mindset for teaching fraction-related concepts; FIGS. 5A-5B provide an exemplary representation of a shared mindset via a T-PG generated based on the dialog in a particular tutoring session; and FIGS. 5C-5D illustrate exemplary representations of a user's mindset in terms of the estimated mastery of the different concepts taught in a conversation with the conversation agent.
FIG. 4A illustrates an exemplary representation of an agent's mindset with respect to tutoring fractions, according to embodiments of the present teachings. As discussed herein, the representation of the agent's mindset may reflect what the agent desires or is designed to cover in the conversation. The agent's mindset may be adjusted during the conversation session based on the user's performance/behavior, such that the representation of the agent's mindset captures such dynamics or adjustments. As shown in FIG. 4A, the exemplary representation of the agent's mindset includes various nodes, each node representing a sub-concept related to the concept of fractions. For example, there are sub-concepts related to: "understand fractions" 400, "compare fractions" 405, "understand equivalent fractions" 410, "expand and simplify equivalent fractions" 415, "find factor pairs" 420, "apply properties of multiplication/division" 425, "add fractions" 430, "find the LCM" 435, "solve for unknowns in multiplication/division" 440, "multiply and divide within 100" 445, "simplify improper fractions" 450, "understand improper fractions" 455, and "add and subtract" 460. These sub-concepts may constitute a lattice for fractions, some of which may need to be taught before others; e.g., "understand improper fractions" 455 may need to be covered before "simplify improper fractions" 450, "add and subtract" 460 may need to be mastered before "multiply and divide within 100" 445, and so on.
FIG. 4B illustrates an exemplary T-AOG representing an agent's mindset in teaching concepts related to fractions, according to embodiments of the present teachings. As discussed herein, a T-AOG includes various steps associated with a conversation, some of which relate to what the agent says, some of which relate to what the user responds, and some of which correspond to certain evaluations of the conversation performed by the agent. There are branches in the T-AOG that represent decisions. For example, at 470, the corresponding action is for the agent to highlight the numerator and denominator boxes, which may come after teaching the student what the numerator and denominator are, for example. After link 480, the agent proceeds to 490 to request user input, e.g., asking the user to tell the agent which highlighted box is the denominator. Based on the answer received from the student, the agent follows one of two links combined by OR (plus sign), where each link represents a path the user may take. For example, if the user correctly answers which is the denominator, the agent proceeds to 490-3, e.g., further asking the user to evaluate the denominator. If the user answers incorrectly, the agent proceeds to 490-4 to provide the user with a hint about the denominator, and then returns to 490 via link 490-2, again requesting user input as to which is the denominator.
If the agent asks the user to evaluate the denominator at 490-3, there are two associated outcomes: a wrong answer and a correct answer. The former leads to 490-5, where the agent indicates to the user that the answer is incorrect, and then returns to 490 following link 490-1, asking the user for input again. If the answer is correct, the agent follows another path to 490-6, letting the user know that he/she is correct, and continues along that path to set the denominator and clear the highlighting at 490-7 and 490-8, respectively. As can be seen, the steps at 490 represent the temporal actions the agent plans in relation to teaching the concept of the denominator, which relates to the S-AOG in FIG. 4A representing the agent's plan for teaching the student the concept of fractions. Thus, together they form part of the representation of the agent's mindset. FIG. 4C illustrates exemplary dialog content authored to teach concepts associated with fractions according to embodiments of the present teachings. Using a similar dialog strategy, the conversation is intended to proceed as a question-and-answer process.
FIG. 5A illustrates an exemplary representation of a shared mindset in the form of a T-PG (corresponding to a path in the T-AOG of FIG. 4B) according to embodiments of the present teachings. The highlighted steps form the particular path taken by the actions in the dialog performed by the dialog agent based on the answers from the user. In contrast to the T-AOG shown in FIG. 4B, FIG. 5A shows a T-PG of highlighted steps (e.g., 470, 510, 520, 530, 540, 550, and so on) along a highlighted path. The T-PG shown in FIG. 5A represents an instantiated path traversed based on the actions of both the agent and the user, and thus represents the shared mindset. FIG. 5B illustrates a portion of authored dialog content between an agent and a user, based on which a representation of the shared mindset may be obtained according to an embodiment of the present teachings. As discussed herein, a representation of the shared mindset may be derived from the stream of dialog, which forms a particular path, or T-PG, traversed along an existing T-AOG.
As discussed herein, during a conversation, the conversation agent estimates the mindset of the user participating in the conversation based on observations from the conversation, such that both the conversation and the estimated representation of the user's mindset are adjusted based on the dynamics of the conversation. For example, to determine how to conduct the conversation, the agent may need to evaluate or estimate the user's mastery of a particular topic based on observations of the user. As discussed with respect to FIG. 3B, the estimate may be probabilistic. Based on such probabilities, the agent may infer the current mastery level of a concept and determine how to proceed with the conversation, e.g., continue tutoring the current topic if the estimated mastery level is insufficient, or move on to other concepts if the estimated mastery level is sufficient. The agent may evaluate periodically during the dialog and, in this process, annotate (parameterize) the graph to facilitate decisions on the next moves in traversing it. Such an annotated or parameterized S-AOG may produce an S-PG, indicating, for example, which nodes in the S-AOG have been adequately covered and which have not. FIG. 5C depicts an exemplary S-PG of a corresponding S-AOG that represents an estimated user's mindset, according to embodiments of the present teachings. The underlying S-AOG is shown in FIG. 4A. In this illustrated example, during the conversation, each node in the S-AOG is evaluated based on the conversation and parameterized or annotated based on such evaluation. As shown in FIG. 5C, the nodes representing different sub-concepts related to fractions are annotated with respective parameters (which indicate, for example, the degree of mastery of the corresponding node).
As shown in FIG. 5C, the nodes of the initial S-AOG (FIG. 4A) are now annotated with different weights, each weight indicating the evaluated degree of mastery of the corresponding node's sub-concept. As can be seen, the nodes in FIG. 5C are presented in different shades, determined according to the weights representing the different degrees of mastery of the underlying sub-concepts. For example, the dotted nodes may correspond to sub-concepts that have already been mastered and therefore require no further traversal. Nodes 560 and 565 (corresponding to "understand fractions" and "understand improper fractions") may correspond to sub-concepts for which the desired level of mastery has not been achieved. The nodes connected to these two nodes (mastered and unmastered alike) may be considered in explaining why the user has not yet mastered the concepts of fractions and improper fractions.
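One plausible use of such mastery-annotated nodes is to pick the next concept to tutor: the first insufficiently mastered concept whose prerequisites are already mastered. The threshold and selection strategy below are assumptions for illustration:

```python
# Illustrative next-topic selection over a mastery-annotated S-AOG;
# the 0.95 threshold and the greedy strategy are assumptions.

MASTERY_THRESHOLD = 0.95

def next_concept(mastery: dict, prerequisites: dict):
    """Return the first unmastered concept whose prerequisites are mastered."""
    for concept, p in mastery.items():
        if p >= MASTERY_THRESHOLD:
            continue  # already mastered; no further traversal needed
        if all(mastery.get(q, 0.0) >= MASTERY_THRESHOLD
               for q in prerequisites.get(concept, [])):
            return concept
    return None  # everything mastered

mastery = {"add and subtract": 0.99,
           "understand improper fractions": 0.60,
           "understand fractions": 0.40}
prereqs = {"understand fractions": ["understand improper fractions"]}
print(next_concept(mastery, prereqs))  # -> 'understand improper fractions'
```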
Such estimated levels of mastery of the corresponding nodes in the original S-AOG yield an annotated S-PG that represents the estimated user's mindset, indicating the degree of understanding of the concepts associated with those nodes. This provides a basis for the dialog agent to understand the user's relevant patterns, e.g., of what has been taught, what the user has understood, and with what the user still has problems. As can be seen, the representation of the user's mindset is dynamically estimated based on, for example, the user's performance and activity during the ongoing conversation. In addition to estimating the mastery levels of concepts associated with the different nodes to understand the user's mindset, contextual observations and information about the user collected during the ongoing conversation may also be used to estimate other traits or behavioral indicators of the user as part of understanding the user's mindset. FIG. 5D illustrates exemplary types of personality traits of a user that may be estimated based on information observed in a conversation, according to embodiments of the present teachings. As shown, in a conversation with the user, based on observations of the user's behavior or expression (whether in spoken, visual, or physical form), the agent can estimate, via multimodal information processing (e.g., by the multimodal information processor 120), various characteristics of the user along different dimensions, e.g., whether the user is outgoing, how mature the user is, whether the user is angry, whether the user is easily excited, whether the user is generally honest, how confident or secure the user feels about him/herself, whether the user is reliable, rigorous, etc. Such information, once estimated, forms a profile of the user that can influence how the dialog system 100 adjusts its dialog strategy when needed and in what manner the agent of the dialog system 100 should conduct the dialog with the user.
Both S-AOG and T-AOG may have certain structures that are organized based on, for example, topics, concepts, or flows of conversation. FIG. 6A depicts an exemplary generalized structure of an S-AOG associated with a tutoring dialog according to an embodiment of the present teachings. This general structure as shown in FIG. 6A is not subject-specific and can be used to teach any subject. The exemplary structure includes the different stages involved in a tutoring dialog, which are represented as different nodes in the S-AOG. As shown, node 600 is used for conversations related to greetings, node 605 is used for chats about, for example, weather or health, node 610 is used for conversations reviewing previously learned knowledge (e.g., as a basis for teaching an intended topic), node 615 is used for teaching an intended topic, node 620 is used for testing student users on a taught topic, and node 625 is used for conversations evaluating the student users' mastery of the taught topic based on the testing. Different nodes may be connected in a manner that encompasses different flows between the underlying sub-dialogs, but the particular flow in each dialog may be dynamically determined based on the situation. Some branches coming out of a node may be related via an AND relationship, and some branches coming out of a node may be related via an OR relationship.
As seen in FIG. 6A, a tutoring-related dialog may begin with a greeting dialog 600, such as "good morning", "good afternoon", or "good evening". There are three branches out of node 600, including going to node 605 for a short chat, going to node 610 for reviewing previously learned knowledge, and going to node 615, which begins teaching directly. The three branches are OR'd together, i.e., the dialog agent can proceed to follow any of the three branches. After the chat session 605, there are also three branches: one to the teaching node 615, one to the testing node 620, and one to the review node 610. The review node 610 also has two branches, one going to the teaching node 615 and the other going to the testing node 620 (the student may be tested first on his or her prior knowledge of the subject before teaching). In this illustrated embodiment, the teaching and testing nodes are necessary dialogs, such that the branches from nodes 605 and 610 to the teaching and testing nodes 615 and 620 are related by AND.
The teaching and testing may be iterative, as indicated by the double-headed arrow between nodes 615 and 620. Either the teaching node 615 or the testing node 620 may proceed to the evaluation node 625 as desired. That is, the evaluation may be performed based on the teaching results from node 615 or the test results from node 620. Based on the evaluation results, the conversation may proceed to one of three alternative scenarios (related by OR), including teaching 615 (teaching the concepts again), testing 620 (retesting), or reviewing 610 (to reinforce the user's understanding of some concepts), or even chatting 605 (e.g., if the user is found to be frustrated, the conversation system 100 may switch topics to keep the user engaged rather than losing the user). This general S-AOG for tutoring-related dialogs is provided by way of illustration and not limitation. The S-AOG used for tutoring can be derived from any logic flow required by the application.
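For illustration only, the tutoring flow of FIG. 6A may be sketched as a small AND-OR graph data structure. The following Python sketch is not part of the disclosed system; all names (Node, add_branches, etc.) are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """A dialog stage in the S-AOG (e.g., greeting, chat, review)."""
    name: str
    # Each group is a set of outgoing branches; "OR" means pick any one
    # branch, "AND" means all branches in the group must be covered.
    branch_groups: list = field(default_factory=list)

    def add_branches(self, relation, targets):
        assert relation in ("AND", "OR")
        self.branch_groups.append((relation, targets))

# Nodes of the tutoring S-AOG in FIG. 6A.
greeting, chat, review = Node("greeting 600"), Node("chat 605"), Node("review 610")
teach, test, evaluate = Node("teach 615"), Node("test 620"), Node("evaluate 625")

greeting.add_branches("OR", [chat, review, teach])        # any of the three
chat.add_branches("AND", [teach, test])                   # teaching/testing required
chat.add_branches("OR", [review])
review.add_branches("AND", [teach, test])
teach.add_branches("OR", [test, evaluate])                # iterate or evaluate
test.add_branches("OR", [teach, evaluate])
evaluate.add_branches("OR", [teach, test, review, chat])  # based on evaluation
```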
As seen in FIG. 6A, each node is itself a conversation and, as discussed herein, may be associated with one or more T-AOGs, each representing a conversation flow for an intended topic. FIG. 6B depicts an exemplary T-AOG with dialog content authored for a greeting for the S-AOG node 600, according to embodiments of the present teachings. A T-AOG may be defined as a dialog policy for a dialog. Following the steps defined in the T-AOG is an implementation of a strategy that achieves some intended purpose. In FIG. 6B, the content in each rectangular box represents content to be spoken by the agent, and the content in each ellipse represents content with which the user responds. As can be seen, in the T-AOG for a greeting shown in FIG. 6B, the agent first says one of the three alternative greetings available, namely, good morning 630-1, good afternoon 630-2, and good evening 630-3. The user's response to such a greeting may vary. For example, the user may repeat what the agent said (i.e., good morning, good afternoon, or good evening). Some people will repeat it and then add "you too" 635-1. Some people say, for example, "thank you, and you?" (635-2). Some would say both 635-1 and 635-2. Some people may simply remain silent 635-3. There may be other alternative ways to respond to an agent's greeting. Upon receiving a response from the user, the conversation agent may then reply to the user's response. Each reply may correspond to one of the alternative responses from the user. This is illustrated by the contents at 640-1, 640-2, and 640-3 in FIG. 6B.
The T-AOG shown in FIG. 6B may cover multiple T-AOGs. For example, 630-1, 635-2, and 640-2 in FIG. 6B may constitute one T-AOG for greetings. Similarly, 630-1, 635-1, and 640-1 may correspond to another T-AOG for a greeting; 630-2, 635-1, and 640-1 may form another T-AOG; 630-1, 635-3, and 640-3 form a different T-AOG; 630-2, 635-3, and 640-3 may form yet another T-AOG, and so on. Although different, these alternative T-AOGs all have substantially similar structures and general content. This commonality can be exploited to generate a simplified T-AOG with flexible content associated with each node. This may be achieved via, for example, graph matching. For example, the different greeting-related T-AOGs mentioned above, while having different authored content, all have a similar structure, namely an initial greeting, plus a response from the user, plus a reply to the user's response to the greeting. In this sense, the T-AOG in FIG. 6B may not correspond to the simplest general T-AOG for greetings.
In order to facilitate flexible dialog content and to enable the dialog system 100 to adapt the dialog in a personalized manner, the AOG may be parameterized. According to different embodiments of the present teachings, such parameterization may be applied to both S-AOG and T-AOG, with respect to both parameters associated with nodes in the AOG and parameters associated with links between different nodes. FIG. 6C illustrates different exemplary types of parameterization according to embodiments of the present teachings. As shown, parameterized AOGs include parameterized S-AOGs and parameterized T-AOGs. For a parameterized S-AOG, each of its nodes can be parameterized with a reward representing, for example, the reward obtained by covering the topic or concept associated with the node. In the context of tutoring, the higher the reward associated with a node in the S-AOG, the greater the value for the agent of teaching the student user the concepts associated with that node. Conversely, if the student user is already familiar with (e.g., has mastered) the concept associated with a node in the S-AOG, a lower reward is assigned to that node, since there is no further benefit in teaching the associated concept to the student. Such rewards associated with the nodes may be dynamically updated during the course of the tutoring session. This is illustrated in FIG. 6D, where S-AOG 310 has nodes associated with mathematical concepts related to fractions. As can be seen, each node representing a related concept is parameterized with a reward that is evaluated to indicate whether there is value in teaching the student that concept.
Each node in the S-AOG may have different branches, and each branch leads to another node associated with a different topic. Such branches may also be associated with parameters such as the probability of taking the respective branch, as shown in FIG. 6C. Also illustrated in FIG. 6D is the parameterization of paths in the AOG. Teaching fractions may require accumulating knowledge starting with addition and subtraction, followed by multiplication and division. Along each connection between different concepts, there is a probability of going from one to the other. For example, as shown, the link from the "addition" node 310-4 to the "subtraction" node 310-5 is parameterized with a probability P_{a,s}, which may indicate the likelihood of successfully teaching the student to understand the concept of "subtraction" if the concept of "addition" is taught first. In contrast, the probability P_{s,a} may indicate the likelihood of successfully teaching the student to understand addition if subtraction is taught first. As another example, the links from "addition" to "multiplication"/"division" are parameterized with probabilities P_{a,m} and P_{a,d}, respectively. Similarly, the links from "subtraction" to "multiplication"/"division" are parameterized with probabilities P_{s,m} and P_{s,d}, respectively. With such probabilities, the conversation agent can select an optimized path by maximizing the probability of successfully teaching the intended concepts in an order that is likely to work better. Such probabilities may also be dynamically updated based on, for example, observations from the conversation. In this way, the optimal course of teaching actions can be adjusted in real time based on the individual's situation.
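For illustration, a teaching order may be chosen by maximizing the product of such link probabilities. The following Python sketch is a toy example under assumed probability values; it is not the disclosed planning algorithm.

```python
import itertools

# Hypothetical link probabilities P[(u, v)]: likelihood of successfully
# teaching concept v after concept u has been taught (cf. FIG. 6D).
P = {
    ("addition", "subtraction"): 0.9,
    ("subtraction", "addition"): 0.6,
    ("addition", "multiplication"): 0.7,
    ("addition", "division"): 0.4,
    ("subtraction", "multiplication"): 0.5,
    ("subtraction", "division"): 0.6,
    ("multiplication", "division"): 0.8,
    ("division", "multiplication"): 0.5,
}

def order_score(order):
    """Product of link probabilities along a candidate teaching order."""
    score = 1.0
    for u, v in zip(order, order[1:]):
        score *= P.get((u, v), 0.0)  # missing link => order not allowed
    return score

concepts = ["addition", "subtraction", "multiplication", "division"]
best = max(itertools.permutations(concepts), key=order_score)
print(best, order_score(best))
```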
Parameterization may also be applied to the T-AOG, as indicated in FIG. 6C. As discussed herein, a T-AOG represents a dialog strategy for a particular topic. Each node in the T-AOG represents a specific step in the conversation, often related to what the conversation agent is going to say to the user, what the user says in reaction to the conversation agent, or an evaluation at the transition. As discussed herein, the same thing can often be said in different ways, and any way of saying it should be considered to convey the same thing. Based on this observation, the content associated with the nodes in the T-AOG can be parameterized. This is shown in FIG. 6E according to an embodiment of the present teachings. As shown in FIG. 6B, there are different ways to conduct a greeting dialog. Even for such a simple topic, there are many different ways to express nearly the same thing. The content of the greeting dialog may be parameterized in a more simplified T-AOG. FIG. 6E illustrates an exemplary parameterized T-AOG corresponding to the T-AOG shown in FIG. 6B. The initial greeting is now parameterized as "Good [ ___ ]!" 650, where the blank is parameterized with the possible instantiations "morning", "afternoon", and "evening". The user's response to the initial greeting is now divided into two cases: a spoken response 655-1, or no spoken response, i.e., silence, 655-2. The spoken answer 655-1 may be parameterized with different content choices in response to the initial greeting, as shown in the parentheses associated with 655-1. That is, any content included in parameterized set 655-1 may be identified as a possible answer from the user in response to the initial greeting from the agent. Similarly, the content of the agent's reply at 660-1 may also be parameterized as a set of all possible replies to the user's answer. In the case of user silence, the agent's response content 660-2 may also be similarly parameterized.
Another example of parameterizing content associated with nodes in a T-AOG is illustrated in FIG. 6F. This example relates to a T-AOG for testing students on the concept of "addition". As shown, the T-AOG used for this test may include the following steps: posing a question (665), asking the student for an answer (667), the student providing an answer (670-1 or 675-1), responding to the user's answer (670-2 or 675-2), and then evaluating the reward associated with the S-AOG node for "addition" (677). For each node of the T-AOG in FIG. 6F, the content associated with it is parameterized. For example, for step 665, the parameters involved include X, Y, and Oi, where X and Y are numbers and Oi refers to objects of type i. By instantiating specific values for these parameters, a variety of questions can be created. In this example in FIG. 6F, the first step at 665 of the test is to present X objects of type 1 ("o1") and Y objects of type 2 ("o2"), where X and Y are instantiated with numbers (3 and 4) and "o1" and "o2" may be instantiated with object types, such as apple and orange. Based on this instantiation of the parameters, a specific test question may be generated. At 667 of the T-AOG in FIG. 6F, to test a student, the dialog agent asks the user what the sum of X objects of type "o1" and Y objects of type "o2" is. When X, Y, o1, and o2 are instantiated with specific values, such as X = 3, Y = 4, o1 = apple, and o2 = orange, the test would present "3 apples, 4 oranges" (or even a picture thereof), and the test question could be asked via an instantiated parameterized question asking for the sum X + Y, e.g., "how many fruits are there?" or "can you tell me the total number of fruits?" In this way, flexible test questions can be generated under a generic, parameterized T-AOG. Parameterized test questions also facilitate the generation of the expected correct answer. In this example, since X and Y are instantiated as 3 and 4, respectively, the expected correct answer to the summation test question may be dynamically generated as X + Y = 7. Such a dynamically generated expected correct answer may then be used to evaluate answers from student users in response to the question. In this way, the T-AOG can be parameterized with a simpler graph structure, while enabling the dialog agent to flexibly configure different dialog content within a parameterized framework to perform the intended task.
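A minimal sketch of such template instantiation might look as follows in Python; the template strings and function names are illustrative assumptions, not the disclosed content.

```python
import random

# Hypothetical parameterized content for the "addition" test T-AOG (FIG. 6F).
QUESTION_TEMPLATES = [
    "How many {category} are there?",
    "Can you tell me the total number of {category}?",
]

def instantiate_addition_question(x, y, o1, o2, category):
    """Instantiate the parameterized question at nodes 665/667 and derive
    the expected correct answer used later for evaluation (node 677)."""
    presentation = f"{x} {o1}s, {y} {o2}s"                                  # step 665
    question = random.choice(QUESTION_TEMPLATES).format(category=category)  # step 667
    expected_answer = x + y                                                 # e.g., 3 + 4 = 7
    return presentation, question, expected_answer

presentation, question, expected = instantiate_addition_question(3, 4, "apple", "orange", "fruits")
print(presentation)  # "3 apples, 4 oranges"
print(question)      # e.g., "How many fruits are there?"
print(expected)      # 7
```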
As discussed herein, the dialog agent may also dynamically derive a basis for evaluating the test as the parameterized content is instantiated. In this example, the expected correct answer 7 is formed based on the instantiation X = 3 and Y = 4. When an answer is received, it may be categorized as an unknown answer at 670-1 (e.g., when the user does not respond at all or the answer does not contain a number) or as an answer with a number at 675-1 (whether a correct or incorrect number). The response to the answer may also be parameterized. An answer that is either unknown or incorrect may be considered a not-correct answer, which may be responded to using a parameterized response at 670-2.
Responses to incorrect answers may also be classified into different cases, and in each case, the response content may be parameterized with content appropriate for that classification. For example, when the incorrect answer is an incorrect total, the response thereto may be parameterized to address whether the incorrect answer may be due to a mistake or a guess. If the incorrect answer is because the user does not know at all (e.g., does not answer at all), then the response can be parameterized to address that situation directly using appropriate response alternatives. Similarly, the response to a correct answer may also be parameterized to account for whether it is indeed a correct answer or is estimated to be a lucky guess.
As shown in FIG. 6F, after the dialog agent responds to the answer (or lack of answer) from the user in the different situations, the T-AOG includes a step at 677 for evaluating the current reward for mastering the concept of "addition". After the evaluation, the process may return to 665 to test the student with more questions. The process may also continue to "teach" if the evaluation reveals that the student has not fully understood the concept, or exit if the student is deemed to have mastered the concept. In some cases, the process may also handle exceptions; for example, if it is detected that the student is unable to complete the assignment at all, the system may consider switching topics temporarily, as discussed with respect to FIG. 6A.
Furthermore, since the T-AOG corresponds to a conversation policy (which indicates possible alternative flows for the conversation), an actual conversation may follow a particular path in the T-AOG, traversing a portion of the T-AOG. Different users, or the same user in different dialog sessions, may generate different paths embedded in the same T-AOG. Such information may be useful for allowing the dialog system 100 to personalize the dialog by parameterizing the links along different paths with respect to different users, and such parameterized paths may represent, for each user, what works and what does not. For example, for each link between two nodes in a T-AOG, the reward for that link may be estimated with respect to each student user's performance in understanding the underlying concepts being taught. Such path-centric rewards may be calculated based on probabilities associated with the different branches of each node along the path. FIG. 6G illustrates a T-AOG associated with a user, having different paths between different nodes parameterized with rewards updated based on dynamic information observed in a conversation with the user, in accordance with an embodiment of the present teachings. In this exemplary parameterized T-AOG (similar to the T-AOG presented in FIG. 6F), after X objects o1 and Y objects o2 are presented to the student user at 680, the agent asks the user for the total of X + Y at 685. Based on previous teaching or testing of the same user, there may be an estimated likelihood of how the student will do in this round, i.e., for each possible outcome (690-1, 690-2, and 690-3), there is an associated reward R_{11}, R_{12}, and R_{13}, respectively.
If the answer from the student is incorrect (690-2), it may be responded to in different ways, e.g., 695-1, 695-2, and 695-3. Based on the user's past experience or known personality (also estimated in a personalized manner), there may be different reward values R_{22}, R_{23}, and R_{24} respectively associated with each possible response. For example, if the user is known to be sensitive and to do better with encouraging or positive feedback, the reward associated with response 695-2 may be highest. In this case, the dialog system 100 may choose to respond to the incorrect answer with response 695-2, which is more positive and encouraging. For example, the conversation agent may say, "Almost. Try thinking about it in a new way." A different user may prefer not to be told of the error, in which case the reward R_{22} linked to response 695-1 may be the highest compared with R_{23} and R_{24} for that user. Such reward values associated with alternative paths of the T-AOG are personalized based on knowledge of the particular user and/or past dealings with the user. By configuring the AOG with parameters for both nodes and paths, the dialog system 100 can dynamically configure and update parameters during each dialog to personalize the AOG, so that dialogs can be conducted in a flexible (content is parameterized), personalized (parameters are calculated based on personalized information), and thus more efficient manner.
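For illustration, personalized response selection over such parameterized branches might be sketched as follows; the reward values and profile labels are invented for the example.

```python
# Hypothetical personalized rewards for the response branches 695-1..695-3
# of FIG. 6G; values would be learned per user from past conversations.
response_rewards = {
    "user_sensitive": {
        "695-1 (point out the error)":  0.2,  # R22
        "695-2 (encouraging feedback)": 0.9,  # R23
        "695-3 (give a hint)":          0.5,  # R24
    },
    "user_direct": {
        "695-1 (point out the error)":  0.8,
        "695-2 (encouraging feedback)": 0.3,
        "695-3 (give a hint)":          0.4,
    },
}

def select_response(user_profile):
    """Pick the response branch with the highest personalized reward."""
    rewards = response_rewards[user_profile]
    return max(rewards, key=rewards.get)

print(select_response("user_sensitive"))  # "695-2 (encouraging feedback)"
print(select_response("user_direct"))     # "695-1 (point out the error)"
```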
As discussed herein and shown in FIG. 2A, the information state 110 is represented based not only on AOGs, but also on various types of information, such as conversation context 260, conversation history 250, user profile 290, event-centric knowledge 270, and certain common sense models 280. The representations of the different minds 200- are generated based on both the AOGs and such dynamic information. For example, while AOGs are used to represent the different minds, their corresponding PGs (the results of traversing the AOGs based on the dialog) are generated based on the actual traversals (nodes and paths) in the AOGs and the dynamic information collected during the dialog. For example, the values of parameters associated with nodes/links in an AOG may be dynamically estimated based on the ongoing conversation, conversation history, conversation context, user profile, events occurring during the conversation, and so on. In view of this, to update the information state 110, different types of information (such as knowledge about events, the surrounding environment, user characteristics, activities, etc.) may be tracked, and this tracked knowledge may then be used to update the different parameters and ultimately the information state 110.
As discussed herein, AOGs/PGs are used to represent different minds, including that of the robotic agent (designed according to what it is expected to do), that shared between the robotic agent and the user (derived based on the conversation actually occurring), and that of the user (estimated based on the occurring conversation and the user's performance in the conversation). When AOGs and PGs are parameterized, the values of parameters associated with nodes and links may be evaluated based on, for example, information related to the conversation, the user's performance, the user's characteristics, and optionally the event(s) occurring during the conversation, etc. Based on this dynamic information, such representations of the minds may be updated over time during the dialog based on changing circumstances.
FIG. 7A depicts a high-level system diagram of a knowledge tracking unit 700, which is used to track information and update rewards associated with nodes/paths in an AOG/PG, in accordance with embodiments of the present teachings. As discussed herein, the nodes in an AOG may each be parameterized with a state-related reward, and the paths in a PG may also each be parameterized with a path-related reward or utility. The reward/utility associated with a state or node in an AOG may represent the degree of mastery of the concept associated with that node. The higher the degree of mastery of the concept of an AOG node, the lower the associated state reward/utility for that node, i.e., the reward/utility of teaching an already-mastered concept is considerably lower. The reward/utility is personalized and derived based on an assessment of the user's performance, and such assessment may be performed continuously or periodically during the session.
As discussed herein, each S-AOG node associated with a concept (e.g., to be taught in tutoring) may have one or more T-AOGs, each of which may correspond to a particular manner of teaching the concept. A parse path or PG is formed based on the nodes and links in the T-AOG traversed during the dialog. The reward/utility associated with a path in a T-AOG or T-PG may indicate the likelihood that such a path will result in a successful tutoring session or successful mastery of the concept. In view of this, the better the evaluated performance of the user while traversing along a path, the higher the path reward/utility associated with that path. Such path-related rewards/utilities may also be determined based on the performance of multiple users, statistically indicating which pedagogical style works better for a group of users. Such estimated rewards/utilities along different branch paths may be particularly helpful in conversation sessions when determining which branch path to take to continue a conversation, and may guide the conversation agent to select a path that has a statistically better chance of leading to better performance (i.e., reaching a mastery level of the concept sooner).
The illustrated embodiment shown in FIG. 7A is directed to tracking state and path rewards during a conversation. In this illustrative embodiment, the rewards associated with nodes and paths are determined based on different probabilities estimated from the dynamic conditions of the conversation. For example, for a node in the S-AOG associated with, say, the "addition" concept, the reward is a state-based reward that indicates whether there is value in teaching the "addition" concept to a particular user. For each student user registered to learn mathematics from the robotic agent, the reward value of each mathematical-concept node of the S-AOG is adaptively calculated. The reward for each node in such an S-AOG (e.g., the mathematical concept "addition") may be assigned an initial reward value, and the reward value may continue to change as the user engages in the conversation indicated by the associated T-AOG (the conversation flow on the "addition" concept). During the dialog specified by the T-AOG, the robotic agent may ask the user a question, and the user then answers the question. Answers from the user may be continuously evaluated, and the probability of whether the user is learning or making progress estimated. Such probabilities, estimated while traversing the T-AOG, may be used to estimate the reward associated with the node in the S-AOG representing the concept "addition", which indicates whether the user has mastered the concept. That is, the reward value associated with the node representing the concept is updated during the dialog. If the teaching is successful, the reward may be reduced to a low value, indicating that there is no further value or reward in teaching the student that concept because the student has mastered it. As can be seen, such state-based rewards are personalized in that they are calculated based on each user's performance in the conversation.
There are also rewards associated with different paths in the T-AOG. A T-AOG includes different nodes, each of which may have multiple branches representing available alternative paths. The selection of different branches results in different traversals through the underlying T-AOG, and each traversal yields a T-PG. In a tutoring application, in order to track the effectiveness of the tutoring, at each node of the T-AOG (traversed during a dialog), the different branches may be associated with respective measurements, which may indicate the likelihood of achieving the intended goal when selecting the respective branch. The higher the measurement associated with a branch, the more likely it is to lead to a path that fulfills the intended purpose. However, optimizing the selection of branches at each individual node may not result in an overall optimal path. In some embodiments, rather than optimizing the individual selection of branches at each node, the optimization may be performed on a path basis, i.e., with respect to a path of a particular length. In operation, such path-based optimization may be implemented as a look-ahead operation, i.e., determining the best choice at the current branch of the current node when considering the next K choices along the path. This look-ahead operation selects the branch based on a composite measurement along each possible path, determined from the measurements accumulated on the links from the current node along that path. The length of the look-ahead may vary and may be determined based on application needs. The composite measurements associated with all available alternative paths (originating from the current node) may be referred to as path-based rewards. The branch from the current node may then be selected by maximizing the path-based reward over all possible traversals from the current node.
The rewards along the paths of the T-AOG may be determined based on a plurality of probabilities derived from the user's performance observed during the conversation. For example, at the current node in the T-AOG, the dialog agent may ask the student user a question and then receive an answer from the user in response, where the answer corresponds to a branch from the question's node in the T-AOG. A measure associated with the reward for that branch may then be estimated based on the probability. Such measures, as well as the path-based rewards, are personalized in that they are calculated based on personal information observed from conversations involving the particular user. Measures associated with different branches along a T-AOG path (associated with an S-AOG node) may be used to estimate the reward of the S-AOG node with respect to the student's mastery level. These rewards (including node-based rewards and path-based rewards) may constitute the "utilities" or preferences of the user and may be used by the robotic agent to adaptively determine how to continue the conversation in a utility-driven conversation plan. This is illustrated in FIG. 7B, which shows how knowledge can be tracked on the basis of the "shared mind" (which represents the actual dialog) to dynamically update models, i.e., parameters in the parameterized AOGs, e.g., rewards for the degree of mastery of basic concepts in the S-AOG of the agent's representation of the student/user's mind, and/or rewards for different paths in the T-AOGs in the agent's mind; these models can then be used by the agent to perform utility-driven dialog planning according to the dynamic situation of the dialog with a particular user.
Referring back to FIG. 7A, to perform knowledge tracking and update the information state 110 based on the tracked knowledge, the knowledge tracking unit 700 includes an initial knowledge probability estimator 710, a know-positive probability estimator 720, a know-negative probability estimator 730, a guess probability estimator 740, a state reward estimator 760, a path reward estimator 750, and an information state updater 770. FIG. 7C is a flowchart of an exemplary process of the knowledge tracking unit 700 according to an embodiment of the present teachings. In operation, the initial knowing probabilities of the nodes in the representation of the relevant AOG may first be estimated at 705. This may include an initial knowing probability for both each relevant node in the S-AOG and each branch of each T-AOG associated with an S-AOG node.
With the estimated initial probabilities, the conversation agent can converse with the user on a particular topic represented by the relevant S-AOG node and a particular T-AOG for that S-AOG node, the associated probabilities having been initialized. To initiate a conversation, the robotic agent begins the conversation by following the T-AOG. When the user responds to the robotic agent, the NLU engine 120 may analyze the response and generate a language understanding output. In some embodiments, to understand the user's utterance, the NLU engine 120 may also perform language understanding based on information other than the utterance (e.g., information from the multimodal information analyzer 702). For example, a user may say "this is a machine toy" while pointing at a toy on a table. To understand the semantics of this utterance (i.e., what "this" means), the multimodal information analyzer 702 may analyze the audio and visual information to combine cues in different modalities, facilitating the NLU engine 120 in understanding the user's meaning and outputting the user's response with an assessment of the correctness of the response, e.g., based on the T-AOG.
When the knowledge tracking unit 700 receives the user's response with the assessment at 715, in order to track knowledge based on what is happening in the conversation, different modules may be invoked to estimate corresponding probabilities based on the received input. For example, if the user's response corresponds to a correct answer, the know-positive probability estimator 720 may be invoked to determine the probability associated with positively knowing the correct answer; the know-negative probability estimator 730 may be invoked to estimate the probability associated with not knowing the answer; and the guess probability estimator 740 may be invoked to determine a probability evaluating the likelihood that the user merely made a guess. If the user's response corresponds to an incorrect answer, the know-positive probability estimator 720 may determine the probability associated with knowing the concept but making a slip; the know-negative probability estimator 730 may estimate the probability associated with not knowing and answering incorrectly; and the guess probability estimator 740 may determine the probability that the answer was only a guess. These steps are performed at 725, 735, and 745, respectively.
As discussed herein, for a T-AOG, when a user interacts with the conversation agent, the interactions form a parse graph that continues to grow as the conversation progresses. One example is shown in FIG. 5A. Given a parse graph, or the history of interactions between the robotic agent and the user, the probability that the user knows the underlying concept may be adaptively updated based on the estimated probabilities. In some embodiments, the probability of knowing a concept at time t+1 may be updated based on the observations. In some embodiments, it may be calculated based on the following formulas:
P(L_{t+1} | obs = correct) = P(L_t)(1 - P(S)) / [P(L_t)(1 - P(S)) + (1 - P(L_t))P(G)]

P(L_{t+1} | obs = wrong) = P(L_t)P(S) / [P(L_t)P(S) + (1 - P(L_t))(1 - P(G))]
where P(L_{t+1} | obs = correct) represents the probability of knowing the concept given an observed correct answer at time t+1, P(L_{t+1} | obs = wrong) represents the knowing probability given an observed wrong answer at time t+1, P(L_t) is the knowing probability at time t, P(S) is the probability of a slip, and P(G) represents the probability of a guess. Thus, the probability of a priori knowledge can be dynamically updated with probabilities estimated based on observations of the conversation, as shown in the examples herein. This a priori knowledge probability associated with the nodes in the S-AOG may then be used by the state reward estimator 760, at 755 in FIG. 7C, to compute the state-based or node-based rewards associated with the nodes in the S-AOG, which represent the user's mastery of the relevant skills associated with the concept nodes.
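For illustration, this update (a Bayesian knowledge tracing style posterior) can be sketched in a few lines of Python; the numeric values in the example are arbitrary.

```python
def update_knowing_probability(p_L, p_S, p_G, correct):
    """Update P(L_{t+1}) from P(L_t) given one observed answer.

    p_L: P(L_t), probability the user already knows the concept
    p_S: P(S), probability of a slip (knows it but answers wrong)
    p_G: P(G), probability of a guess (doesn't know it but answers right)
    correct: whether the observed answer was correct
    """
    if correct:
        num = p_L * (1.0 - p_S)
        denom = num + (1.0 - p_L) * p_G
    else:
        num = p_L * p_S
        denom = num + (1.0 - p_L) * (1.0 - p_G)
    return num / denom

# Example: start at P(L) = 0.3 with slip 0.1 and guess 0.2.
p = 0.3
for answer_correct in [True, True, False, True]:
    p = update_knowing_probability(p, 0.1, 0.2, answer_correct)
    print(round(p, 3))
```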
Based on the probabilities computed for the different branches of each node along a PG path in the T-AOG (e.g., some corresponding to correct answers and some corresponding to wrong answers), a path-based reward may be computed by the path reward estimator 750 for each path at 765 in FIG. 7C. Based on such estimated state-based and path-based rewards, the information state updater 770 may then continue to update the parameterized AOGs in the information state 110 at 775. When the parameters associated with the AOGs in the information state 110 are updated, the updated parameterized AOGs can then be used to control the conversation based on the utilities (preferences) of the user.
In some embodiments, the different parameters for parameterizing the AOGs may be learned based on observed and/or calculated probabilities. In some embodiments, unsupervised learning methods may be employed to learn such model parameters, including, for example, knowledge tracking parameters and/or utility/reward parameters. Such learning may be performed online or offline. In the following, an exemplary learning scheme (a forward-backward re-estimation) is provided:
α_1(j) = π_j b_j(o_1), j ∈ [1, N]

α_{t+1}(j) = [ Σ_{i=1..N} α_t(i) a_{ij} ] b_j(o_{t+1}), j ∈ [1, N]

β_T(i) = 1, i ∈ [1, N]

β_t(i) = Σ_{j=1..N} a_{ij} b_j(o_{t+1}) β_{t+1}(j), i ∈ [1, N]

γ_t(i) = α_t(i) β_t(i) / Σ_{j=1..N} α_t(j) β_t(j)

ξ_t(i, j) = α_t(i) a_{ij} b_j(o_{t+1}) β_{t+1}(j) / Σ_{i',j'} α_t(i') a_{i'j'} b_{j'}(o_{t+1}) β_{t+1}(j')

â_{ij} = Σ_{t=1..T-1} ξ_t(i, j) / Σ_{t=1..T-1} γ_t(i)

b̂_j(k) = Σ_{t: o_t = k} γ_t(j) / Σ_{t=1..T} γ_t(j)
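Assuming, as the α/β recursions above suggest, that the learning scheme follows the standard forward-backward (Baum-Welch style) re-estimation, a compact NumPy sketch might look as follows; it is provided for illustration only.

```python
import numpy as np

def forward_backward_update(A, B, pi, obs):
    """One Baum-Welch re-estimation pass for an HMM (A: NxN transitions,
    B: NxM emissions, pi: N initial probabilities, obs: observation indices)."""
    N, T = A.shape[0], len(obs)

    alpha = np.zeros((T, N))                   # forward probabilities
    alpha[0] = pi * B[:, obs[0]]               # alpha_1(j) = pi_j b_j(o_1)
    for t in range(T - 1):
        alpha[t + 1] = (alpha[t] @ A) * B[:, obs[t + 1]]

    beta = np.zeros((T, N))                    # backward probabilities
    beta[T - 1] = 1.0                          # beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])

    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)  # gamma_t(i)

    xi = np.zeros((T - 1, N, N))               # xi_t(i, j)
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :]
        xi[t] /= xi[t].sum()

    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    B_new = np.zeros_like(B)
    for k in range(B.shape[1]):
        B_new[:, k] = gamma[np.array(obs) == k].sum(axis=0) / gamma.sum(axis=0)
    return A_new, B_new, gamma[0]              # gamma[0] re-estimates pi
```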
FIGS. 8A-8B depict utility-driven conversation planning based on dynamically computed AOG parameters, according to an embodiment of the present teachings. Utility-driven conversation planning can include conversation node planning and conversation path planning. The former may refer to selecting a node in the S-AOG to continue the conversational session. The latter may refer to selecting a path in the T-AOG for conducting the conversation. FIG. 8A illustrates an example of utility-driven tutoring planning with respect to a parameterized S-AOG, according to an embodiment of the present teachings. FIG. 8B illustrates an example of utility-driven path planning in a parameterized T-AOG, according to an embodiment of the present teachings.
For node planning, as shown in FIG. 6D, an exemplary S-AOG 310 is used to teach various mathematical concepts, and each node corresponds to a concept. FIG. 8A shows the different nodes, which have reward-related parameters associated with them, and some of which may be parameterized with conditions established based on the rewards of connected nodes. As seen in FIG. 8A, node 310-4 is used to teach the concept "addition", node 310-5 is used to teach the concept "subtraction", and so on. Each node is parameterized with, for example, a reward for teaching the concept, which is related to, for example, the current level of mastery of the concept. The rewards associated with some nodes in S-AOG 310 are expressed as functions of the reward parameters of the nodes to which they are connected.
Some concepts may need to be taught under the requirement or condition (e.g., a prerequisite) that the user already masters some other concept. For example, to teach the student the concept of "division", the user may be required to already grasp the concepts of "addition" and "subtraction". This may be indicated by the requirement 820, expressed as R_d = F_d(R_a, R_s), where the reward R_d associated with node 310-3 is a function F_d of R_a and R_s, which represent the rewards associated with node 310-4 for "addition" and node 310-5 for "subtraction", respectively. For example, an exemplary condition for teaching the concept of "division" 310-3 may be that its own reward level must be high enough (i.e., the user has not mastered the concept of "division") while the reward R_a for "addition" (310-4) and the reward R_s for "subtraction" (310-5) must be low enough (i.e., the user has mastered the prerequisite concepts of "addition" and "subtraction"). The function F_d can be designed according to application requirements to satisfy these conditions.
Node-based planning may be set up such that the conversation (T-AOG) associated with a node in the S-AOG conditioned on certain reward criteria may not be scheduled until the reward conditions associated with that node are met. In this way, initially, when the user knows none of the concepts, only the unconditional nodes 310-4 and 310-5 can be scheduled. During a dialog for "addition" or "subtraction", the associated reward (R_a or R_s) can be continuously updated and propagated to nodes 310-2 and 310-3, so that R_m or R_d is also updated according to the formula F_m or F_d. When the user grasps the concepts of "addition" and "subtraction" to a certain point, the rewards R_a and R_s become low enough that there is no need to schedule dialogs associated with nodes 310-4 and 310-5. At the same time, the low R_a and R_s can be substituted into F_m or F_d, such that the conditions associated with nodes 310-2 and 310-3 may now be satisfied and 310-2 and 310-3 are placed in an active state, because R_m or R_d can now become high enough that these nodes are ready to be selected for a dialog on the topics of multiplication and division. When this happens, a dialog for teaching the corresponding concept can be initiated using the T-AOG associated with it.
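A minimal sketch of such reward-conditioned node scheduling follows; the threshold, the specific form of the condition function, and all names are assumptions made for illustration.

```python
MASTERED = 0.2  # assumed threshold: reward below this means "mastered"

def F_prereq(own_reward, prereq_rewards):
    """Hypothetical condition function (e.g., F_m or F_d): a node is
    schedulable only if it is not yet mastered and all prerequisites are."""
    not_mastered = own_reward > MASTERED
    prereqs_done = all(r <= MASTERED for r in prereq_rewards)
    return own_reward if (not_mastered and prereqs_done) else 0.0

# Rewards R_a, R_s, R_m, R_d for addition, subtraction, multiplication, division.
R = {"add": 0.1, "sub": 0.15, "mul": 0.9, "div": 0.85}
prereqs = {"add": [], "sub": [], "mul": ["add", "sub"], "div": ["add", "sub"]}

schedulable = {c: F_prereq(R[c], [R[p] for p in prereqs[c]]) for c in R}
# Addition/subtraction are mastered, so multiplication and division become
# active; the planner picks the active node with the highest reward.
next_topic = max(schedulable, key=schedulable.get)
print(schedulable, next_topic)  # next_topic == "mul"
```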
The same applies to the node for "fractions". The user may be required to have mastered the concepts of "multiplication" and "division" (the rewards for 310-2 and 310-3 being sufficiently low), whereupon the reward for the "fractions" node becomes reasonably high. In this manner, the state-based rewards associated with the nodes in the S-AOG may be used to dynamically control how to traverse between the nodes of the S-AOG in a personalized manner, e.g., in a manner that adapts based on the circumstances of each individual. That is, in actual conversations with different users, the traversal can be adaptively controlled in a personalized manner based on observations of the actual conversation situation. For example, in FIG. 8A, different nodes may have different rewards at different times, depending on the stage of teaching. As shown, node 310-4 for "addition" is darkest, representing, for example, the lowest reward value, which may indicate that the user has mastered the concept of "addition". Node 310-5 for "subtraction" has an intermediate reward value, indicating, for example, that the user has not yet mastered the concept but is close. Nodes 310-1, 310-2, and 310-3 are light-colored, indicating, for example, high reward values and that the user has not mastered the corresponding concepts.
Path-dependent or path-based rewards associated with paths in a T-AOG may also be dynamically computed based on observations of the actual conversation, and may also be used to adjust how the T-AOG is traversed (how branches are selected) during the conversation. FIG. 8B illustrates an example of utility-driven path planning with respect to a T-AOG, according to an embodiment of the present teachings. As shown, while traversing the T-AOG, at each time, e.g., at time t upon receiving an answer from the user, the robotic agent needs to determine how to respond. During times 1, ..., t, the dialog traverses the parse graph pg_{1...t}, with traversed states s_1, s_2, ..., s_t. In responding, from state s_t there may be multiple branches leading to the next state s_{t+1}.
To determine which branch to take, a look-ahead operation may be performed based on path-based rewards along the alternative paths. For example, to look ahead one step, the rewards associated with the alternative branches available from s_t (one step out) may be considered, and the branch representing the best path-based reward may be selected. To look ahead two steps, the rewards associated with each branch from s_t and with each secondary branch (originating from each branch in the first set of alternative branches) are considered, and the branch that leads to the best path-based reward is selected as the next step. Deeper look-ahead can also be achieved based on the same principles. The example shown in FIG. 8B is a scheme that implements a two-step look-ahead, i.e., at time t, the range of the look-ahead includes the multiple paths at t+1 and each of the paths at t+2 that originate from each path at t+1. A branch is then selected via look-ahead to optimize the path-based reward.
The path-based reward associated with a branch may be initialized first and then updated during the session. In some embodiments, an initial path-based reward may be calculated based on the user's previous sessions. In some embodiments, such initial path-based rewards may also be calculated based on previous conversations of similarly situated users. Each path-based reward may then be dynamically updated over time during the dialog based on how each branch selection contributes to satisfying the intended purpose of the dialog. Based on such dynamically updated path-based rewards, the look-ahead optimization scheme may be driven by the utility (or preference) of each user as to how to converse. Thus, it enables adaptive path planning. The following is an exemplary formulation of path planning that optimizes path selection based on look-ahead operations. In this exemplary formulation, a* is the optimal choice given a number of branch choices a, the current state s_t, and the parse graph pg_{1...t}; EU is the expected utility of branch choice a; and R(s_{t+1}, a) represents the reward of selecting a at state s_{t+1}. As can be seen, the optimization is recursive, which allows look-ahead at any depth.
a* = arg max_a EU(a | s_t, pg_{1...t})

EU(a | s_t, pg_{1...t}) = Σ_{s_{t+1}} P(s_{t+1} | s_t, a) [ R(s_{t+1}, a) + max_{a'} EU(a' | s_{t+1}, pg_{1...t+1}) ]
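For illustration, the recursive look-ahead may be sketched as follows in Python, under the assumption that branch outcomes are probabilistic with known transition probabilities and rewards; the model values are invented.

```python
# Hypothetical model: branches[s] lists the choices a available at state s;
# outcomes[(s, a)] lists (next_state, probability, reward) triples.
branches = {"s0": ["hint", "retest"], "s1": [], "s2": ["hint"], "s3": []}
outcomes = {
    ("s0", "hint"):   [("s1", 0.7, 0.5), ("s2", 0.3, 0.1)],
    ("s0", "retest"): [("s3", 1.0, 0.3)],
    ("s2", "hint"):   [("s3", 1.0, 0.4)],
}

def expected_utility(state, choice, depth):
    """EU(a | s): expected immediate reward plus best follow-on utility."""
    eu = 0.0
    for nxt, prob, reward in outcomes[(state, choice)]:
        future = 0.0
        if depth > 1 and branches[nxt]:
            future = max(expected_utility(nxt, a, depth - 1) for a in branches[nxt])
        eu += prob * (reward + future)
    return eu

def best_branch(state, depth=2):
    """Two-step look-ahead branch selection, as in FIG. 8B."""
    return max(branches[state], key=lambda a: expected_utility(state, a, depth))

print(best_branch("s0"))  # compares "hint" vs "retest" with look-ahead
```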
In conjunction with state-based, utility-driven node planning, the conversation system 100 in accordance with the present teachings is able to dynamically control conversations with users based on past accumulated knowledge about the users and instantaneous observations of the users, in line with the intended purpose of the underlying conversation. FIG. 8C illustrates utility-driven dialog management for a dialog with a student user based on a combination of node and path planning, according to embodiments of the present teachings. That is, in a dialog with a student user, the dialog agent conducts the dialog via utility-driven dialog management, based on dynamic node and path selection over the parameterized AOGs.
In FIG. 8C, S-AOG 310 includes various nodes for the various concepts to be taught, with annotated rewards and/or conditions. The reward associated with a node may be predetermined based on knowledge about the user. For example, as shown, four nodes (310-2, 310-3, 310-4, and 310-5) may have lower rewards (represented as darker nodes), indicating, for example, that the student user has mastered the concepts of addition, subtraction, multiplication, and division. There is one node with a high teaching reward (i.e., a dialog can be scheduled), which is node 310-1 for "fractions". Thus, selecting one of the S-AOG nodes to conduct a conversation is reward-driven or utility-driven node planning.
Node 310-1 is shown associated with one or more T-AOGs 320, each corresponding to a conversation policy governing a conversation to teach students the concept of "fractions". One of the T-AOGs (i.e., 320-1) may be selected to govern the conversation session, and T-AOG 320-1 includes various steps, such as 330, 340, 350, 360, 370, 380, and so on. T-AOG 320-1 may be parameterized with path-based rewards, for example. During the session, path planning can be done dynamically using the path-based rewards to optimize the likelihood of achieving the goal of teaching students to master the concept of "fractions". As shown, the nodes highlighted in 320-1 correspond to the path selected based on the path plan, forming a parse graph that represents a dynamic traversal based on the actual dialog. This illustrates that knowledge tracking during a conversation enables the conversation system 100 to continually update the parameters of the parameterized AOGs to reflect the utilities/preferences learned from the conversation, and such learned utilities/preferences in turn enable the conversation system 100 to adjust its path planning, thereby making the conversation more efficient, engaging, and flexible.
As shown in FIG. 6F, parameterized content associated with different nodes may be used to create a T-AOG. The parameterized content associated with each node represents what is expected to be spoken/heard during the conversation. The more alternative content included in the parameterized content associated with each node, the more flexibly the parameterized T-AOG can support dialog. Such alternative content for each node may be authored manually, semi-automatically, or automatically. Such authored alternative content may also be used as training data to facilitate effective recognition and to improve the adaptability of dialog management. FIGS. 9A-9B illustrate a scheme to enhance spoken language understanding in human-machine conversation through automatic enrichment of the parameterized content of an AOG, according to embodiments of the present teachings. FIG. 9A presents the T-AOG seen in FIG. 6F, except that each node in FIG. 9A is now associated with one or more alternative sets of content that the dialog manager can use to conduct a dialog. As discussed herein, such alternative parameterized dialog content sets may also be used to train ASR and/or NLU models to understand the utterances to be spoken.
The training data set associated with a node represents the parameterized content of that node. For example, as shown in FIG. 9A, node 665 is associated with two training data sets, [T_N] and [T_O], where the former is for numbers (X and Y may be data items contained therein) and the latter is for objects o1 or o2 (e.g., apple, orange, pear, etc.); node 667 is associated with a training data set [T_I] for query statements; node 670-1 is associated with a training data set [T_NKA] for the user's unknown answers; node 675-1 is associated with a training data set [T_N] for the user's numeric answers; node 670-2 is associated with a training data set [T_RNCA] for responses to not-correct answers (which are either alternative responses to unknown answers or alternative responses to incorrect answers); node 675-2 is associated with a training data set [T_RCA] for responses to correct answers; node 910 is associated with a training data set [T_RNK] for responding to unknown answers from the user; node 920 is associated with a training data set [T_RI] for responding to incorrect answers from the user; node 930 is associated with a training data set [T_RC] for responding to correct answers from the user; node 940 is associated with a training data set [T_Rm] for responding to mistaken answers from the user; and node 950 is associated with a training data set [T_RG] for responding to guessed answers from the user.
FIG. 9B illustrates exemplary training data sets associated with different nodes of the T-AOG in FIG. 9A according to embodiments of the present teachings. As shown, for example, [T_N] can be any set of single- or multi-digit numbers; [T_O] may include the names of objects available for substitution; [T_I] may include alternative ways of asking for the total of two numbers; [T_NKA] may include alternative ways for the user to say "I don't know"; [T_RC] may include alternative ways of responding to a correct answer; [T_RNK] may include alternative ways of responding to an unknown answer from the user; [T_Rm] may include alternative ways of responding to a mistaken answer from the user; [T_RG] may include alternative ways of responding to a guessed answer from the user; [T_RI] may include alternative ways of responding to an incorrect answer from the user, which may include the alternative responses to mistaken answers [T_Rm] or the alternative responses to guessed answers [T_RG]; [T_RNCA] may include alternative ways of responding to not-correct answers from the user, which may include the alternative responses to unknown answers [T_RNK] or the alternative responses to incorrect answers [T_RI]; and [T_RCA] may include alternative ways of responding to correct answers from the user, which may include the alternative responses to correct answers [T_RC] or the alternative responses to guessed answers [T_RG]. The training data sets for responses (e.g., [T_RC], [T_RNK], ..., [T_RCA]) may be used to generate the robotic agent's responses to the user's answers. This can be a supplement to the ASR/NLU understanding of utterances.
Such enriched training data sets associated with different nodes in the T-AOG may significantly improve the ability of the robotic agent to understand different ways of expressing the same answer from a user, and the flexibility of generating responses to the user under different circumstances. The enriched training data sets may be automatically generated in a bootstrapping manner. FIG. 9C illustrates exemplary types of language variants (alternatives) for generating enriched training data for nodes with parameterized content, according to embodiments of the present teachings. The language variants may be due to different spoken languages, alternative expressions, different accents, or even combinations with different specified acoustic characteristics (e.g., pitch, volume, speed, etc.). With respect to alternative expressions, for each text string there may be linguistically and semantically equivalent expressions based on, for example, synonyms or slang. To enhance the capabilities of the robotic agent, it is desirable to have more alternative content associated with each parameterized node. However, manual generation can be time consuming, tedious, and expensive, and an efficient method is therefore needed to create such alternative content. The present teachings disclose methods of automatically generating a rich training data set associated with a node in a T-AOG based on, for example, the authored content associated with each node. That is, by using the already authored content associated with each node as a basis, the disclosed method automatically generates alternative content with respect to the authored content. The initial authored content may then be combined with such automatically generated alternative content as the training data set for the parameterized content associated with the node.
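A toy Python sketch of such bootstrapped enrichment via synonym substitution follows; the synonym table and function names are illustrative only. A practical system could additionally draw on paraphrase models, slang lexicons, or accent/acoustic variation as in FIG. 9C.

```python
import itertools

# Illustrative synonym table; entries map a word to its alternatives.
SYNONYMS = {
    "total": ["total", "sum", "overall number"],
    "fruits": ["fruits", "pieces of fruit"],
}

def enrich(authored):
    """Bootstrap alternative content from one authored sentence by
    substituting synonym combinations for known words."""
    words = authored.split()
    slots = [SYNONYMS.get(w, [w]) for w in words]
    return {" ".join(combo) for combo in itertools.product(*slots)}

seed = "can you tell me the total number of fruits"
for variant in sorted(enrich(seed)):
    print(variant)
# 3 x 2 = 6 variants generated from a single authored sentence
```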
FIG. 10A depicts an exemplary high-level system diagram 1000 for automatically generating enriched AOG content and its use in training ASR/NLU models for enhanced performance, according to an embodiment of the present teachings. System 1000 is provided for generating enriched training data sets for the different nodes in a T-AOG, and then training ASR and NLU models based on such enriched training data sets to obtain enhanced ASR and NLU models. In this illustrated embodiment, system 1000 includes a parameterized AOG retriever 1010, a parameterized AOG training data generator 1020, an ASR model training engine 1035, and an NLU model training engine 1040. As discussed herein, AOGs are used to represent different aspects of managing robot dialogs. Each S-AOG may include a collection of nodes, each node relating to a particular topic within a subject. Each S-AOG node for a particular topic may be associated with one or more T-AOGs, each corresponding to a conversation policy on that topic and including the back-and-forth exchanges between the user and the robotic agent on that topic. As discussed herein, a T-AOG may also have multiple nodes, each of which may represent one or more alternative content items spoken by a participant (agent or user) in the conversation. A T-AOG having multiple nodes may be created with parameterized content associated with each node. The parameterized content of a node may be initially populated with some authored content generated at the time the T-AOG is created. This initially parameterized content can be expanded or enriched. In accordance with the present teachings, the goal is to enrich the parameterized content associated with each node to include other alternatives that can be used to carry out conversations in a more flexible and enriched manner.
In operation, the initial parameterized content associated with a T-AOG node may be used as a basis for generating a rich set of parameterized content, which is then used as training data to train ASR and/or NLU models to enhance the ability of the robotic agent to conduct more flexible conversations. FIG. 10B is a flowchart of an exemplary process for obtaining an enhanced ASR/NLU model based on automatically generated enriched AOG content, according to an embodiment of the present teachings. At 1055, the parameterized AOG retriever 1010 first selects topic-based templates (AOGs) from storage 1005, e.g., S-AOGs, and then, at 1060, retrieves the T-AOGs associated with each node in the selected S-AOGs, each having an initial set of parameterized content associated with it. The parameterized content of such retrieved T-AOG nodes is then sent to the parameterized AOG training data generator 1020 to generate enriched parameterized content.
To enrich the authored content of the T-AOG nodes, the parameterized AOG training data generator 1020 accesses the authored content associated with each T-AOG node and, at 1065, uses it as a basis for generating enriched content. The parameterized AOG training data generator 1020 accesses the language variant models 1015 to obtain known language variants and, at 1070, generates enriched training data based on the initial authored content of each T-AOG node. The enriched training data thus generated is then used to create enriched parameterized AOGs, which are stored in the enriched parameterized AOG storage 1025. At the same time, the enriched parameterized content sets are used as enriched training data and stored in the enriched training data storage 1030.
Based on this enriched training data, the ASR model training engine 1035 trains the ASR models 1045 at 1075, using training data 1030 bootstrapped from the initially authored content based on the language variant models 1015. As discussed herein, language variants can be modeled in terms of speech style (language, accent, etc.) and speech content (different ways of saying the same thing). The ASR models 1045, obtained based on the enriched training data, may then be used to recognize utterances whose phonetic content and style are captured in the enriched parameterized content sets. The derived ASR models are then stored in storage 1045 at 1080 for future use by the ASR engine 130 (FIG. 1A).
Similarly, to leverage the enriched training data, the NLU model training engine 1040 trains the NLU models 1050 at 1085, using training data 1030 bootstrapped from the initially authored content based on the language variant models 1015. The NLU models 1050, obtained based on the enriched training data 1030, may then be used by the NLU engine 140 (FIG. 1A) to understand the meaning of the user's speech based on the ASR results from the ASR engine 130 (FIG. 1A). The derived NLU models are then stored in storage 1050 at 1090 for future use by the NLU engine 140.
Developing a rich parameterized content set associated with the nodes of a T-AOG thus produces an augmented conversation strategy. This enriched parameterized content set leads to more efficient and effective human-machine dialogs because it can be generated automatically (without human effort) and used to further enhance machine-based speech recognition and understanding. In accordance with the present teachings, in addition to enhancing human-machine conversation by automatically generating augmented conversation strategies, the process can be further enhanced by exploring multimodal contextual information in spoken language understanding.
FIG. 11A depicts an exemplary high-level system diagram for context-aware spoken language understanding based on surrounding knowledge tracked during a conversation, according to an embodiment of the present teachings. In this illustrated embodiment, spoken language understanding includes both automatic speech recognition (which recognizes the spoken words) and natural language understanding (which derives the semantics of the speech based on the spoken words). Traditionally, spoken language understanding can utilize speech context information to understand the semantics of an utterance. Such conventional context information relies on the language context, e.g., the words/phrases spoken before or after. For example, in the sentence "Bob bought a bicycle and he rides it in the wind," the pronoun "he" refers to Bob (the semantics), and this semantic ambiguity is resolved based on the language context.
In human-machine interaction or conversation, different types of contexts can be captured and used to understand the surrounding situation more deeply in order to achieve more appropriate interaction. Such different types of context may include personal characteristics (related to the user), language, vision, environment, events, preferences, and the ambient surroundings observed during the interaction/conversation. In accordance with the present teachings, different types of contexts may be leveraged to aid in understanding the encountered situation. For example, if the user says "what are these?" without language context, it is difficult, if not impossible, to know what "these" refers to unless some additional information is accessible and used to disambiguate. In this case, if the captured visual information reveals that the user is pointing at a pile of fruit on a table, it becomes possible to understand what the user meant by "these things." Different types of context may facilitate both ASR recognition of the spoken words and NLU understanding of the meaning of the recognized sentences/words. To take advantage of different types of contexts, different types of sensors can be deployed at a conversation scene to continuously monitor the surrounding environment, gather relevant sensor data, extract features, estimate features related to different events and activities, determine spatial relationships between objects, and store such knowledge learned via observation. This continuously obtained multimodal context information can then be stored in different databases (e.g., the conversation history database 250, the rich media conversation context database 260, the event-centric knowledge database 270) in the information state 110 and utilized during ASR and/or NLU processing to improve human-machine interaction/conversation.
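The following sketch illustrates, under simplified assumptions, how a deictic reference such as "these" might be resolved from tracked visual context; the function and parameter names (pointing_target, scene_objects) are hypothetical.

```python
from typing import List, Optional

def resolve_deictic_reference(utterance: str,
                              pointing_target: Optional[str],
                              scene_objects: List[str]) -> Optional[str]:
    """Resolve words like 'this'/'these' using the tracked visual context:
    prefer the object the user was observed pointing at."""
    deictic = {"this", "these", "that", "those"}
    if not any(w.strip("?.,!").lower() in deictic for w in utterance.split()):
        return None  # nothing to resolve
    if pointing_target and pointing_target in scene_objects:
        return pointing_target
    return None  # ambiguity remains; fall back on other context sources

# The visual tracker observed the user pointing at fruit on a table.
print(resolve_deictic_reference("What are these?", "fruit",
                                ["table", "fruit", "chair"]))  # -> fruit
```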
As shown in FIG. 11A, spoken language understanding unit 1100 includes ASR engine 130 and NLU engine 140. The ASR engine 130 operates to perform speech recognition to generate text words based on, for example, the vocabulary 1120, the ASR model 1045, and various types of contextual information, including the conversation history 250, the rich media conversation context 260, the surrounding knowledge 270, ..., and the user profile 290. The ASR engine 130 takes audio input (utterances of the user) and performs audio processing to identify the spoken words. In recognizing spoken words, contextual information obtained in different modalities may also be explored to aid recognition. For example, to identify the spoken words, contextual information about the user and his/her surroundings may be relied upon (e.g., whether the user is smiling, looks sad or depressed, is close to a table, is pointing at a blackboard, is sitting on a chair, what color the user's clothing is, etc.). In addition, conventional language context information may also be used to update the dialog history 250 and/or the user profile 290.
The ASR engine 130 may analyze the audio input and determine, for example, the probabilities of different word sequences, which may be modeled by the ASR model 1045. Recognition may be performed by detecting phonemes (or other linguistic units in different languages) and their sequences based on a vocabulary 1120 of the appropriate language. As discussed herein, in some embodiments, the ASR model 1045 may be obtained via machine learning based on training data, which may be textual, acoustic, or both. The present teachings disclose automatically generating enriched training data (via bootstrapping) from the initial content associated with an AOG based on certain language variant models. Training the ASR model 1045 on this enriched training data yields a model that can efficiently support speech recognition over a wider range of content.
When performing ASR, ambiguities may arise, for example, regarding phonemes. Where acoustic information reaches the limit of disambiguation, visual information may be used to aid the determination, for example, via identifying visemes (which correspond to phonemes) based on visual observation of the speaker's mouth movement. Based on the phonemes (or visemes, or both), the ASR engine 130 may then identify the spoken words from the vocabulary 1120, which may specify not only the words but also the phoneme composition of each word. In some cases, different words may have the same pronunciation. During a human-machine conversation, it may be helpful to determine the exact words spoken, and disambiguation in this case can be done using visual information. For example, in Chinese, the pronunciations of "he" and "she" are identical. To identify whether "he" or "she" is meant, visual information can be examined to see whether any visual cue can be used to disambiguate. For example, a speaker may point to a person when referring to "he" or "she" in speech, and such information may be used to estimate who the speaker is referring to.
The ASR engine 130 outputs one or more estimated sequences of words spoken by the speaker, each sequence associated with, for example, a probability representing the likelihood of that recognized word sequence. Such output may be fed to NLU engine 140, where the semantics of the word sequence are estimated. In some embodiments, multiple output sequences (candidates) may be processed by NLU engine 140, and the most likely understanding of the speech may be selected. In other embodiments, the ASR engine 130 may select the single word sequence with the highest probability and send it to the NLU engine 140 for further processing to determine the semantics. The semantics of the word sequence may be determined based on the language model 1130, the NLU model 1050, and multimodal information from different sources, such as the conversation history 250, the conversation context 260, the event knowledge 270, and the speaker's profile 290.
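One way such candidate processing might look is sketched below: ASR hypotheses are rescored with simple additive boosts for terms supported by the multimodal context. This scoring rule is an illustrative stand-in, not the disclosed method.

```python
from typing import Dict, List, Tuple

def rerank_asr_hypotheses(n_best: List[Tuple[str, float]],
                          context_terms: Dict[str, float]) -> List[Tuple[str, float]]:
    """Rescore ASR word-sequence hypotheses with contextual evidence:
    hypotheses mentioning terms supported by the multimodal context
    (e.g., objects seen in the scene) receive a score boost."""
    rescored = []
    for words, asr_prob in n_best:
        boost = sum(w for term, w in context_terms.items() if term in words)
        rescored.append((words, asr_prob + boost))
    return sorted(rescored, key=lambda h: h[1], reverse=True)

# 'board' is visible in the dialog scene, so that hypothesis is preferred.
hypotheses = [("what is on the bird", 0.42), ("what is on the board", 0.40)]
print(rerank_asr_hypotheses(hypotheses, {"board": 0.1}))
```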
In some embodiments of the present teachings, for human-machine conversation, the NLU model 1050 may be trained based on authored content associated with different AOGs. As discussed herein, authored content (text, acoustic, or both) associated with an AOG may be automatically enriched based on, for example, language variant models, and such enriched training data may then be used to train NLU model 1050 to derive a model that can support broader speech content. While traditional natural language understanding (NLU) is based on audio signals, the present teachings provide enhanced natural language understanding based on different types of contextual information obtained in multimodal domains. The goal is to enhance the ability of NLU engine 140 to resolve ambiguities under different circumstances based on relevant multimodal information. For example, if the user says "see what is on this," different resolutions are possible for what "this" means. Conventional use of speech context is not sufficient to resolve this ambiguity. Additional contextual information from other modalities may be explored in accordance with the present teachings to resolve different ambiguities via multimodal context-sensitive language understanding.
As shown in FIG. 11A, the information stored in the conversation history 250, the rich media conversation context 260, the event knowledge 270, and the speaker's profile 290 may be used for language understanding. Continuing the example above, when a user says "see what is on this," in order to determine what "this" refers to, the visual information captured from the dialog scene may be analyzed to identify cues that can be used to disambiguate. For example, the user may be standing in front of an object and pointing at it, e.g., standing beside a table and pointing at a computer on the table. Based on this knowledge learned from the visual data, a representation can be generated in the rich media dialog context (260) that reveals that the user is pointing at a computer on a desk near the user. Using this visual observation, NLU engine 140 can estimate that "this" in the speech refers to the computer on the desk and that the user wants to know what is being displayed on the computer screen.
As another example, when asked what he/she would like to do on this year's birthday, a user might respond "I'd like to do the same thing I did last year." It is not clear from this response what "the same thing I did last year" refers to. There are two points of ambiguity. First, what is the relevant time frame from last year? Second, what did the user do within that estimated time frame? Given the language context, it can be estimated that the time frame is the user's birthday last year. However, from the linguistic context of the current conversation alone, it may not be possible to infer what the user means by "the same thing I did last year." In this case, the rich media dialog context according to some embodiments of the present teachings may enable NLU engine 140 to explore information stored in event knowledge 270 about events relevant to the user within a time frame around his/her last birthday. For example, it may be recorded that the user went to Boston on his/her birthday last year. This information can help NLU engine 140 disambiguate the meaning of "the same thing I did last year." Last year's events may be recorded as events or logs in, for example, the user's profile storage 290. In this way, by leveraging the rich media context, NLU engine 140 can understand that the user means he/she wants to revisit Boston for his/her birthday this year. This understanding can then help the dialog manager 150 (see FIG. 1) determine how to respond to the user in the dialog. Thus, by exploring the rich media context observed over time and in the dialog scene, the spoken language understanding unit 1100 can enhance its spoken language understanding capabilities in both ASR (the recognized words) and NLU (the semantics of the spoken content).
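A minimal sketch of such an event-knowledge lookup, assuming events are tracked as (description, date) pairs; the data values and the window size below are illustrative.

```python
from datetime import date
from typing import List, Tuple

# Event-centric knowledge: (event description, date), as tracked over time.
events: List[Tuple[str, date]] = [
    ("went to Boston", date(2020, 10, 2)),
    ("visited a museum", date(2020, 7, 15)),
]

def events_near(anchor: date, window_days: int = 3) -> List[str]:
    """Return tracked events whose dates fall near the anchor date,
    e.g., around the user's birthday in the previous year."""
    return [desc for desc, d in events
            if abs((d - anchor).days) <= window_days]

# Resolve "the same thing I did last year" against last year's birthday.
last_birthday = date(2020, 10, 2)
print(events_near(last_birthday))  # -> ['went to Boston']
```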
The surrounding knowledge 270 in FIG. 11A may broadly include different aspects of information. FIG. 11B illustrates exemplary types of surrounding knowledge to be tracked to facilitate context-aware spoken language understanding in accordance with embodiments of the present teachings. As shown, the surrounding knowledge may include, but is not limited to, observations of the environment (e.g., whether it is sunny), events that occurred (in past and current conversations), objects observed in the conversation scene and the spatial relationships among them, ..., activities observed (acoustic or visual), and/or statements made by the user. Such observations may be made via multimodal sensors and may then be used to estimate or infer a user's preferences or profile, which may in turn be used by dialog manager 150 to determine a response based on both the inferred user profile and the surrounding situation.
FIG. 12A illustrates an example of tracking a personal profile based on statements made during a conversation, according to an embodiment of the present teachings. As shown, a robotic agent (a talking duck agent) 1210 is conducting a conversation with a user (a child) 1220. During the session, the robotic agent 1210 asks the user 1220 where he was born. The user answers that he was born in Chicago. This speech information is analyzed, and the speech-based information (audio) is tracked and analyzed to extract useful profile information. For example, based on such a conversation, the user's profile is updated with an additional graph (or sub-graph) 1230, which links node 1230-1 representing "me" (the user) with another node 1230-2 representing "Chicago," with link 1230-3 annotated "born in." Although this is a simple example, the user profile may in this way be updated gradually and continuously based on the tracked audio information.
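A toy sketch of such a profile graph update, with the graph held as (subject, relation, object) triples; the class and relation names are illustrative assumptions, not the disclosed representation.

```python
from typing import List, Tuple

class ProfileGraph:
    """A tiny profile graph: nodes linked by annotated edges, updated
    as facts are extracted from the user's utterances."""
    def __init__(self) -> None:
        self.edges: List[Tuple[str, str, str]] = []  # (subject, relation, object)

    def add_fact(self, subject: str, relation: str, obj: str) -> None:
        # Avoid duplicate edges when the same fact is heard again.
        if (subject, relation, obj) not in self.edges:
            self.edges.append((subject, relation, obj))

profile = ProfileGraph()
# Agent: "Where were you born?"  User: "I was born in Chicago."
profile.add_fact("me", "born in", "Chicago")
print(profile.edges)  # [('me', 'born in', 'Chicago')]
```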
FIG. 12B illustrates an example of tracking a personal profile based on visual observation during a conversation, according to an embodiment of the present teachings. As shown, a visual sensor captures a conversation scene 1240 that includes a boy (presumably the user participating in the conversation) and objects/fixtures in the scene (e.g., a small bookshelf, a light on the bookshelf, another light on the floor, a window, and a poster on the wall with "Michael Jordan" written on it). Such visual scenes can be captured and analyzed so that relevant objects can be extracted, spatial relationships inferred, ..., and it can be observed that the boy is gazing at the Michael Jordan poster. By analyzing objects in the scene, knowledge can be extracted and some conclusions can be estimated. For example, based on such a visual representation of the observed scene, an estimated conclusion may be that the boy likes Michael Jordan. Another estimated conclusion may simply be that the boy and the poster co-exist in the dialog scene. Various other factors may also affect the estimation of what the visual observation may mean. For example, if the scene is the boy's own room, the fact that he put such a poster on his wall may support the conclusion that he likely likes Michael Jordan. If the boy is in his friend's room or elsewhere, such as at school, the co-existence conclusion may be weighted higher. In either case, based on visual observation, a graph (or subgraph) can be created representing the relationship between the observed boy and the poster, where node 1250-1 representing the boy is linked with another node 1250-2 representing Michael Jordan by a link 1250-3 representing the relationship between the two. In this illustrated example, the link represents "likes." The link could instead represent a "co-exist" relationship if that relationship's probability were higher. In some embodiments, link 1250-3 may also be annotated with multiple relationships (e.g., "likes" and "co-exist"), each associated with a probability. In this way, the monitored visual information may be used to continuously update the surrounding information. Dialog manager 150 may use such observations to determine how to conduct the conversation. For example, if the boy in the scene is observed looking at the poster and seems distracted from the focus of the conversation (e.g., tutoring on mathematics), the conversation manager 150 may decide to ask the user "Do you like Michael Jordan?" to improve the user's engagement.
Information tracked based on sensor information in different modalities may be combined to create an integrated representation of the surrounding environment. FIG. 12C illustrates an exemplary partial personal profile that is updated during a dialog based on multimodal input information obtained from a dialog scenario according to embodiments of the present teachings. In this example, the integrated graphical representation 1260 is a combination of the graph 1230 in fig. 12A derived based on the tracked audio information and the graph 1250 in fig. 12B derived based on the tracked visual information. There are other types of knowledge that may be tracked during the conversation and used to update the surrounding knowledge 270 and/or the user profile 290. In some embodiments, users may be classified into different groups based on tracked information, and characteristics associated with each group may also be continually established based on tracked information from users in the groups. As will be discussed below, such group characteristics may be used by the dialog manager to adaptively adjust the dialog policy to make the dialog more efficient.
FIG. 12D illustrates an exemplary personal knowledge representation constructed based on conversations, according to embodiments of the present teachings. Across one or more dialog sessions, information obtained from the conversations may be used to continuously build profile knowledge about the user, on which the dialog manager 150 may then rely to direct the dialog. For example, the user's birthday may be established in the user profile. If the user mentions certain trips taken on birthdays in different years, this knowledge can be represented in a graph as shown in FIG. 12D. In this example, on his/her birthday in 2016 the user traveled by boat to Alaska, and on the same date in 2018 the user flew to Las Vegas for his/her birthday. This knowledge may be conveyed explicitly by the user or may be inferred from the conversation by the system; for example, the user may mention the dates of the trips without explicitly indicating that the trips were for celebrating a birthday. Knowing the user's birthday, the system may infer that these trips were for celebrating the user's birthday.
As seen from FIG. 12D, this knowledge may be represented as a graph 1270, where the user is represented as the entity "me" 1270-1, which is linked to different pieces of knowledge about the user. For example, via link "birthday" 1270-2, it points to a birthday representation 1270-3 with the user's specific date "10/2/1988." Based on the two trips mentioned by the user, two destination representations 1270-5 and 1270-7 are provided, where 1270-7 represents Alaska and 1270-5 represents Las Vegas. To connect the user to these two destinations, there are two links 1270-4 and 1270-8, each annotated with the corresponding travel period. Since these travel periods coincide with the user's birthday of 10/2, additional representations may be provided to link the user's birthday with the two trips. To this end, two additional links 1270-6 and 1270-9 may be provided, linking from the birthday representation 1270-3 to the two destination representations 1270-5 and 1270-7. On links 1270-6 and 1270-9, the year of each trip is indicated, with an annotation of the user's computed age at the time the trip was made. With this representation, the user's birthday and events associated with it can be identified, i.e., the user made two trips on his/her birthday, one to Alaska in 2016 and another to Las Vegas in 2018, when the user was 28 and 30 years old, respectively (computed from the birthday and the years of the trips).
The robotic agent may later utilize such tracked knowledge about the user to determine what to say in certain situations. For example, if the robotic agent is talking to the user on a certain day and notices that the day is the user's birthday, the robotic agent may greet the user with "Happy birthday!" Additionally, to further engage the user, the robotic agent may say "You went to Las Vegas for your birthday last year. Where are you going this year?" Such personalized and context-aware conversations may enhance user engagement and improve rapport between the user and the robotic agent.
FIG. 13A illustrates exemplary groups of users formed based on observations about the users to facilitate adaptive dialog management, according to embodiments of the present teachings. The user groups in FIG. 13A are for illustration, not limitation; other groupings may be added and are possible. As shown, users may be categorized into different groups, such as social groups (e.g., Facebook chat or interest groups), ethnic groups (e.g., Asian or Hispanic groups), age groups which may include youth groups (e.g., toddler or teenage groups) and adult groups (e.g., senior and working professional groups), gender groups, ..., and various possible professional groups. Each group of users may share some common characteristics that may be used to build a group profile. Such a group profile, with characteristics shared by group members, may be relevant to conversation planning or control. For example, users in some subgroups (e.g., an Asian group) may share common characteristics, such as an accent when speaking English. Such a common feature of a group of users may be leveraged in conversation planning. For example, in tutoring English, knowing that a user belongs to a particular ethnic group with a well-known accent may help dialog manager 150 plan a tutoring session with measures specifically aimed at the accent-related issues so as to foster correct pronunciation. As another example, if the user is in a teenager group, the conversation manager 150 may explore known popular interests common to teenagers when it wishes to better engage teenage users in conversations.
In addition to group profiles, the dialog system 100 may accumulate knowledge or profiles of individual users based on immediate observations from dialogs or information about each user from other sources. FIGS. 12A-12C illustrate some examples of building aspects of a user profile based on multimodal information collected during a conversation. FIG. 13B illustrates exemplary content/construction of a personalized user profile according to embodiments of the present teachings. The user's personal profile may include the user's demographic information (e.g., ethnicity, age, gender, etc.), known preferences, languages (e.g., native language, second language, etc.), learning arrangements (e.g., tutoring sessions on different topics), ..., and performance (e.g., recorded assessments of performance in different tutoring sessions on different topics). Some of the profile information may be declared (e.g., demographics) and some may be observed or learned via communication. For example, for information related to the user's second language (e.g., English), information regarding his/her proficiency, accent, etc. may be recorded and may be used, for example, by dialog manager 150 to plan or control a dialog. Details relating to some individual traits may also be collected and recorded. One example shown in FIG. 13B is a detailed characterization of the user's accent in a second language. For each second language the user is learning, the accent may be described in terms of both acoustic features (e.g., acoustic characteristics of the user's pronunciation) and visual features (e.g., visual characteristics of the user's mouth movement while speaking). This characterization of a person's accent in a certain language may be utilized in dialog control.
FIG. 13C illustrates an exemplary visualization of accent profile distributions associated with different ethnic groups, according to embodiments of the present teachings. A set of features may be used to characterize an accent. For example, visemes may be used to represent features related to accents, i.e., an accent may be characterized using visual features related to mouth movement. Accents may also be characterized based on acoustic features. A person's accent may be represented by a vector of instantiated values of these features, and such a vector may be projected as a point in a high-dimensional coordinate space. Shown in FIG. 13C is a 3D projection of accent vectors of different ethnic groups in a 3D coordinate system. In this illustration, each accent feature vector of a particular user is projected as a point, and the different circles in FIG. 13C represent the boundaries of the projected accent feature vectors of users from different ethnic groups. Each circle encloses one or more such points, and the variation in accent across members of the same group corresponds to the extent or shape of the circle.
As shown in FIG. 13C, there are four exemplary accent distributions A1 1310, A2 1320, A3 1330, and A4 1340, corresponding to the English-speaking accent characterizations of the Chinese, Japanese, American, and French ethnic groups. Each circle represents the distribution range of the accent profiles of the group's members, which may be derived from the multiple projected points (member accent vectors) representing those profiles. For example, a group accent profile vector may be derived by averaging the member accent vectors in each dimension, or by using, for example, the centroid of each group's distribution as the group profile. As can be seen, each circle in FIG. 13C has a centroid (the center of the circle) representing a group accent profile, projected from, for example, the group accent feature vector.
When a user is engaged in a human-machine conversation, the user can be observed and multimodal information captured from the conversation scene to estimate different types of information to be used for adaptive conversation strategies. Based on the observed multimodal information, the robotic agent can dynamically build and use a user profile to adaptively control how to converse with the user. One exemplary type of information to be estimated is the user's accent. Accent information may be used to improve dialog quality and the effectiveness of tutoring a language. With the estimated accent information, the robotic agent may adjust its language understanding capabilities to better determine the meaning of the user's utterances. If the robotic agent is used to teach the user a certain language (e.g., English), such estimated accent information may be explored to adaptively determine a dynamic teaching plan.
A group profile as shown in FIG. 13C may be used to determine to which group a new user likely belongs. For example, in FIG. 13C, point 1350 may represent the accent vector of a new user. Without knowing the new user's declared ethnicity, the distances between point 1350 and the distributions of the different ethnic groups may be used to estimate the new user's ethnicity. For example, as shown, the distance between point 1350 and the centroid of group A1 is 1360; the distance between point 1350 and the centroid of group A2 is 1370; the distance between point 1350 and the centroid of group A3 is 1380; and the distance between point 1350 and the centroid of group A4 is 1390. In various embodiments, the distance may instead be evaluated between point 1350 and the closest point of each distribution.
Once the distance to each group is determined, the new user may be estimated to belong to the ethnic group with the shortest distance, here distance 1370 between point 1350 and the centroid of that ethnic group. In the example shown, since the projected vector point of new user 1350 is closest to group A2, the new user may be estimated to be a member of group A2 (e.g., the Japanese group). The group profile (centroid) may also be updated if there is high confidence in the estimate of the new user's ethnicity. In some embodiments, the estimated ethnicity and the representative accent profile for that ethnic group (i.e., the centroid of A2) may be used for adaptive coaching plans. An example is shown in FIGS. 17B-17E, which illustrate how a dialog may be conducted adaptively using a user's known accent information (e.g., visemes) to coach a student in learning a language.
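The nearest-centroid assignment described above can be sketched as follows; the centroid values and the new user's vector are made-up illustrations of the geometry in FIG. 13C.

```python
import numpy as np

# Illustrative group accent profiles: each entry is the centroid of the
# projected accent feature vectors of one group's members (3D projection).
centroids = {
    "A1": np.array([0.9, 0.1, 0.3]),
    "A2": np.array([0.2, 0.8, 0.5]),
    "A3": np.array([0.5, 0.5, 0.9]),
    "A4": np.array([0.1, 0.3, 0.7]),
}

def classify_accent(user_vector: np.ndarray) -> str:
    """Assign a new user's accent vector to the group whose centroid
    is nearest in the accent feature space."""
    return min(centroids, key=lambda g: np.linalg.norm(user_vector - centroids[g]))

new_user = np.array([0.25, 0.75, 0.45])  # the point 1350 in the illustration
print(classify_accent(new_user))  # -> 'A2'
```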
FIG. 14A depicts an exemplary high-level diagram of a system for tracking individual speech-related characteristics to update a user profile, according to an embodiment of the present teachings. As discussed herein, a user profile may be dynamically updated to characterize the underlying user based on information observed during a human-machine conversation, and the user profile may then be used to direct the robotic agent to adjust the conversation according to the observed characteristics of the user. Exemplary system diagram 1400 is directed to establishing and updating accent information for a user in a particular language (e.g., English). In this illustrated embodiment, the system 1400 receives audio/visual information captured while the user 1402 speaks, analyzes the received audio/visual information to extract acoustic and visual features (visemes) related to the speech, classifies the features to estimate characteristics of the user's speech, and updates the user profile stored in 290 accordingly.
The system 1400 includes an acoustic feature extractor 1410, a visual feature extractor 1420, an acoustic feature updater 1440, and a feature-based accent group classifier 1450. FIG. 14B is a flowchart of an exemplary process of the system 1400 for tracking individual speech-related characteristics and updating user profiles, according to an embodiment of the present teachings. When the user 1402 is speaking (e.g., in a language the robotic agent is tutoring), audio signals corresponding to the user's speech are acquired along with visual signals capturing the user's mouth movement during speech. When such acquired multimodal information is received by the acoustic and visual feature extractors 1410 and 1420 at 1405, the audio signal is analyzed by the acoustic feature extractor 1410 at 1415 to estimate acoustic features associated with the speech of the user 1402. These acoustic features are identified based on, for example, an acoustic accent profile 1470, which characterizes different accents based on acoustic features identified from the speech signal.
At 1425, the received visual signals are analyzed by the visual feature extractor 1420 to estimate or determine visual features related to the user's mouth movement while speaking. The visual feature extractor 1420 identifies the visual features based on, for example, visual images of the mouth as the user speaks, according to various language-based viseme models. To classify the user's accent into the appropriate accent group (see FIGS. 13A-13C), the feature-based accent group classifier 1450 receives acoustic features from the acoustic feature extractor 1410 and viseme features from the visual feature extractor 1420, and classifies the user's accent into an accent group at 1445 according to the language-based accent group profiles stored in 1460. Once classified, the user profile in 290 associated with the user 1402 is updated accordingly to incorporate the estimated accent classification. This estimated accent information may later be used by, for example, the robotic agent to adaptively plan a dialog, as discussed herein.
In addition to the user's language features, the multimodal information that can be used to facilitate adaptive dialog management (250 in FIG. 2A and 290) includes the activities that occurred, the event(s) that occurred, ..., and so forth. Knowledge about events may assist the robotic agent in improving conversation engagement. An event may refer to something that occurred in the past or was observed in a conversation, whether performed by someone or merely mentioned in the conversation. For example, a user in a conversation scene may walk to an object in the scene, or the user may mention that he will fly to Las Vegas in June to celebrate his birthday.
FIG. 15A provides an exemplary abstract structure representing events, according to embodiments of the present teachings. As shown in FIG. 15A, an event includes an action performed with respect to an object. The action involved in an event can be described by a verb, and the object can be anything at which the action is directed. With respect to a dialog, an action may be performed by an entity in the dialog scene, and an object may be any object in the scene that can be acted upon. For example, as shown in FIG. 15A, an event may be someone walking to an object (such as a blackboard, a desk, ..., or a computer); someone pointing at an object (such as a blackboard, a table, or a computer on a table); ...; or someone lifting an object, etc.
FIGS. 15B-15C illustrate examples of tracking event-centric knowledge as dialog context during a dialog based on observations from the dialog scene, according to embodiments of the present teachings. As discussed herein, knowledge about an activity or event (whether or not it occurs during the conversation) may enable a robotic agent to enhance its performance. Event-centric knowledge may be used to assist the robotic agent in understanding the meaning or intent of speech from a human user. For example, a user having a conversation with a robotic agent may walk to a blackboard, point at it, and ask: "What is this?" This is illustrated in FIG. 15B. In this case, with acoustic information alone, it is often difficult, if not impossible, for the robotic agent to understand what the user is asking about. However, other types of information from the dialog scene may provide useful clues and may be combined with the speech recognition results (the spoken words) to estimate what the user is pointing at.
In accordance with the present teachings, a dialog scene may be tracked and the information analyzed to detect the various objects present in the scene, the spatial relationships between such objects, the action(s) performed by the user, and the effects of such actions on objects in the scene, including dynamic changes in the spatial relationships of objects due to such actions, and so on. For example, as shown in FIG. 15B, it is observed via the visual sensor(s) that the user in the scene walks up to the blackboard, raises his hand, and points at the blackboard while saying "What is this?" Such observations may be used to dynamically construct an event (the user walks to and points at the blackboard in the scene), with an exemplary representation as shown in FIG. 15C, to describe the event visually observed in the conversational scene. In this example, the event includes two actions: one (action 1) is walking to the blackboard, and the other (action 2) is pointing at the blackboard. Each action involved may be annotated with a time T representing when the action was performed (see T1 for action 1 and T2 for action 2). In this manner, the sequence of actions in an event may be associated with a timeline. For example, the user may first walk (action 1 at T1) to the blackboard, and then point at (action 2 at T2) the blackboard while saying "What is this?" The timing of each action may be compared to the timing of the utterance, and this temporal correspondence may also assist in understanding the meaning of the utterance.
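One possible encoding of such a timed event representation is sketched below; the Action/Event classes and field names are illustrative, not the disclosed data structures.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Action:
    verb: str      # e.g., "walk to", "point at"
    target: str    # object at which the action is directed
    time: float    # timestamp T at which the action was observed

@dataclass
class Event:
    actions: List[Action]

    def ordered(self) -> List[Action]:
        """Actions sorted on the timeline, so their sequence can be
        aligned with the timing of the user's utterance."""
        return sorted(self.actions, key=lambda a: a.time)

# The event observed in FIG. 15B: walk to the blackboard, then point at it.
event = Event([Action("point at", "blackboard", time=12.4),
               Action("walk to", "blackboard", time=9.1)])
for a in event.ordered():
    print(f"T={a.time}: {a.verb} {a.target}")
```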
Information gathered across multiple modalities facilitates tracking/learning knowledge about what is happening in the dialog scene, which can be described in a dynamically constructed event representation. This dynamically constructed event knowledge can play an important role in helping the robotic agent estimate (via the NLU) the semantics of the word sequence recognized (by the ASR) when certain ambiguities cannot be effectively resolved by traditional language models and contexts. Taking the example shown in FIGS. 15B and 15C, based on the knowledge that the user walked to the blackboard and pointed at it while saying "What is this?", the robotic agent can at least conclude that the user's "this" refers to the blackboard, whether that means what is on the blackboard or the blackboard itself. This understanding is more useful than what traditional methods can achieve (i.e., not knowing at all what the user's "this" refers to). In some cases, if more information is available, the robotic agent may be able to further narrow down the user's intent. For example, if the user points to a computer screen asking "What is this?" and the robotic agent has just displayed a picture to the user, the robotic agent may determine an appropriate response to the user's question, e.g., the query "Do you mean the picture I just displayed on the computer?" Thus, rich media context information (including visually observed events, actions, etc.) may assist the robotic agent in devising an adaptive dialog strategy to keep the user engaged in the human-machine dialog session.
FIGS. 15D-15E illustrate another example of tracking multimodal information to identify event-centric knowledge to facilitate spoken language understanding in a human-machine conversation, in accordance with an embodiment of the present teachings. In this example, the user in the conversation may say "I'd like to try this." While ASR can process the utterance and recognize the words "I," "like," "try," and "this," the robotic agent cannot determine what the user's "this" refers to without additional non-traditional contextual information. In this example, visual information simultaneously monitored in the conversation scene may be analyzed, and the robotic agent may use events detected while the user speaks to estimate the meaning of the utterance. As shown in FIG. 15D, it may be observed via visual means that the user who spoke the words reaches his hands toward the notebook computer on a desk (action 1), opens the notebook computer (action 2), and then begins typing on the keyboard (action 3). Such event knowledge learned from visual information may be used to construct an event representation as shown in FIG. 15E, where the event is represented as including three actions (reach, open, and type), each action associated with a timing (T1, T2, and T3, respectively) and the object at which it is directed. In this example, the sequence of actions in the event can be identified according to the temporal order associated with the actions. By combining such visual observations with the recognized words "I'd like to try this," the robotic agent can infer that the user's "this" likely refers to something the user would like to do on the notebook computer, and in response, the robotic agent can devise a response strategy, e.g., asking the user "What would you like to try on the computer?", so as to remain relevant to what the user is doing or wants to do, thereby better engaging the user.
As discussed herein, a robotic agent according to embodiments of the present teachings conducts conversations intended to achieve personalized, context-aware human-machine dialog. The dialog is personalized in that a robotic agent according to the present teachings understands and responds (as described below) to a user using user profile information and dynamic personal information updates (examples shown in FIGS. 12A-12C). It is context-aware in that it leverages event knowledge, dynamically estimated from information acquired via different modalities either in real time (examples shown in FIGS. 15A-15E) or previously established, to facilitate the robotic agent's understanding of the conversation context. This personalized and context-aware dialog control enables the robotic agent not only to better understand the user, but also to generate responses that are more relevant, more appealing, and more appropriate for the dynamic scenario.
FIG. 16A depicts an exemplary high-level system diagram for personalized context-aware dialog management (PCADM) according to embodiments of the present teachings. In this embodiment, the information state 110 is at the center of the PCADM and includes rich media context information that is built/updated based on sensor information across different modalities as well as personalized information. As shown, the PCADM includes multimodal information processing components, knowledge tracking/updating components, components for estimating/updating the minds of the different parties, the dialog manager 150, and components responsible for generating deliverable responses (responses determined by the dialog manager 150), all operating in a personalized and context-aware manner based on various information dynamically updated in the information state 110.
In the illustrated embodiment shown in FIG. 16A, the multimodal information processing components include, for example, the SLU engine 1100 (spoken language understanding, including both the ASR engine 130 and NLU engine 140; see FIG. 11), a visual information recognizer 1600, and a self-motion detector 1610. The knowledge updating components may include, for example, a surrounding knowledge tracker 1620, a user profile update engine 1630, and a personalized common sense updater 1670. The components for estimating/updating the minds of the parties include, for example, the agent mind update engine 1660, the shared mind monitoring engine 1640, and the user mind estimation engine 1650. The components for generating a deliverable response include, for example, the NLG engine 160 and the TTS engine 170 (see FIG. 1).
The dialog manager 150 manages the dialog based on the understanding of the user's utterance from the SLU engine 1100, the relevant dialog tree (e.g., the particular T-AOG that governs the underlying dialog), and different types of data from the information state 110 (e.g., user profile, dialog history, rich media context, estimated minds of the different parties, event knowledge, ..., common sense), which enables the dialog manager 150 to determine a response to the user's utterance in a personalized and context-aware manner. As discussed herein, the information state 110 is dynamically established/updated by various components (such as 1620, 1630, 1640, 1650, 1660, and 1670). For example, upon receiving signals related to the user's speech (which may include, for example, audio and visual capture of mouth movements), the SLU engine 1100 performs spoken language understanding. In some embodiments, to understand the user's utterance, information from the information state 110 may be explored during both speech recognition (determining the spoken words) and speech understanding (understanding the meaning of the utterance). For example, if a known accent or viseme profile of a user participating in the conversation is recorded in the information state for that user, such information may be used to identify the spoken words. As discussed with respect to FIGS. 15A-15E, event knowledge about the user observed in the dialog scene may also be used to resolve certain ambiguities. Such event knowledge can be derived by analyzing visual input via the visual information recognizer 1600 and the surrounding knowledge tracker 1620; the event representations can then be stored in the information state 110 and accessed by the SLU engine 1100 to understand the semantics of the user's utterance.
Visual and other types of observations (e.g., tactile input from a user) can also be monitored and analyzed to derive different contextual information, either alone or in combination with audio information. For example, a bird song (sound) and a green tree (visual) may be combined to infer that it is an outdoor scene. The user's self-movement detected via, for example, tactile information, may be combined with visual information to infer changes in the spatial relationship between the user and objects present in the scene. The facial expression of the user may also be identified from the visual information and may be used to estimate the mood or intent of the user. Such estimates may also be stored in the information state 110 and then used, for example, by the user's mind estimation engine 1650 to estimate the user's mind, or by the conversation manager 150 to determine what to do in order to continue attracting the user to participate in the conversation.
As discussed herein, the conversation between the robotic agent and the user may be driven by an AOG (or the agent's mind), which represents a desired topic with a desired conversation flow and specifically authored conversation content. During a conversation, certain conversation paths in the AOG may be identified by the shared mind monitoring engine 1640 based on the recognized user speech, and information related to the shared mind state may be evaluated and used to update the shared mind representation recorded in the information state 110. In estimating the shared mind, the shared mind monitoring engine 1640 may also utilize the rich media context developed based on multimodal sensory input. In addition, based on the estimated shared mind, the user mind estimation engine 1650 can further estimate the mind of the user and then update the user mind recorded in the information state 110.
With the rich media surrounding context, the semantics of the user's utterance from the SLU engine 1100, and the user profile 290, the dialog manager 150 determines a personalized and context-aware response to the user based on the updated information state 110. The response may be determined based on an understanding of what the user said (the semantics), the emotional or mental state of the user, the user's intent, the user's preferences, the dialog strategy specified by the relevant AOG, and the estimated degree of engagement of the user. In addition to determining the content of the response, a robotic agent in accordance with the present teachings may further personalize the response by generating its content in a manner appropriate for the particular user. For example, as discussed with respect to FIGS. 9A-9B, the response may come from a response category that has parameterized content (e.g., responses to an incorrect answer). In view of this, particular content within the parameterized content may be selected for a particular user based on knowledge about the user (e.g., the user's preferences or emotional state). This is accomplished by the NLG engine 160. More details regarding the NLG engine 160 and the TTS engine 170 are provided with reference to FIGS. 16C-16D.
FIG. 16B is a flow diagram of an exemplary process for personalized context-aware dialog management, according to an embodiment of the present teachings. When multimodal input is received at 1605, the different components in FIG. 16A analyze the information in their respective domains at 1615. For example, acoustic signals related to the user's utterance and/or environmental sounds may be received; the SLU engine 1100 may analyze the audio signal to understand what the user says. Visual signals may also be received that capture the user's physical appearance and movement (e.g., mouth and body movement), and the visual signals may be analyzed by the visual information recognizer 1600 to detect, for example, different objects, facial features of the user, body movement, and the like. Such multimodal data analysis results from 1100 and 1600 may then be utilized by other components to derive a higher-level understanding of the surroundings, preferences, and/or mood. For example, the surrounding knowledge tracker 1620 may, at 1625, track dynamic spatial relationships between different objects, evaluate the mood or intent of the user, etc. The tracked surrounding conditions may then be used by the user profile update engine 1630 to evaluate, for example, characteristics of the user (such as observed preferences) and, at 1635, update the user profile in the information state 110 based on the observations and analysis results.
Based on the tracked surrounding information (e.g., the tracked movements, events, and environment of the user), the rich media context stored in the information state 110 can be updated at 1645. The SLU engine 1100 may then utilize the updated user profile and rich media context to perform personalized context-aware spoken language understanding at 1655, including, but not limited to, identifying the words spoken by the user (e.g., based on accent information related to the user) and the semantics of the spoken words (e.g., based on visual or other cues revealed in other modalities). Based on this understanding of the user's utterance, the dialog manager 150 then determines, at 1665, a response deemed appropriate given the context and the known characteristics of the user. For example, the dialog manager 150 may determine that, when the user answers a question incorrectly, a response indicating this should be delivered to the user.
To ensure that such responses are personalized, the NLG engine 160 may, at 1675, select one of multiple alternative responses in the parameterized content (see FIGS. 9A-9B) associated with the node at the designated position to generate a user-specific personalized response. The selection may be made based on personal information stored in the information state 110 (e.g., the user has a sensitive personality, has previously answered a similar question incorrectly, and currently appears frustrated). For example, given that the user is known to be sensitive, easily frustrated, and has made repeated mistakes, the NLG engine 160 may generate a response that is intentionally gentle to avoid further frustrating the user. To deliver the personalized response, the response may also be rendered by the TTS engine 170 in a personalized and context-aware manner. For example, if the user is known to have a southern accent and is currently in a noisy environment (e.g., as specified in the information state 110), the TTS engine 170 may, at 1685, render the response with a southern accent and higher volume. Once the response is delivered to the user, the process returns to step 1605 to handle the next round of dialog in a personalized and context-aware manner.
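The per-turn cycle of FIG. 16B can be summarized in the following highly simplified sketch, in which each helper function is a stand-in for a full engine of FIG. 16A (SLU 1100, DM 150, NLG 160, TTS 170); all function bodies below are illustrative assumptions.

```python
from typing import Any, Dict

def recognize_speech(audio: str, state: Dict[str, Any]) -> str:
    return audio  # assume the audio arrives pre-transcribed in this sketch

def understand(words: str, visual: Dict[str, Any], state: Dict[str, Any]) -> Dict[str, Any]:
    # Combine the words with visual context, e.g., a pointing target.
    return {"words": words, "referent": visual.get("pointing_at")}

def decide_response(meaning: Dict[str, Any], state: Dict[str, Any]) -> str:
    if meaning["referent"]:
        return f"What would you like to try on the {meaning['referent']}?"
    return "Could you tell me more?"

def process_turn(audio: str, visual: Dict[str, Any], state: Dict[str, Any]) -> str:
    """One pass through the turn cycle of FIG. 16B."""
    meaning = understand(recognize_speech(audio, state), visual, state)
    state.setdefault("history", []).append(meaning)  # update information state
    return decide_response(meaning, state)

print(process_turn("I'd like to try this", {"pointing_at": "computer"}, {}))
```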
FIG. 16C depicts an exemplary high-level system diagram of the NLG engine 160 and the TTS engine 170 for producing a context-aware and personalized audio response, according to an embodiment of the present teachings. In this illustration, both the NLG engine 160 and the TTS engine 170 may adjust the response according to the tracked information stored in the information state 110. First, the NLG engine 160 may generate a response from the text response provided by the dialog manager 150, where modifications or adjustments are determined based on information related to the user and the known dialog context. In addition, the TTS engine 170 may further tailor the adapted response in its rendered delivery form in a personalized and context-aware manner.
The NLG engine 160 includes a response initializer 1602, a response enhancer 1608, a response adjuster 1612, and an adaptive response generator 1616. FIG. 16D is a flowchart of an exemplary process of the NLG engine 160, according to an embodiment of the present teachings. In operation, the response initializer 1602 first receives, at 1632, a text response from the dialog manager 150 and then initializes, at 1634, the response by, for example, selecting an appropriate response from the parameterized content associated with a particular node in the dialog policy. For example, assume that a dialog policy for teaching a student the concept of "addition" is used to drive a tutoring dialog with the user, as shown in FIG. 9A. During this dialog, the user is presented with a question about adding X and Y. When the user answers the question correctly, the dialog manager 150 follows the path in the dialog policy in FIG. 9A to node 675-2 and determines that the response is to come from the parameterized content associated with node 675-2. As seen in FIG. 9A, a correct answer from the user in this case may result either because the user understood the question and answered correctly or because of a lucky guess; the dialog manager 150 may use the parameterized content at node 675-2 as a pool from which to select a response. In this case, the parameterized content set (associated with node 675-2) for the current situation is [TRCA]. As shown in FIG. 9B, [TRCA] is defined as the union of [TRC] and [TRG], where set [TRC] is for responding to correct answers based on a correct understanding of the taught content, and set [TRG] is for responding to correct answers provided based on guesses. The determination of whether the user's answer is a guess or an actual correct answer may be based on the probabilities P(Lt+1 | obs = correct) and P(G) estimated for such likelihoods as discussed herein. Based on these probabilities, the response initializer 1602 determines which of the two parameterized content sets to use as the pool from which a response can be further selected. In some embodiments, the response may be selected from the group chosen based on the probabilities as described above. For example, if it is more likely that the correct answer was provided because the user has actually mastered the taught concepts, then the selection pool for the response is [TRC], and a response will be selected from the parameterized content set [TRC].
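The pool selection described above might be sketched as follows, with hypothetical response texts for [TRC] and [TRG] and a simplified comparison standing in for the probability-based determination.

```python
import random
from typing import List

# Parameterized response pools per FIG. 9B: [TRC] responds to correct
# answers based on real understanding; [TRG] responds to lucky guesses.
T_RC: List[str] = ["Great, you really understand addition!",
                   "Exactly right. Let's try a harder one."]
T_RG: List[str] = ["Correct! Can you explain how you got it?"]

def select_response(p_learned_given_correct: float, p_guess: float) -> str:
    """Choose between the [TRC] and [TRG] components of [TRCA]
    according to the estimated probabilities, then sample a response."""
    pool = T_RC if p_learned_given_correct >= p_guess else T_RG
    return random.choice(pool)

# The learner model estimates mastery is more likely than a guess.
print(select_response(p_learned_given_correct=0.85, p_guess=0.15))
```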
In generating an appropriate response, the response may be selected or generated so as to be grammatical (based on the syntactic and semantic models 1604 and 1606), knowledgeable (based on the topic knowledge model 1614 and the common sense model 280), human-like (based on the user profile 290, the agent's mind 200, and the user's mind 220 estimated from the conversation situation), and intelligent (based on topic control according to the STC-AOG 230, the conversation context 260, the conversation history 250, and the event-centric knowledge 270 established based on observations of the relevant conversations). The output of the response initializer 1602 (e.g., a selection from a parameterized content set) can then be sent to the response enhancer 1608, which can further narrow down the selection based on, for example, expertise on the topic (represented by the topic knowledge model 1614) and common sense knowledge (represented by the common sense model 280). The knowledge and common sense models relevant to the topic are retrieved at 1636 and used to enhance response selection/generation at 1638. The topic knowledge model 1614 may include a knowledge graph representing human knowledge of certain topics, which may be used to control the generation of responses. The common sense model 280 may include any representation (such as subject-predicate-object triplets) that models human common sense. These models can be used to ensure that the NLG engine 160 produces sentences that are meaningful and consistent with known facts.
To further enhance response generation, the initially generated response, or a response further selected from a pool of alternatives (e.g., from a parameterized content set for responding to correct answers to, for example, mathematical questions), may then be enhanced and sent to the response adjuster 1612, where the response is adjusted in a context-aware and personalized manner. To do so, the response adjuster can operate based on a number of considerations by accessing, at 1642, the conversation history (250), the conversation context (260), previously acquired event knowledge (270), any user preferences (290), and the estimated minds of the agent and the user (200 and 220), and adjusting the response accordingly at 1646. For example, if the user is known to be shy and sensitive (from previous dialog history), a gentle response to an incorrect answer may be generated or selected (from the existing content set). If there are known relevant past events, such as the user traveling on his/her birthday in past years (see FIG. 12D), then when the conversation occurs just before the user's birthday, the robotic agent may ask the user "Do you plan to travel for your birthday this year?" to better engage the user. As another example, if it is observed (dialog context 260) that the user is holding a favorite toy while talking to the robotic agent, the robotic agent may ask about the toy (e.g., "Do you like your toy?") to make the conversation more interesting and engaging. Responses adapted to user familiarity or preferences may be more appealing to the user.
As discussed herein, during a conversation the robotic agent continually updates its estimates of the agent's mind state 200, the shared mind state 210, and the user's mind state 220 based on what is observed. The response to the user's utterance may also be adjusted based on the estimated minds. For example, if a user correctly answers a number of questions about a mathematical concept (e.g., fractions), the estimate of the user's mind may indicate that the user has mastered the concept. In this case, the robotic agent may choose, among the available alternative responses to the latest correct answer (e.g., "You got it!", "Awesome!", or "Well done!"), one that confirms the user's success and moves the session forward.
The adjusted response from response adjuster 1612 is then sent to adaptive response generator 1616 to generate a context-aware and personalized response at 1648. To ensure that the robotic agent speaks sentences that are grammatically and semantically correct, adaptive response generator 1616 generates the response according to syntactic model 1604 and semantic model 1606. The generated adaptive response is then sent at 1652 to the TTS engine 170 to generate a rendition suited to the user's preferences.
FIG. 16E is a flowchart of an exemplary process for TTS engine 170, according to an embodiment of the present teachings. When adaptive response processor 1618 receives an adaptive response in text form from NLG engine 160 at 1654, it processes the received response at 1656. To deliver the response, the text needs to be rendered into acoustic form via, for example, text-to-speech (TTS) conversion. To generate the audio form of the response (the robotic agent's utterance), the adaptive TTS characteristic analyzer 1622 retrieves relevant information from the user profile 290 at 1658. The relevant information may include, for example, the user's age group and/or ethnic group, the user's known accent, gender, etc., which may indicate a preferred manner of converting the text response into audio (speech) form. For example, based on a known user ethnicity, the text response may be converted into a speech signal exhibiting a representative accent of that ethnic group (e.g., the characteristic acoustic features at the centroid of the distribution of the group's accent profiles).
If it is determined at 1662 that no such preference-related information is available for the user, the adaptive TTS characteristics analyzer 1622 invokes the text-to-speech converter 1624 to convert the adaptive text response into speech form at 1668 based on, e.g., a standard TTS configuration stored in the TTS characteristics configuration 1626. If any preferences exist, the adaptive TTS characteristics analyzer 1622 analyzes the information from the user profile at 1664 to identify the specific preferences, and retrieves the corresponding TTS conversion configuration parameters from the TTS characteristics configuration 1626 at 1666 in order to convert the text response into speech exhibiting the characteristics the user prefers. Based on the retrieved rendering parameters, the text-to-speech converter 1624 converts the received adaptive text response at 1668 into an audio signal in speech form, rendered in the style specified by the user's preferences. The generated audio signal is then sent to the robotic agent for rendering (i.e., responding to the user) at 1672.
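As a concrete illustration of the lookup at 1658-1668, the sketch below maps profile attributes to stored TTS parameters and falls back to a standard configuration when no preference information exists. The table, keys, and parameter names are invented for illustration and are not the patent's TTS characteristics configuration 1626.

```python
# Hedged sketch of preference-driven TTS configuration selection:
# profile attributes index into a stored parameter table; absent a
# match, a standard configuration is used. Values are made up.

TTS_CONFIGS = {
    ("child", "en-US"): {"voice": "bright", "rate": 0.9, "pitch": 2.0},
    ("adult", "en-GB"): {"voice": "neutral", "rate": 1.0, "pitch": 0.0},
}
STANDARD_CONFIG = {"voice": "neutral", "rate": 1.0, "pitch": 0.0}

def select_tts_config(user_profile):
    key = (user_profile.get("age_group"), user_profile.get("accent"))
    return TTS_CONFIGS.get(key, STANDARD_CONFIG)

print(select_tts_config({"age_group": "child", "accent": "en-US"}))
print(select_tts_config({}))  # no preference info -> standard TTS
```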
As disclosed herein, by utilizing adaptively tracked multimodal surrounding information, spoken language understanding in a conversation can be personalized in a context-aware manner; the conversation itself, or the manner in which it is conducted (the back-and-forth communication between machine and human), can be adaptively configured based on the dynamics of the conversation; and the communications can be delivered with a style selected in a personalized and context-sensitive way. For example, a robotic agent tutoring a student in a second language (e.g., English) may conduct a lesson based on what is known about the user. If the student belongs to an ethnic group generally known to have a particular accent profile, tutoring may take this into account, and a lesson plan may be generated for that accent profile specifically to overcome the accent and ensure that the student develops the correct pronunciation.
As discussed herein, responses to a conversational user are designed in a personalized and context-aware manner by tracking, based on multimodal information, different types of information about the surroundings of the conversation. Another consideration in deriving a response is the robotic agent's assessment of the user's performance. Such an assessment may not only guide how the dialog with the user proceeds, but may also be used to adaptively adjust the teaching plan during the dialog. Fig. 17A depicts an exemplary high-level diagram of a machine tutoring system 1700 for adaptive personalized tutoring via dynamic feedback, according to an embodiment of the present teachings. The illustrated machine tutoring system 1700 includes a tutor 1710 supported by a teaching plan execution unit 1770, a communication understanding unit 1720 for understanding what the student user 1702 says, a scorer unit 1730 for evaluating the performance of the student user 1702, and a teaching plan adjustment unit 1750. A teaching plan for a student user may be designed based on knowledge about the user from the user profile 290. For example, if a student is known to belong to an ethnic group that may have a corresponding accent profile with respect to the language to be tutored, the teaching plan for teaching the student that language may be designed based on the known accent profile.
The four elements of system 1700 form a feedback loop, making it possible to continuously adjust the teaching plan during tutoring. After an initial teaching plan 1760 is determined based on the curriculum 1740 and the user profile 290, the communication understanding unit 1720 analyzes the student's communications during tutoring, and the results are sent to the scorer unit 1730 to assess the user's performance. Such an assessment may be made against the expected performance specified in the curriculum 1740. For example, in tutoring a student in English, there may be a series of tutoring sessions, some for pronunciation, some for vocabulary, some for grammar, some for reading, and some for composition. Each tutoring session may target a particular goal in a series of goals related to mastering English at a certain level. For each session, particular content may be covered for the goals to be achieved, so that the scorer unit 1730 can evaluate the user's performance against those specific goals. For example, a session may teach a student to read different words, targeting different phonemes, with correct pronunciation. The acoustic signals of the words spoken by the user, and visual information about mouth movement as the words are spoken, may be recorded; the acoustic characteristics and visemes of the words read by the user may then be analyzed and compared with the standard corresponding acoustic and viseme profiles as part of the evaluation.
The assessment results are then used by the teaching plan adjustment unit 1750 to determine whether the initial teaching plan needs to be adjusted and, if so, how to change it based on the student's performance and the goals of the curriculum 1740. The adjustment may be based on the deviation between the observed acoustic/viseme characteristics and the standard acoustic/viseme profiles. The adjustments are then used to update the teaching plan 1760 so that the revised teaching activity can be carried out by the teaching plan execution unit 1770. At the same time, the observed performance and its assessment may also be sent to the user profile updater 1780 to continually update the user profile 290. The different elements in Fig. 17A may be part of the personalized and context-aware dialog management shown in Figs. 16A-16B. For example, the user profile updater 1780 may correspond to, or be part of, the user profile update engine 1630; the communication understanding unit 1720 may correspond to, or be part of, the SLU engine 1100; and the teaching plan execution unit 1770 may be part of the dialog manager 150. Fig. 17A is intended to illustrate the nature of the feedback in the closed-loop system 1700 for dynamically adjusting the teaching plan during a dialog session.
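The feedback loop of Fig. 17A can be summarized schematically as understand, score, adjust, execute. The sketch below is a toy rendition under that reading; the scoring metric and the adjustment rule (adding repetitions when the score is low) are invented placeholders, not the patent's algorithms.

```python
# Schematic sketch of the closed feedback loop of Fig. 17A:
# understand -> score -> adjust plan -> execute. The scoring and
# adjustment rules are toy placeholders.

def run_tutoring_loop(plan, expected_goal, observations):
    for obs in observations:
        score = 1.0 - abs(expected_goal - obs)     # scorer unit 1730 (toy metric)
        if score < 0.7:                            # plan adjustment unit 1750
            plan = {**plan, "repetitions": plan["repetitions"] + 1}
        print(f"observation={obs:.2f} score={score:.2f} plan={plan}")
    return plan

initial_plan = {"lesson": "phoneme /th/", "repetitions": 3}
# observations: similarity of the user's pronunciation to target (0..1)
run_tutoring_loop(initial_plan, expected_goal=1.0,
                  observations=[0.55, 0.68, 0.81])
```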
One example of adjusting a teaching plan based on a user's profile is a personalized teaching plan for teaching each student the correct pronunciation in a language. Pronunciation can be characterized both acoustically (phonemes) and visually (visemes). As discussed herein with reference to Figs. 13B-14B, accent profiles for different ethnic groups and for individual users may be established from audio and video information. For an individual user, an accent profile may be established based on audio/video information about how the user speaks certain language material. For an ethnic group, the group's accent profile may be derived from the accent profiles of its members. The deviation between a user's accent profile for a language and the representative accent profile of the group of speakers of that language can then be used to design a teaching plan that corrects the user's accent.
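Treating accent profiles as feature vectors, the deviation just described can be computed against the group centroid. The sketch below assumes a Euclidean metric, which the patent does not specify, and uses made-up feature values.

```python
# Minimal sketch: deviation between a user's accent profile and a
# group's representative (centroid) profile, with profiles treated
# as numeric feature vectors. The metric and values are assumptions.

import math

def centroid(profiles):
    n = len(profiles)
    return [sum(v) / n for v in zip(*profiles)]

def deviation(user_profile, group_centroid):
    return math.dist(user_profile, group_centroid)  # Euclidean distance

group_profiles = [[0.8, 0.1, 0.3], [0.7, 0.2, 0.4], [0.9, 0.1, 0.2]]
group_mean = centroid(group_profiles)   # representative accent profile
user = [0.4, 0.5, 0.6]                  # user's accent features

print(f"deviation from group accent: {deviation(user, group_mean):.3f}")
# Large per-dimension gaps could be mapped to teaching steps that
# target the corresponding phonemes.
```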
FIG. 17B illustrates exemplary methods that a robotic agent tutor may use to tutor a student, according to embodiments of the present teachings. As shown, the robotic tutor can be designed to guide a student in learning a language in different ways, much like a human tutor. For example, the tutor may tutor the student via acoustic, visual, or textual means. Acoustically, the tutor can explain to the student the correct and incorrect ways to pronounce a word or phoneme, and can acoustically demonstrate the contrast between the correct and incorrect pronunciations. The tutor can also visually show the student how the lips/mouth move while speaking. In some cases, the tutor can also present the student with an animation of the vocal tract while the sound is produced, so that the student can follow along with the correct pronunciation. The tutor may also provide a text passage that the student can read aloud to practice pronouncing words correctly as explained.
In tutoring based on human-machine interaction, a robotic tutor may rely on dynamic observations (whether auditory or visual) of a student's performance in order to select a particular way of tutoring the student. Such a selection may cover both the material to be taught, based on the student's progress, and the manner in which it is taught. For example, scorer unit 1730 may dynamically assess a student's performance based on observations of different aspects, such as whether the student answers correctly, whether the student's pronunciation carries any accent, and whether the student's visemes match the required visemes. If the evaluation by the scorer unit 1730 reveals that the student's visemes do not conform to those required by the language being taught, the robotic agent may dynamically decide to show the student the correct visemes, or even animations of the visemes or of the vocal tract, in order to correct the student's pronunciation. A minimal sketch of this modality selection appears below.
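The sketch assumes the scorer's evaluation is reduced to a few boolean flags; the flag names and the mapping to media are illustrative only, not the patent's decision logic.

```python
# Sketch of modality choice: pick the tutoring medium from what the
# evaluation flagged. The mapping below is an invented illustration.

def choose_modality(evaluation):
    if evaluation.get("viseme_mismatch"):
        return "visual"      # show correct mouth shape / viseme animation
    if evaluation.get("accent_detected"):
        return "acoustic"    # demonstrate correct vs. incorrect sound
    if evaluation.get("wrong_answer"):
        return "textual"     # give a passage to read and explain
    return "continue"        # no issue flagged; proceed with the plan

print(choose_modality({"viseme_mismatch": True}))  # -> "visual"
```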
Fig. 17C provides exemplary aspects of user performance that the scorer unit 1730 may evaluate during a tutoring session, according to embodiments of the present teachings. As shown, the scorer unit 1730 may be designed to evaluate a student on different aspects of language learning, such as linguistic features (e.g., syntax, i.e., whether the student uses correct syntax, and semantics, i.e., whether the student understands the meaning of words and sentences), pronunciation-related features (e.g., how the student pronounces words and what visemes are observed), the student's reading fluency, the student's overall comprehension, the student's language usage, and various other observations about the student that may be relevant to determining the appropriate way to teach (such as the student's gender, age group, or whether the language taught is his/her first or second language). The teaching plan adjustment unit 1750 can use these observations to decide how to adjust the teaching plan, including both the material to be taught and the manner of teaching (audio, visual, text). Examples relating to teaching English are shown in Figs. 17D-17F.
Fig. 17D provides an exemplary projected accent profile distribution 1330 for American spoken English, together with exemplary acoustic waveforms for different phonemes at the centroid of the distribution 1330 and the visemes corresponding to those phonemes. As shown in Fig. 17D, on the left is the distribution 1330 of accent profiles for a set of American speakers projected into a coordinate system, and on the right are the corresponding acoustic waveforms for different phonemes in American English, together with their corresponding visemes, derived from the centroid (e.g., the mean) of the distribution 1330. In teaching students American English, the goal is for each student to reach an accent profile that preferably falls within the distribution 1330. Accordingly, the teaching plan is to help students, when speaking English, achieve acoustic characteristics similar to the illustrated waveforms and the corresponding mouth shapes/movements shown for the different phonemes. Such assistance may be provided via sound, i.e., the robotic agent speaks a phoneme or word and asks the student to mimic the sound. Additionally or alternatively, the robotic agent may visually show the student the shape and movement of the mouth when a phoneme or word is read out. Both the acoustic and the visual means deployed to teach students correct pronunciation may be based on a standard accent profile of the underlying language, e.g., the accent profile at the centroid (mean pronunciation) of the distribution 1330 for American English.
FIG. 17E illustrates an example of the deviation between the accent profile of one ethnic group (A4) and the accent profile of the language to be taught (A3), according to an embodiment of the present teachings. On the left side of Fig. 17E there are two distributions: one is the distribution 1330 associated with the standard American English accent profile A3, and the other is the accent profile A4 1340 of French speakers speaking American English. On the right, exemplary visemes from the two groups are shown: two exemplary visemes 1780-1 and 1790-1 from the standard accent profile corresponding to the centroid of group A3, and two exemplary corresponding visemes 1780-2 and 1790-2 observed from a user of ethnic group A4 1340. As can be seen, there are observable differences between 1780-1 and 1780-2 and between 1790-1 and 1790-2. A difference between the user's visemes and the visemes of the standard accent profile of the language being learned may indicate incorrect pronunciation. A teaching plan can therefore be designed to incorporate measures/steps that correct the accent by teaching the student how to control the mouth shape so as to pronounce correctly. In some embodiments, differences in the speech signals of individual phonemes can also be used to determine whether spoken-language correction is needed and, if so, what to incorporate into the teaching plan to bring it about. Such observed differences in phonemes or visemes thus provide a basis for developing an adaptive teaching plan.
FIG. 17F illustrates an example of tutoring material incorporated into a teaching plan adaptively developed based on the user's viseme characteristics observed relative to the standard visemes of the underlying spoken language, according to an embodiment of the present teachings. In this example, tutoring interface 1790-3 was developed based on the deviation pair 1795 between the standard viseme 1790-1 and the viseme 1790-2 observed from the user. Interface 1790-3 lets the user first see his/her own viseme and prompts the user to, for example, click a "create correct shape" button to display the correct viseme (mouth shape) for the phoneme. In some embodiments, the robotic agent may also provide accompanying verbal instructions to guide the student in correctly pronouncing the phoneme while viewing the standard viseme. Such instructions may also be adaptively created based on, for example, the difference between 1790-1 and 1790-2. For example, if the user's viseme appears too wide rather than rounded as in the standard viseme, the instructions may be designed to tell the user that his/her mouth needs to be more rounded.
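The instruction-generation example above (too wide versus rounded) can be sketched by reducing mouth geometry to a width/height aspect ratio. This reduction, the tolerance, and the instruction strings are assumptions for illustration, not the patent's method.

```python
# Illustrative sketch: deriving a corrective instruction from the
# difference between the user's viseme and the standard viseme,
# with mouth geometry reduced to a single aspect ratio.

def viseme_instruction(user_aspect, standard_aspect, tolerance=0.15):
    # aspect = mouth width / mouth height for the target phoneme
    diff = user_aspect - standard_aspect
    if abs(diff) <= tolerance:
        return "Good mouth shape, keep going."
    if diff > 0:
        return "Your mouth is too wide; round it more for this sound."
    return "Open your mouth a little wider for this sound."

print(viseme_instruction(user_aspect=2.1, standard_aspect=1.4))
```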
FIG. 17G is a flowchart of an exemplary process for adaptively creating a personalized teaching plan via dynamic information tracking and performance feedback, according to an embodiment of the present teachings. A user's multimodal input is first received at 1705. To evaluate the user's performance, the current teaching plan, developed based on a curriculum, is accessed at 1715. Based on the user's input and the teaching plan, the user's performance is evaluated at 1725 against the expected goals of the curriculum, and differences between the user's performance and those goals are identified at 1735. Such differences may be identified in different modalities (e.g., in acoustic and visual features) and may then be used at 1745 to modify the current teaching plan into an adapted one. In some embodiments, observations made of the user and the evaluation of the user's performance may be used to update the user profile at 1755. The robotic agent then continues the conversation with the user at 1765 according to the modified teaching plan.
Figure 18 is an illustrative diagram of an exemplary mobile device architecture that may be used to implement a dedicated system embodying the present teachings, in accordance with various embodiments. In this example, the user device on which the present teachings are implemented corresponds to a mobile device 1800, including but not limited to a smartphone, a tablet, a music player, a handheld game console, a Global Positioning System (GPS) receiver, and a wearable computing device (e.g., glasses, a wristwatch, etc.), or a device in any other form factor. The mobile device 1800 may include one or more central processing units ("CPU") 1840, one or more graphics processing units ("GPU") 1830, a display 1820, memory 1860, a communication platform 1810 such as a wireless communication module, storage 1890, and one or more input/output (I/O) devices 1840. Any other suitable components, including but not limited to a system bus or a controller (not shown), may also be included in the mobile device 1800. As shown in Fig. 18, a mobile operating system 1870 (e.g., iOS, Android, Windows Phone, etc.) and one or more applications 1880 may be loaded into memory 1860 from storage 1890 and executed by CPU 1840. The applications 1880 may include a browser or any other suitable mobile application for managing a conversation system on mobile device 1800. User interaction may be received via the I/O devices 1840 and provided to the automated conversation partner via the network.
To implement the various modules, units, and their functions described in this disclosure, a computer hardware platform may be used as the hardware platform(s) for one or more of the elements described herein. The hardware elements, operating systems, and programming languages of such computers are conventional in nature, and it is assumed that those skilled in the art are sufficiently familiar with these techniques to adapt those techniques to the appropriate settings as described herein. A computer with user interface elements may be used to implement a Personal Computer (PC) or other type of workstation or terminal device, but if suitably programmed, the computer may also act as a server. It is believed that those skilled in the art are familiar with the structure, programming, and general operation of such computer equipment, and that the drawings should be self-explanatory.
FIG. 19 is an illustrative diagram of an exemplary computing device architecture that can be used to implement a special purpose system embodying the present teachings in accordance with various embodiments. Such a dedicated system incorporating the present teachings has a functional block diagram illustration of a hardware platform that includes user interface elements. The computer may be a general purpose computer or a special purpose computer. Both of which may be used to implement a specialized system of the present teachings. This computer 1900 may be used to implement any of the components of a conversation or dialog management system, as described herein. For example, the conversation management system can be implemented on a computer such as computer 1900 via its hardware, software programs, firmware, or a combination thereof. While only one such computer is shown for convenience, the computer functions associated with the conversation management system described herein may be implemented in a distributed manner across a plurality of similar platforms to distribute processing load.
For example, computer 1900 includes COM ports 1950 connected to and from a network to facilitate data communications. Computer 1900 also includes a central processing unit (CPU) 1920, in the form of one or more processors, for executing program instructions. The exemplary computer platform includes an internal communication bus 1910 and various forms of program and data storage (e.g., disk 1970, read only memory (ROM) 1930, or random access memory (RAM) 1940) for data to be processed and/or communicated by computer 1900, as well as program instructions to be executed by CPU 1920. Computer 1900 also includes I/O components 1960 supporting input/output flows between the computer and other components therein, such as user interface elements 1980. Computer 1900 may also receive programming and data via network communications.
Thus, as outlined above, aspects of the methods of dialog management and/or other processes may be embodied in programming. Program aspects of the technology may be thought of as "products" or "articles of manufacture", typically in the form of executable code and/or associated data carried on or embodied in a machine-readable medium. Tangible, non-transitory "storage"-type media include any or all of the memory or other storage for computers, processors, or the like, or associated modules thereof, such as various semiconductor memories, tape drives, disk drives, and the like, which may provide storage at any time for the software programming.
All or part of the software may sometimes be transmitted over a network such as the internet or various other telecommunications networks. For example, such communication may enable software to be loaded from one computer or processor into another computer or processor, e.g., in connection with conversation management. Thus, another type of media which may carry software elements includes optical, electrical, and electromagnetic waves, such as those used across physical interfaces between local devices, through wired and optical land-line networks, and through various air links. The physical elements that carry such waves (such as wired or wireless links, optical links, etc.) can also be considered to be media that carry software. As used herein, unless limited to a tangible "storage" medium, terms such as a computer or machine "readable medium" refer to any medium that participates in providing instructions to a processor for execution.
A machine-readable medium may therefore take many forms, including but not limited to a tangible storage medium, a carrier wave medium, or a physical transmission medium. Non-volatile storage media include, for example, optical or magnetic disks, such as any of the storage devices in any computer or the like, which may be used to implement the system or any of its components shown in the figures. Volatile storage media include dynamic memory, such as the main memory of such a computer platform. Tangible transmission media include coaxial cables, copper wire, and fiber optics, including the wires that form a bus within a computer system. Carrier-wave transmission media can take the form of electrical or electromagnetic signals, or acoustic or light waves such as those generated during radio frequency (RF) and infrared (IR) data communications. Common forms of computer-readable media therefore include, for example: a floppy disk, a flexible disk, a hard disk, magnetic tape, any other magnetic medium, a CD-ROM, a DVD or DVD-ROM, any other optical medium, punch cards, paper tape, any other physical storage medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave transporting data or instructions, cables or links transporting such a carrier wave, or any other medium from which a computer can read programming code and/or data. Many of these forms of computer-readable media may be involved in carrying one or more sequences of one or more instructions to a physical processor for execution.
Those skilled in the art will recognize that the present teachings are amenable to various modifications and/or enhancements. For example, although the implementation of the various components described above may be embodied in a hardware device, it may also be implemented as a software-only solution, e.g., an installation on an existing server. In addition, the techniques disclosed herein may be implemented as firmware, a firmware/software combination, a firmware/hardware combination, or a hardware/firmware/software combination.
While what is considered to constitute the present teachings and/or other examples has been described above, it should be understood that various modifications may be made thereto and that the subject matter disclosed herein may be implemented in various forms and examples, and that the teachings may be applied in numerous applications, only some of which have been described herein. The following claims are intended to claim any and all applications, modifications and variations that fall within the true scope of the present teachings.

Claims (20)

1. A method implemented on at least one machine comprising at least one processor, memory, and a communication platform connectable to a network for human-machine conversation, the method comprising the steps of:
receiving an utterance from a user participating in the human-machine conversation about a topic in a dialog scene;
obtaining multimodal surrounding information related to the human-machine conversation;
analyzing the multimodal surrounding information to track a multimodal context of the human-machine conversation; and
personalizing, based on the tracked multimodal context, spoken language understanding of the utterance in a context-aware manner to determine semantics of the utterance.
2. The method of claim 1, wherein the multimodal surrounding information includes acoustic and visual information.
3. The method of claim 1, wherein the multimodal context comprises at least one of:
objects and their spatial relationships in the dialog scene;
at least one event related to the user that occurred in the past and/or was observed during the conversation;
one or more acoustic/visual activities observed in the dialog scene;
information relating to previously recorded known characteristics of the user and/or characteristics of the user observed in the conversation; and
common sense knowledge.
4. The method of claim 3, wherein the user's characteristics observed in the conversation include at least one of:
an observation of the user with respect to at least one of a behavior, an expression, and an action; and
inferences drawn based on the observations about the mood and/or intent of the user.
5. The method of claim 1, wherein personalizing spoken language understanding of the utterance comprises:
identifying, by automatic speech recognition (ASR), individual words spoken in the utterance, wherein the ASR disambiguates the words spoken by the user based on the user's characteristics represented in the multimodal context; and
determining semantics of the utterance by Natural Language Understanding (NLU), wherein the NLU determines the semantics based on acoustic/visual activity observed in the dialog scene and represented in the multimodal context.
6. The method of claim 1, further comprising:
determining a response to the utterance based on a dialog policy governing the human-machine dialog on the topic;
generating a personalized text response based on the response in accordance with the multimodal context; and
generating, based on the multimodal context, a personalized acoustic response corresponding to the personalized text response in a context-aware manner.
7. The method of claim 6, wherein:
the personalized textual response is selected from a parameterized set of content associated with the response; and
the personalized text response is selected:
based on characteristics of a user represented in the multimodal context, the characteristics of the user being previously known and/or currently observed in the dialog scene, and
in a context-aware manner based on context information represented in the multimodal context.
8. The method of claim 6, wherein the personalized acoustic response is rendered in a personalized and context-aware manner with respect to a user according to contextual information represented in the multimodal context.
9. The method of claim 1, further comprising, when the conversation corresponds to a tutoring session on the topic in accordance with a current tutoring plan,
evaluating, based on the semantics of the utterance, performance of the user with respect to one or more aspects of the current tutorial plan;
adjusting the current tutoring plan based on the performance in a context-aware manner to generate a personalized tutoring plan in accordance with the multimodal context; and
applying the personalized coaching plan in the conversation to continue the coaching session on the topic.
10. The method of claim 7, further comprising: updating the dialog policy based on the tracked multimodal context such that the dialog policy is personalized and context-aware.
11. A system for conducting a human-machine conversation, comprising:
a surrounding knowledge tracker configured for:
obtaining multimodal surrounding information related to the human-machine conversation, and
analyzing the multimodal surrounding information to track a multimodal context of the human-machine conversation; and
a spoken language understanding engine configured for:
receiving an utterance from a user participating in the human-machine conversation about a topic in a dialog scene, and
personalizing, based on the tracked multimodal context, spoken language understanding (SLU) of the utterance in a context-aware manner to determine semantics of the utterance.
12. The system of claim 11, wherein the multimodal surrounding information includes acoustic and visual information.
13. The system of claim 11, wherein the multimodal context comprises at least one of:
objects and their spatial relationships in the dialog scene;
at least one event related to the user that occurred in the past and/or was observed during the conversation;
one or more acoustic/visual activities observed in the dialog scene;
information relating to previously recorded known characteristics of the user and/or characteristics of the user observed in the conversation; and
common sense knowledge.
14. The system of claim 13, wherein the user's characteristics observed in the conversation include at least one of:
an observation of the user with respect to at least one of a behavior, an expression, and an action; and
inferences drawn based on the observations about the mood and/or intent of the user.
15. The system of claim 11, wherein the spoken language understanding engine comprises:
an automatic speech recognition (ASR) engine configured to recognize individual words spoken in the utterance, wherein the ASR engine disambiguates the words spoken by the user based on characteristics of the user represented in the multimodal context; and
a Natural Language Understanding (NLU) engine configured to determine semantics of the utterance, wherein the NLU engine determines the semantics based on acoustic/visual activity observed in the dialog scene and represented in the multimodal context.
16. The system of claim 11, further comprising:
a dialog manager configured to determine a response to the utterance based on a dialog policy governing the human-machine dialog on the topic;
a Natural Language Generation (NLG) engine configured to generate a personalized text response based on the response according to the multimodal context; and
a text-to-speech (TTS) engine configured to generate a personalized acoustic response corresponding to the personalized text response in a context-aware manner based on the multimodal context.
17. The system of claim 16, wherein:
the personalized textual response is selected from a parameterized set of content associated with the response; and is
the selection is made:
based on characteristics of a user represented in the multimodal context, the characteristics of the user being previously known and/or currently observed in the dialog scene, and
in a context-aware manner based on context information represented in the multimodal context.
18. The system of claim 16, wherein the personalized acoustic response is rendered in a personalized and context-aware manner with respect to a user according to contextual information represented in the multimodal context.
19. The system of claim 11, when the conversation corresponds to a tutoring session on the topic in accordance with a current tutoring plan, further comprising:
a scorer unit configured to evaluate performance of the user with respect to one or more aspects of the current tutoring plan based on the semantics of the utterance;
a tutorial plan adjustment unit configured to adjust the current tutorial plan based on the performance in a context-aware manner in accordance with the multimodal context to generate a personalized tutorial plan; and
a tutorial plan execution unit configured to apply the personalized tutorial plan in the dialog to continue the tutorial session on the topic.
20. The system of claim 17, further comprising an agent mood update engine configured to update the dialog policy based on the tracked multimodal context such that the dialog policy is personalized and context-aware.
CN202080054000.3A 2019-06-17 2020-06-09 System and method for personalized and multi-modal context-aware human-machine dialog Pending CN114270337A (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US201962862296P 2019-06-17 2019-06-17
US62/862,296 2019-06-17
PCT/US2020/036748 WO2020256993A1 (en) 2019-06-17 2020-06-09 System and method for personalized and multimodal context aware human machine dialogue

Publications (1)

Publication Number Publication Date
CN114270337A true CN114270337A (en) 2022-04-01

Family

ID=74037544

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202080054000.3A Pending CN114270337A (en) 2019-06-17 2020-06-09 System and method for personalized and multi-modal context-aware human-machine dialog

Country Status (2)

Country Link
CN (1) CN114270337A (en)
WO (1) WO2020256993A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112989822B (en) * 2021-04-16 2021-08-27 北京世纪好未来教育科技有限公司 Method, device, electronic equipment and storage medium for recognizing sentence categories in conversation

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2011005973A2 (en) * 2009-07-08 2011-01-13 The University Of Memphis Research Foundation Methods and computer-program products for teaching a topic to a user
US11397462B2 (en) * 2012-09-28 2022-07-26 Sri International Real-time human-machine collaboration using big data driven augmented reality technologies
US10482184B2 (en) * 2015-03-08 2019-11-19 Google Llc Context-based natural language processing

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024036899A1 (en) * 2022-08-16 2024-02-22 北京百度网讯科技有限公司 Information interaction method and apparatus, device and medium
CN115860013A (en) * 2023-03-03 2023-03-28 深圳市人马互动科技有限公司 Method, device, system, equipment and medium for processing conversation message
CN115860013B (en) * 2023-03-03 2023-06-02 深圳市人马互动科技有限公司 Dialogue message processing method, device, system, equipment and medium
CN117076957A (en) * 2023-10-16 2023-11-17 湖南智警公共安全技术研究院有限公司 Personnel identity association method and system based on multi-mode information

Also Published As

Publication number Publication date
WO2020256993A1 (en) 2020-12-24

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20231128

Address after: 16th Floor, No. 37 Jinlong Road, Nansha District, Guangzhou City, Guangdong Province

Applicant after: DMAI (GUANGZHOU) Co.,Ltd.

Address before: California, USA

Applicant before: De Mai Co.,Ltd.

TA01 Transfer of patent application right