WO2020256993A1 - System and method for personalized, multimodal, context-aware human-machine dialogue

Info

Publication number: WO2020256993A1
Authority: WO (WIPO PCT)
Prior art keywords: user, dialogue, multimodal, context, information
Application number: PCT/US2020/036748
Other languages: English (en)
Inventor: Wanyi ZHANG
Original Assignee: DMAI, Inc.
Application filed by DMAI, Inc.
Priority to CN202080054000.3A (published as CN114270337A)
Publication of WO2020256993A1 (French)

Description

  • U.S. Provisional Patent Application 62/862264, filed June 17, 2019 (Attorney Docket No. 047437-0503578); U.S. Provisional Patent Application 62/862265, filed June 17, 2019 (Attorney Docket No. 047437-0503581); U.S. Provisional Patent Application 62/862273, filed June 17, 2019 (Attorney Docket No. 047437-0503579); U.S. Provisional Patent Application 62/862275, filed June 17, 2019 (Attorney Docket No. 047437-0503580); U.S. Provisional Patent Application 62/862279, filed June 17, 2019 (Attorney Docket No. ...)
  • the present teaching generally relates to computers. More specifically, the present teaching relates to human-machine dialogue management.
  • an automated dialogue system may need to achieve different levels of understanding: what the human said linguistically, the semantic meaning of what was said, sometimes the emotional state of the human, and the mutual causal effect between what is said and the surroundings of the conversation environment.
  • traditional computer-aided dialogue systems are not adequate to address such issues.
  • the teachings disclosed herein relate to methods, systems, and programming for human-machine dialogue. More particularly, the present teaching relates to methods, systems, and programming for personalized, multimodal, context-aware dialogue management.
  • a method, implemented on a machine having at least one processor, storage, and a communication platform capable of connecting to a network, is provided for human machine dialogue. An utterance is received from a user engaged in the human machine dialogue on a topic in a dialogue scene. Multimodal surround information related to the human machine dialogue is obtained and analyzed to track multimodal context of the human machine dialogue. The operation for spoken language understanding of the utterance is conducted, in a context aware manner based on the tracked multimodal context, to determine semantics of the utterance.
  • a system for human machine dialogue includes various functional modules for performing the method for human machine dialogues.
  • a human machine dialogue system comprises a surround knowledge tracker and a spoken language understanding (SLU) engine.
  • the surround knowledge tracker is configured for obtaining multimodal surround information related to the human machine dialogue and analyzing the multimodal surround information to track multimodal context of the human machine dialogue.
  • the SLU engine is configured for receiving an utterance from a user engaged in the human machine dialogue on a topic in a dialogue scene and personalizing spoken language understanding (SLU) of the utterance, in a context aware manner based on the tracked multimodal context, to determine semantics of the utterance.
  • a software product, in accord with this concept, includes at least one machine-readable non-transitory medium and information carried by the medium.
  • the information carried by the medium may be executable program code data, parameters in association with the executable program code, and/or information related to a user, a request, content, or other additional information.
  • a machine-readable, non-transitory and tangible medium having data recorded thereon for human machine dialogues, wherein the medium, when read by the machine, causes the machine to perform a series of steps to carry out a method for human machine dialogue.
  • An utterance is received from a user engaged in the human machine dialogue on a topic in a dialogue scene.
  • Multimodal surround information related to the human machine dialogue is obtained and analyzed to track multimodal context of the human machine dialogue.
  • the operation for spoken language understanding of the utterance is conducted, in a context aware manner based on the tracked multimodal context, to determine semantics of the utterance.
  • FIG. 1A depicts an exemplary configuration of a dialogue system centered around an information state capturing dynamic information observed during a dialogue, in accordance with an embodiment of the present teaching
  • Fig. 1B is a flowchart of an exemplary process of a dialogue system using an information state capturing dynamic information observed during a dialogue, in accordance with an embodiment of the present teaching
  • FIG. 2A depicts an exemplary construction of an information state, in accordance with an embodiment of the present teaching
  • Fig. 2B illustrates how representations of estimated different mindsets are connected in a dialogue with a robot tutor teaching a user adding fractions, in accordance with an embodiment of the present teaching
  • Fig. 2C shows an exemplary relationship among estimated representations of an agent's mindset, a shared mindset, and a user's mindset in an information state, in accordance with an embodiment of the present teaching
  • Fig. 3A shows exemplary relationships among different types of And-Or-Graphs used to represent estimated mindsets of parties involved in a dialogue, in accordance with an embodiment of the present teaching
  • Fig. 3B depicts exemplary associations between spatial AOGs (S-AOGs) and temporal AOGs (T-AOGs) in an information state, in accordance with an embodiment of the present teaching
  • Fig. 3C illustrates an exemplary S-AOG and its associated T-AOGs, in accordance with an embodiment of the present teaching
  • Fig. 3D illustrates exemplary relationships among an S-AOG, a T-AOG, and a C-AOG, in accordance with an embodiment of the present teaching
  • Fig. 4A illustrates an exemplary S-AOG representing partially an agent’s mindset for teaching different mathematical concepts, in accordance with an embodiment of the present teaching
  • Fig. 4B illustrates an exemplary T-AOG representing a dialogue policy associated partially with an agent’s mindset to teach the concept of fraction, in accordance with an embodiment of the present teaching
  • Fig. 4C shows exemplary dialogue content for teaching a concept associated with fraction, in accordance with an embodiment of the present teaching
  • Fig. 5A illustrates an exemplary temporal parsed graph (T-PG) within a T-AOG representing a shared mindset between a user and a machine, in accordance with an embodiment of the present teaching
  • Fig. 5B illustrates a part of a dialogue between a machine and a human along a dialogue path representing a present representation of a shared mindset, in accordance with an embodiment of the present teaching
  • Fig. 5C depicts an exemplary S-AOG with nodes parameterized with measures related to levels of mastery of different underlying concepts to represent a user's mindset, in accordance with an embodiment of the present teaching
  • Fig. 5D shows exemplary types of personality traits of a user that can be estimated based on observations from a dialogue, in accordance with an embodiment of the present teaching
  • Fig. 6A depicts a generic S-AOG for a tutoring dialogue, in accordance with an embodiment of the present teaching
  • Fig. 6B depicts a specific T-AOG for a dialogue on greeting, in accordance with an embodiment of the present teaching
  • Fig. 6C shows different types of parameterization alternatives for different types of AOGs, in accordance with an embodiment of the present teaching
  • Fig. 6D illustrates an S-AOG with different nodes parameterized with rewards updated based on dynamic observations from a dialogue, in accordance with an embodiment of the present teaching
  • Fig. 6E illustrates an exemplary T-AOG generated by consolidating different graphs via graph matching with parameterized content, in accordance with an embodiment of the present teaching
  • Fig. 6F illustrates an exemplary T-AOG with parameterized content associated with nodes, in accordance with an embodiment of the present teaching
  • Fig. 6G illustrates an exemplary T-AOG with different paths traversing different nodes parameterized with rewards updated based on dynamic observations from a dialogue, in accordance with an embodiment of the present teaching
  • Fig. 7A depicts a high level system diagram of a knowledge tracing unit, in accordance with an embodiment of the present teaching
  • Fig. 7B illustrates how knowledge tracing enables adaptive dialogue management, in accordance with an embodiment of the present teaching
  • Fig. 7C is a flowchart of an exemplary process of a knowledge tracing unit, in accordance with an embodiment of the present teaching
  • Fig. 8A shows an example of utility-driven tutoring (node) planning with respect to S-AOGs, in accordance with an embodiment of the present teaching
  • Fig. 8B illustrates an example of utility-driven path planning with respect to T-AOGs, in accordance with an embodiment of the present teaching
  • Fig. 8C illustrates a dynamic state in utility-driven adaptive dialogue management derived based on parameterized AOGs, in accordance with an embodiment of the present teaching
  • Figs. 9A - 9B show a scheme of enhancing spoken language understanding in human machine dialogues by automated enrichment of AOG parameterized content, in accordance with an embodiment of the present teaching
  • Fig. 9C illustrates exemplary ways to generate enriched training data for parameterized AOG content, in accordance with an embodiment of the present teaching
  • Fig. 10A depicts an exemplary high level system diagram for enhancing ASR/NLU by training models based on automatically generated enriched AOG content, in accordance with an embodiment of the present teaching
  • FIG. 10B is a flowchart of an exemplary process for enhancing ASR/NLU by training models based on automatically generated enriched AOG content, in accordance with an embodiment of the present teaching
  • FIG. 11 A depicts an exemplary high level system diagram for context aware spoken language understanding based on surround knowledge tracked during a dialogue, in accordance with an embodiment of the present teaching
  • Fig. 11B illustrates exemplary types of surround knowledge to be tracked to facilitate context aware spoken language understanding, in accordance with an embodiment of the present teaching
  • Fig. 12A illustrates tracking a personal profile based on conversations occurring in a dialogue, in accordance with an embodiment of the present teaching
  • Fig. 12B illustrates tracking personal profile during a dialogue based on visual observations, in accordance with an embodiment of the present teaching
  • Fig. 12C shows an exemplary partial personal profile tracked during a dialogue based on multimodal input information acquired from the dialogue scene, in accordance with an embodiment of the present teaching
  • Fig. 12D shows an exemplary event knowledge representation constructed based on a conversation, in accordance with an embodiment of the present teaching
  • Fig. 13A illustrates characteristic groups to classify a user for adaptive dialogue planning, in accordance with an embodiment of the present teaching
  • Fig. 13B illustrates exemplary content/construct of an individualized user profile, in accordance with an embodiment of the present teaching
  • Fig. 13C shows an example of establishing group characteristics of users and use thereof to facilitate adaptive dialogue planning, in accordance with an embodiment of the present teaching
  • Fig. 14A depicts an exemplary high level system diagram for tracking individual speech related characteristics and user profiles thereof to facilitate adaptive dialogue planning, in accordance with an embodiment of the present teaching
  • Fig. 14B is a flowchart of an exemplary process for tracking individual speech related characteristics and user profiles thereof to facilitate adaptive dialogue planning, in accordance with an embodiment of the present teaching;
  • Fig. 15A provides an exemplary structure for constructing representation of events observed during dialogues, in accordance with an embodiment of the present teaching;
  • Figs. 15B - 15C show an example of tracking event-centric knowledge as dialogue context based on observations from a dialogue scene, in accordance with an embodiment of the present teaching
  • Figs. 15D - 15E show another example of tracking event-centric knowledge as dialogue context based on observations from a dialogue scene, in accordance with an embodiment of the present teaching
  • FIG. 16A depicts an exemplary high level system diagram for personalized context aware dialogue management, in accordance with an embodiment of the present teaching
  • FIG. 16B is a flowchart of an exemplary process for personalized context aware dialogue management, in accordance with an embodiment of the present teaching
  • Fig. 16C depicts exemplary high level system diagrams of an NLG engine and a TTS engine to produce a context aware and personalized audio response, in accordance with an embodiment of the present teaching
  • FIG. 16D is a flowchart of an exemplary process of an NLG engine, in accordance with an embodiment of the present teaching
  • Fig. 16E is a flowchart of an exemplary process of a TTS engine, in accordance with an embodiment of the present teaching
  • FIG. 17A depicts an exemplary high level system diagram for adaptive personalized tutoring based on dynamic tracking and feedback, in accordance with an embodiment of the present teaching
  • Fig. 17B illustrates exemplary approaches that a robot agent teacher may adopt to tutor a student, in accordance with an embodiment of the present teaching
  • Fig. 17C provides exemplary aspects of a student user that a grader may dynamically observe, in accordance with an embodiment of the present teaching
  • Fig. 17D provides an example of standard acoustic/viseme features of a spoken language
  • Fig. 17E shows an example of an adaptive tutoring plan designed based on acoustic/viseme features of a user with respect to acoustic/viseme features of an underlying spoken language, in accordance with an embodiment of the present teaching
  • Fig. 17F shows an example of tutoring content to be used for a user based on deviation of viseme features of a user from that of an underlying spoken language, in accordance with an embodiment of the present teaching
  • FIG. 17G is a flowchart of an exemplary process for adaptive personalized tutoring based on dynamic tracking and feedback, in accordance with an embodiment of the present teaching
  • Fig. 18 is an illustrative diagram of an exemplary mobile device architecture that may be used to realize a specialized system implementing the present teaching in accordance with various embodiments.
  • Fig. 19 is an illustrative diagram of an exemplary computing device architecture that may be used to realize a specialized system implementing the present teaching in accordance with various embodiments.
  • the present teaching aims to address the deficiencies of traditional human machine dialogue systems and to provide methods and systems that enable rich representations of multimodal information from the conversation environment, allowing the machine to have an improved sense of the surroundings of the dialogue content and environment and to better adapt the dialogue with enhanced user engagement. Based on such representations, the present teaching further discloses different modes to create such representations and to author dialogue content in such representations. Furthermore, to allow adaptation of the representations based on the dynamics occurring during the conversation, the present teaching also discloses a mechanism for tracing the dynamics of the conversation and accordingly updating the representations, which are then used by the machine to conduct the dialogue in a utility-driven manner to achieve a maximized outcome.
  • Fig. 1A depicts an exemplary configuration of a dialogue system 100 centered around an information state 110 capturing dynamic information observed during the dialogue, in accordance with an embodiment of the present teaching.
  • the dialogue system 100 comprises a multimodal information processor 120, an automatic speech recognition (ASR) engine 130, a natural language understanding (NLU) engine 140, a dialogue manager (DM) 150, a natural language generation (NLG) engine 160, and a text-to-speech (TTS) engine 170.
  • multimodal information is collected from the environment (including from the user 180), which captures the surrounding information of the conversation environment, the speech and expressions, either facial or physical, of the user, etc.
  • Such collected multimodal information is analyzed by the multimodal information processor 120 to extract relevant features in different modalities in order to estimate different characteristics of the user and the environment.
  • the speech signal may be analyzed to determine speech related features such as talking speed, pitch, or even accent.
  • the visual signal related to the user may also be analyzed to extract, e.g., facial features or physical gestures, etc. in order to determine expressions of the user.
  • the multimodal information processor 120 may also be able to infer the emotional state of the user.
  • a high pitch in voice, fast talking, plus an angry facial expression may indicate that the user is upset.
  • observed user activities may also be analyzed to better understand the user. For example, if a user is pointing or walking towards a specific object, it may reveal what the user is referring to in his/her speech.
  • Such multimodal information may provide useful context in understanding the intent of the user.
  • the multimodal information processor 120 may continuously analyze the multimodal information and store such analyzed information in the information state 110, which is then used by different components in system 100 to facilitate decision makings related to dialogue management.
  • speech information from the user 180 is sent to the ASR engine 130. The speech recognition may include discerning the language spoken and the words uttered by the user 180.
  • the result from the ASR engine 130 is further processed by the NLU engine 140.
  • Such understanding may rely on not only the words being spoken but also other information such as the expression and gesture of the user 180 and/or other contextual information such as what was said previously.
  • the dialogue manager 150 determines how to respond to the user. Such a determined response may then be generated via the NLG engine 160 in a text form and further transformed from the text form to speech signals via the TTS engine 170. The output of the TTS engine 170 may then be delivered to the user 180 as a response to the user’s utterance. The process continues via such back and forth interactions for the machine dialogue system to carry on the conversation with the user 180.
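  • By way of illustration only (not part of the patent disclosure), the following Python sketch shows one plausible shape of the turn loop just described, with the information state 110 as a shared store read and written by every stage; all class and function names here are hypothetical.

```python
# Illustrative sketch only -- not the patent's implementation. Shows one
# plausible shape of the ASR -> NLU -> DM -> NLG -> TTS turn loop, with
# the information state 110 as a shared store.
from dataclasses import dataclass, field
from typing import Any, Callable, Dict

@dataclass
class InformationState:
    """Shared store of dialogue dynamics (information state 110)."""
    context: Dict[str, Any] = field(default_factory=dict)

    def update(self, key: str, value: Any) -> None:
        self.context[key] = value

def dialogue_turn(audio: Any, visual: Any, state: InformationState,
                  asr: Callable, nlu: Callable, dm: Callable,
                  nlg: Callable, tts: Callable) -> Any:
    """One back-and-forth turn of the dialogue loop."""
    state.update("surround", visual)    # multimodal analysis enriches state
    words = asr(audio, state)           # speech -> words, context aware
    semantics = nlu(words, state)       # words -> meaning, context aware
    decision = dm(semantics, state)     # choose how to respond
    text = nlg(decision, state)         # response in textual form
    return tts(text, state)             # rendered speech for the user
```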
  • the information state 110 captures the dynamics around the dialogue and provides relevant and rich contextual information that can be used to facilitate speech recognition (ASR), language understanding (NLU), and various dialogue related determinations, including what is an appropriate response (DM), what linguistic feature to be applied to the textual response (NLG), and how to convert a textual response to a speech form (TTS) (e.g., what accent).
  • the information state 110 may represent the dynamics relevant to a dialogue obtained based on multimodal information, either related to the user 180 or to the surrounding of the dialogue.
  • Upon receiving the multimodal information from the dialogue scene (either about the user or about the dialogue surroundings), the multimodal information processor 170 analyzes the information and characterizes the dialogue surroundings in different dimensions, e.g., acoustic characteristics (e.g., pitch, speed, accent of the user), visual characteristics (e.g., facial expressions of the user, objects in the environment), physical characteristics (e.g., the user's hand waving or pointing at an object in the environment), estimated emotion and/or state of mind of the user, and/or preferences or intent of the user. Such information may then be stored in the information state 110.
  • the rich media contextual information stored in the information state 110 may facilitate different components to play their respective roles so that the dialogue may be conducted in a way that is adaptive, more engaging, and more effective with respect to the goals intended.
  • rich contextual information can improve understanding the utterance of the user 180 in light of what was observed in the dialogue scene, assessing the performance of the user 180 and/or estimating the utilities (or preferences) of the user in light of the intended goal of a dialogue, determining how to respond to the utterance of the user 180 based on the estimated emotional state of the user, and delivering a response in a manner that is considered most appropriate based on what is known about the user.
  • the ASR engine 130 may utilize that information to figure out the words a user said.
  • the NLU engine 140 may also utilize the rich contextual information to figure out the semantics of what a user means. For instance, if a user points to a computer placed on a desk (visual information) and says, "I like this," the NLU engine 140 may combine the output of the ASR engine 130 (i.e., "I like this.") with the visual information that the user is pointing at a computer in the room to understand that by "this" the user means the computer.
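  • As an illustration of such context aware understanding, the hypothetical sketch below resolves the demonstrative in "I like this" against a tracked pointing target from the visual channel; the function name and dictionary keys are invented for this example.

```python
# Hypothetical sketch: resolve a demonstrative ("this"/"that") using a
# tracked pointing target from the visual channel.
def resolve_deixis(utterance_tokens, visual_context):
    """Replace demonstratives with the object the user is pointing at."""
    target = visual_context.get("pointing_at")  # e.g., "computer"
    if target is None:
        return utterance_tokens
    return [target if tok.lower() in ("this", "that") else tok
            for tok in utterance_tokens]

# Example: user points at a computer while saying "I like this."
print(resolve_deixis(["I", "like", "this"], {"pointing_at": "computer"}))
# -> ['I', 'like', 'computer']
```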
  • the DM 140 may determine to change the topic temporarily based on known interests of the user (e.g., talk about Lego games) in order to continue to engage the user.
  • the decision of distracting the user may be determined based on, e.g., utilities previously observed with respect to the user as to what worked (e.g., intermittently distracting the user has worked in the past) and what would not work (e.g., continue to pressure the user to do better).
  • Fig. 1B is a flowchart of an exemplary process of the dialogue system 100 with the information state 110 capturing dynamic information observed during the dialogue, in accordance with an embodiment of the present teaching.
  • the process is iterative.
  • multimodal information is received, which is then analyzed by the multimodal information processor 170 at 125.
  • the multimodal information includes information related to the user 180 and/or that related to the dialogue surroundings.
  • Multimodal information related to the user may include the user’s utterance and/or visual observations of the user such as physical gestures and/or facial expressions.
  • Information related to the dialogue surroundings may include information related to the environment such as objects present, the spatial/temporal relationships between the user and such observed objects (e.g., the user stands in front of a desk), and/or the dynamics between the user's activities and the observed objects (e.g., the user walked towards the desk and pointed at a computer on the desk).
  • An understanding of the multimodal information captured from the dialogue scene may then be used to facilitate other tasks in the dialogue system 100.
  • Based on the information stored in the information state 110 (representing the past state) as well as the analysis results from the multimodal information processor 170 (on the present state), the ASR engine 120 and the NLU engine 130 perform, at 125, respectively, speech recognition to ascertain the words spoken by the user and language understanding based on the recognized words.
  • the changes of the dialogue state are traced, at 135, and such changes are used to update, at 145, the information state 110 accordingly to facilitate the subsequent processing.
  • the DM 140 determines, at 155, a response based on a dialogue tree designed for the underlying dialogue, the output of the NLU engine 130 (understanding of the utterance), and the information stored in the information state 110. Once the response is determined, the response is generated, by the NLG engine 150, in, e.g., its textual form based on the information state 110.
  • the NLG engine 150 may generate, at 165, a response in a style based on the user's preferences or what is known to be more appropriate for the particular user in the current dialogue. For instance, if the user answers a question incorrectly, there are different ways to point out that the answer is incorrect. For a particular user in the present dialogue, if it is known that the user is sensitive and easily gets frustrated, a gentler way of telling the user that his/her answer is not correct may be used to generate the response. For example, instead of saying "It is wrong," the NLG engine 150 may generate a textual response of "It is not completely correct."
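  • A minimal sketch of such style selection, assuming a hypothetical sensitivity flag in the user profile (all names invented for illustration):

```python
# Hypothetical sketch of style selection for a "wrong answer" response,
# conditioned on a user-sensitivity flag in the information state.
def phrase_incorrect(state: dict) -> str:
    if state.get("user_profile", {}).get("sensitive", False):
        return "It is not completely correct."
    return "It is wrong."

print(phrase_incorrect({"user_profile": {"sensitive": True}}))
# -> It is not completely correct.
```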
  • the textual response, generated by the NLG engine 150 may then be rendered into a speech form, at 175 by the TTS engine 160, e.g., in an audio signal form.
  • the response generated by the NLG engine 150 may be further personalized based on information stored in the information state 110. For instance, if it is known that a slower talking speed or a softer talking manner works better for the user, the generated response may be rendered, at 175 by the TTS engine 160, into a speech form accordingly, e.g., with a lower speed and pitch. Another example is to render the response with an accent consistent with the student’s known accent according to the personalized information about the user in the information state 110.
  • the rendered response may then be delivered, at 185, to the user as a response to the user’s utterance.
  • the dialogue system 100 traces the additional change of the dialogue and updates, at 195, the information state 110 accordingly.
  • Fig. 2A depicts an exemplary construction of the information state representation 110, in accordance with an embodiment of the present teaching.
  • the information state 110 includes representations for estimated minds or mindsets.
  • different representations may be estimated to represent, e.g., the agent's mindset 200, the user's mindset 220, and a shared mindset 210, in connection with other information recorded therein.
  • the agent’s mindset 200 may refer to the intended goal(s) that the dialogue agent (machine) is to achieve in a particular dialogue.
  • the shared mindset 210 may refer to the representation of the present dialogue situation which is a combination of the agent’s carrying out the intended agenda according to the agent’s mindset 200 and the actual performance of the user.
  • the user's mindset 220 may refer to the representation of an estimation, by the agent according to the shared mindset or the performance of the user, of where the student is with respect to the intended purpose of the dialogue. For example, if an agent's current task is teaching a student user the concept of fraction in math (which may include sub-concepts to build up the understanding of fraction), the user's mindset may include estimated levels of mastery of the user on various related concepts. Such an estimation may be derived based on an assessment of the student's performance at different stages of tutoring such relevant concepts.
  • Fig. 2B illustrates how such representations of different mindsets are connected in an example of a robot tutor 205 teaching a student user 180 on concept 215 related to adding fractions, in accordance with an embodiment of the present teaching.
  • a robot agent 205 is interacting with a student user 180 via multimodal interactions.
  • the robot agent 205 may start with the tutoring based on an initial representation of the agent’s mindset 200 (e.g., the course on adding fractions which may be represented as AOGs).
  • student user 180 may answer questions from the robot tutor 205 and such answers to the questions form a certain dialogue path, enabling estimation of a representation of the shared mindset 210.
  • the performance of the user is assessed and a representation of the user's mindset 220 is estimated with respect to different aspects, e.g., whether the student masters the concept taught and what dialogue style works for this particular student.
  • the representations of the estimated mindsets are based on graph-related forms, including but not limited to Spatial-Temporal-Causal And-Or-Graphs (STC-AOGs) 230 and STC parsed graphs (STC-PGs) 240, and may be used in connection with other types of information stored in the information state such as dialogue history 250, dialogue context 260, event-centric knowledge 270, common sense models 280, ..., and user profiles 290. These different types of information may be of multiple modalities and constitute different aspects of the dynamics of each dialogue with respect to each user.
  • the information state 110 captures both general information of various dialogues and personalized information with respect to each user and each dialogue, interconnected to facilitate different components in the dialogue system 100 carrying out their respective tasks in a more adaptive, personalized, and engaging manner.
  • Fig. 2C shows an exemplary relationship among the agent’s mindset 200, the shared mindset 210, and a user’s mindset 220 represented in the information state 110, in accordance with an embodiment of the present teaching.
  • the shared mindset 210 represents a state of the dialogue achieved via interactions between an agent and a user and is a combination of what the agent intended (according to the agent's mindset) and how the user performed in following the agent's intended agenda. Based on the shared mindset 210, the dynamics of the dialogue may be traced as to what the agent is able to achieve and what the user is able to achieve up to that point.
  • the user's mindset 220 may be inferred or estimated, which will be used to determine how the agent may further adjust or update the dialogue strategy in order to achieve the intended goal or to adjust the agent's mindset to adapt to the user.
  • the process of adjusting the agent's mindset enables deriving an updated agent's mindset 200.
  • the dialogue system 100 learns the preferences of the user or what works better for the user (utility).
  • Such information is to be used to adapt the dialogue strategy via utility-driven (or preference-driven) dialogue planning.
  • An updated dialogue strategy drives the next step in the dialogue, which in turn leads to a response from the user, and subsequently to updates of the shared mindset, the user's mindset, and the agent's mindset.
  • the process iterates so that the agent can continue to adapt dialogue strategy based on the dynamic information state.
  • different mindsets are represented based on, e.g., STC-AOGs and STC-PGs.
  • Fig. 3A shows exemplary relationships among different types of And-Or-Graphs (AOGs) used to represent estimated mindsets of parties involved in a dialogue, in accordance with an embodiment of the present teaching
  • AOGs are graphs with AND and OR branches. Branches associated with a node in an AOG and related by an AND relation represent tasks that need to be all traversed. Branches from a node in an AOG and related by an OR relation represent tasks that can be alternatively traversed.
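  • A minimal data-structure sketch of these AND/OR semantics, assuming a simple recursive node type (all names hypothetical): an AND node requires all children to be traversed, while an OR node offers alternatives.

```python
# Illustrative sketch of AND/OR traversal semantics in an And-Or-Graph.
from dataclasses import dataclass, field
from typing import List

@dataclass
class AOGNode:
    name: str
    kind: str = "leaf"                 # "and", "or", or "leaf"
    children: List["AOGNode"] = field(default_factory=list)

def enumerate_traversals(node: AOGNode):
    """Yield lists of leaf names reachable under the AND/OR semantics."""
    if node.kind == "leaf" or not node.children:
        yield [node.name]
    elif node.kind == "or":            # pick exactly one alternative
        for child in node.children:
            yield from enumerate_traversals(child)
    else:                              # "and": traverse all children
        combos = [[]]
        for child in node.children:
            combos = [c + t for c in combos
                      for t in enumerate_traversals(child)]
        yield from combos

fraction = AOGNode("fraction", "and", [
    AOGNode("add"), AOGNode("subtract"),
    AOGNode("multiply-or-divide", "or",
            [AOGNode("multiply"), AOGNode("divide")]),
])
for path in enumerate_traversals(fraction):
    print(path)   # two alternatives: via multiply, or via divide
```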
  • STC-AOGs include S- AOGs corresponding to spatial AOGs, T-AOGs corresponding to temporal AOGs, and C-AOGs corresponding to causal AOGs.
  • an S-AOG is a graph comprising nodes each of which may correspond to a topic to be covered in a dialogue.
  • a T-AOG is a graph comprising nodes each of which may correspond to a temporal action to be taken.
  • Each T-AOG may be associated with a topic or node in an S-AOG, i.e., representing steps to be carried out during a dialogue on the topic corresponding to the S-AOG node.
  • a C-AOG is a graph comprising nodes each of which may link to a node in a T-AOG and a node in a corresponding S-AOG, representing an action occurring in the T-AOG and its causal effect on the node in the corresponding S-AOG.
  • Fig. 3B depicts exemplary relationships between nodes in an S-AOG and nodes of an associated T-AOG represented in the information state 110, in accordance with an embodiment of the present teaching.
  • each K node corresponds to a node in an S-AOG, representing a skill or a topic to be taught in a dialogue.
  • An evaluation with respect to each K node may include "mastered" or "not-yet-mastered" with, e.g., respective probabilities P(T) and 1-P(T), where P(T) represents the transition probability from the not-yet-mastered to the mastered state.
  • P(L0) denotes the probability of prior learning skill or prior knowledge on a topic, i.e., the likelihood that a student already mastered the concept before the tutoring session starts.
  • a robot tutor may ask a number of questions in accordance with a T-AOG associated with the K node and then the student is to answer each question.
  • Each question is shown as a Q node and a student answer is represented in Fig. 3B as an A node.
  • a student's answer may be a correct answer A(c) or a wrong answer A(w), as seen in Fig. 3B.
  • additional probabilities are determined based on, e.g., various knowledge or observations collected during the dialogue. For example, if a user provides a correct answer (A(c)), a probability P(G) of the answer being a guess may be determined, representing the likelihood that the student does not know the correct answer but guessed the correct answer. Conversely, 1-P(G) is the probability that the user knows the correct answer and answered correctly.
  • a probability P(S) may be determined, representing the likelihood that the student gives a wrong answer even though the student does know the concept (a slip).
  • Conversely, a probability of 1-P(S) may be estimated, representing the likelihood that the student gives the wrong answer because the student does not know the concept.
  • Such probabilities may be computed with respect to each node along a path traversed based on the actual dialogue and can be used to estimate when the student masters the concept and what may work well and what may not work well in terms of teaching this specific student on each specific topic.
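  • The parameters P(L0), P(T), P(G), and P(S) above mirror classic Bayesian Knowledge Tracing (BKT); the sketch below shows the standard per-answer BKT update as one plausible way such per-node probabilities may be computed. It is not asserted to be the patent's exact computation.

```python
# Standard Bayesian Knowledge Tracing update, shown for illustration.
def bkt_update(p_mastered: float, correct: bool,
               p_guess: float, p_slip: float, p_transit: float) -> float:
    """Return the updated probability that the student has mastered a K node."""
    if correct:
        evidence = p_mastered * (1 - p_slip)
        posterior = evidence / (evidence + (1 - p_mastered) * p_guess)
    else:
        evidence = p_mastered * p_slip
        posterior = evidence / (evidence + (1 - p_mastered) * (1 - p_guess))
    # Learning opportunity: not-yet-mastered -> mastered with P(T).
    return posterior + (1 - posterior) * p_transit

p = 0.3                                # P(L0): prior mastery
for answer in [True, False, True, True]:
    p = bkt_update(p, answer, p_guess=0.2, p_slip=0.1, p_transit=0.15)
    print(f"P(mastered) = {p:.3f}")
```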
  • Fig. 3C illustrates an exemplary S-AOG and its associated T-AOGs, in accordance with an embodiment of the present teaching.
  • each node corresponds to a topic or concept to be taught to a student user during a dialogue.
  • S-AOG 310 includes a node P0 or 310-1, representing the concept of fraction, a node P1 or 310-2, representing the concept of divide, a node P2 or 310-3, representing the concept of multiply, a node P3 or 310-4, representing the concept of add, and a node P4 or 310-5, representing the concept of subtract.
  • different nodes in the S-AOG 310 are related. For instance, to master the concept of fraction, at least some of the other concepts on add, subtract, multiply, and divide may need to be mastered first.
  • To teach a concept represented by an S-AOG node, e.g., fraction, there is a series of steps or a process that an agent may need to carry out in a dialogue session with a student user. Such a process or series of steps corresponds to a T-AOG.
  • for each node in an S-AOG, there may be multiple T-AOGs, each of which may represent a different way to teach a student and may be invoked in a personalized manner.
  • S-AOG node 310-1 has a plurality of T-AOGs 320, one of which is illustrated as 320-1 corresponding to a series of temporal steps of questions/answers 330, 340, 350, 360, 370, 380, ... , etc.
  • a choice of which T-AOG to use may vary and may be determined based on various considerations, e.g., the user in the session (personalized), the present level of mastery of the concept (e.g., P(L0)), etc.
  • a STC-AOG based representation of a dialogue captures entities/objects/concepts related to the dialogue (S-AOGs), possible actions observed during the dialogue (T-AOGs), and the impact of each of the actions on the entities/objects/concepts (C-AOGs).
  • Actual dialogue activities occurring during the dialogue may cause traversal of the corresponding graph representations or STC-AOGs, resulting in parsed graphs (PGs) corresponding to the traversed portions of the STC-AOGs.
  • an S-AOG may model a spatial decomposition of objects and scenes of the dialogue.
  • an S-AOG may model a decomposition of concepts and sub-concepts as discussed herein.
  • a T-AOG may model a temporal decomposition of events/sub-events/actions that may be performed or have occurred in a dialogue in connection with certain entities/objects/concepts represented in a corresponding S-AOG.
  • a C-AOG may model a decomposition of events represented in a T-AOG and their causal implications with respect to corresponding entities/objects/concepts represented in an S-AOG. That is, a C-AOG describes a change to a node in an S-AOG caused by an event/action taken in a dialogue and represented in a T-AOG.
  • Such information is with respect to different aspects of a dialogue and is captured in the information state 110. That is, the information state 110 represents the dynamics of a dialogue between a user and a dialogue agent. This is illustrated in Fig. 3D.
  • FIGS. 5A-5B provide an exemplary representation of a shared mindset via a T-PG yielded based on dialogues in a specific tutoring session;
  • FIGs. 6A-6B show exemplary representations of a user's mindset in terms of estimated levels of mastery of different concepts being taught in dialogues with a dialogue agent.
  • Fig. 4A shows an exemplary representation of an agent’s mindset with respect to tutoring on fraction, in accordance with an embodiment of the present teaching.
  • a representation of an agent’s mindset may reflect what the agent intends to or is designed to cover in a dialogue.
  • the agent’s mindset may adapt during a dialogue session based on the user’s performance/behavior so that its representation is to capture such dynamics or adaptation.
  • the exemplary representation of an agent's mindset, as illustrated in Fig. 4A, comprises various nodes, each of which represents a sub-concept in connection with the concept of fraction.
  • sub-concepts related to "understand fraction" 400, "compare fractions" 405, "understand equivalent fractions" 410, "expand and reduce equivalent fractions" 415, "find factor pairs" 420, "apply properties of multiplication/division" 425, "add fractions" 430, "find LCM" 435, "solve unknown in multiplication/division" 440, "multiply and divide within 100" 445, "simplify improper fractions" 450, "understand improper fraction" 455, and "add and subtract" 460.
  • sub-concepts may constitute the landscape of fraction and some sub-concepts may need to be taught before others, e.g., "understand improper fraction" 455 may need to be covered prior to "simplify improper fractions" 450, "add and subtract" 460 may need to be mastered prior to "multiply and divide within 100" 445, etc.
  • Fig. 4B illustrates an exemplary T-AOG representing an agent’s mindset in teaching a concept related to fraction, in accordance with an embodiment of the present teaching.
  • a T-AOG includes various steps associated with a dialogue, some of which relating to what an agent says, some of which relating to what a user responds, and some of which corresponding to certain evaluation directed to the conversation performed by the agent.
  • the agent proceeds to 490 to ask for user input, e.g., to tell the agent which highlighted one is a denominator. Based on the answer received from the student, the agent follows the two links combined by an OR (the plus sign), where each of the links represents one path the user takes. For instance, if the user answers correctly on which one is a denominator, the agent proceeds to 490-3 to, e.g., further ask the user to evaluate the denominator. If the user answers incorrectly, the agent proceeds to 490-4 to provide a hint to the user on denominator and then follows link 490-2 to go back to 490 asking the user for input again on which one is a denominator.
  • this T-AOG complements the S-AOG in Fig. 4A that represents the concepts the agent plans to cover in teaching a student the concept of fraction. Hence, they together form a part of a representation of the agent's mindset.
  • Fig. 4C shows exemplary dialogue content authored for teaching a concept associated with fraction, in accordance with an embodiment of the present teaching. With a similar dialogue policy, the conversation is intended to be carried out in a question-answer flow.
  • Fig. 5A illustrates an exemplary representation for a shared mindset in the form of a T-PG (corresponding to a path in the T-AOG in Fig. 4B), in accordance with an embodiment of the present teaching.
  • the highlighted steps form a specific path taken in a dialogue on actions carried out by a dialogue agent based on the answers from the user.
  • what is shown in Fig. 5A is a T-PG along various highlighted steps, e.g., 470, 510, 520, 530, 540, 550, ..., along the highlighted path.
  • the T-PG as shown in Fig. 5A represents an instantiated path traversed based on both the agent's and the user's actions, thus representing a shared mindset.
  • FIG. 5B illustrates a part of authored dialogue content between an agent and a user based on which a representation of a shared mindset may be obtained, in accordance with an embodiment of the present teaching.
  • a representation of a shared mindset may be derived based on a flow of a dialogue, forming a particular path or T-PG traversed along an existing T-AOG.
  • a dialogue agent estimates the mindset of a user engaged in the dialogue based on observations of the conversation with the user so that the dialogue and a representation of the estimated user’s mindset are both adapted based on the dynamics of the dialogue. For example, in order to determine how to proceed with a dialogue, the agent may need to assess or estimate, based on observations of the user, a level of mastery of the user with respect to a specific topic. The estimation may be probabilistic as discussed with respect to Fig. 3B.
  • the agent may infer a current level of mastery of a concept and determine how to further conduct the conversation, e.g., either continue to tutor on the current topic if the estimated level of mastery is not adequate or move on to another concept if the estimated level of the user's mastery of the current concept suffices.
  • the agent may assess regularly in the course of a dialogue and annotate (parameterize) a PG along the way to facilitate the decision of a next move in traversing a graph.
  • Such annotated or parameterized S-AOGs may yield S-PGs, indicating, e.g., which nodes in S-AOGs have been adequately covered and which ones have not.
  • FIG. 5C depicts an exemplary S-PG of a corresponding S-AOG, representing an estimated user’s mindset, in accordance with an embodiment of the present teaching.
  • the underlying S-AOG is illustrated in Fig. 4A.
  • each node in this S-AOG is assessed based on the conversation and is parameterized or annotated based on such assessment.
  • nodes representing different sub- concepts related to fraction are annotated with respective different parameters indicating, e.g., a level of mastery of the corresponding nodes.
  • the nodes in the initial S-AOG (in Fig. 4A) are now annotated in Fig. 5C with different weights, each of which indicates an assessed level of mastery of the sub-concept for the corresponding node.
  • the nodes in Fig. 5C are presented in different shades determined by the weights, representing different degrees of mastery of the underlying sub-concepts. For example, nodes that are dotted may correspond to those sub-concepts that have been mastered, so no further traversal is needed.
  • Nodes 560 and 565 (corresponding to "understand fraction" and "understand improper fraction") may correspond to the sub-concepts that have not reached a required mastery level. All nodes connected to these two nodes that are in-between, e.g., between mastered and not-yet-mastered, may be considered contributing reasons that the user has not yet mastered the concepts of fraction and improper fraction.
  • the mastery levels estimated as such for respective nodes in an original S-AOG yield an annotated S-PG, representing an estimated user's mindset that indicates degrees of understanding of the concepts associated with such nodes.
  • This provides a basis for a dialogue agent to understand the relevant landscape about a user, e.g., what the user already understood of what was taught and what the user still has problems with.
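  • For illustration only, assuming mastery weights and prerequisite links like those described above (all values and names invented), an agent could pick the weakest not-yet-mastered prerequisite to revisit next:

```python
# Hypothetical sketch: choose the next concept to tutor from an annotated
# S-PG, by chasing the weakest unmastered prerequisite of a goal node.
mastery = {"understand fraction": 0.45, "understand improper fraction": 0.40,
           "add and subtract": 0.95, "find LCM": 0.70}
prerequisites = {"understand fraction": ["add and subtract", "find LCM"],
                 "understand improper fraction": ["understand fraction"]}
THRESHOLD = 0.8   # mastery level considered adequate

def next_concept(goal: str) -> str:
    """Recurse to the weakest unmastered prerequisite of the goal."""
    weak = [p for p in prerequisites.get(goal, []) if mastery[p] < THRESHOLD]
    if not weak:
        return goal                      # all prerequisites mastered
    return next_concept(min(weak, key=lambda p: mastery[p]))

print(next_concept("understand improper fraction"))  # -> 'find LCM'
```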
  • the representation of a user’s mindset is estimated dynamically based on, e.g., the user’s performance and activities during an on-going dialogue.
  • Fig. 5D shows exemplary types of personality traits of a user that can be estimated based on information observed in a conversation, in accordance with an embodiment of the present teaching.
  • an agent may, via multimodal information processing (e.g., by the multimodal information processor 170), estimate, in different dimensions, various characteristics of the user in terms of, e.g., whether the user has an outgoing personality, how mature the user is, whether the user is mischievous, whether the user easily gets excited, whether the user is generally cheerful, how confident or secure the user feels about him/herself, whether the user is reliable, rigorous, etc.
  • Such information, once estimated, forms a profile of the user, which may influence how the dialogue system 100 adapts its dialogue strategy when needed and in what manner its agent should conduct a dialogue with the user.
  • Both S-AOGs and T-AOGs may have certain structures, organized based on, e.g., topics, concepts, or flow of the conversation.
  • Fig. 6A depicts an exemplary generic structure of an S-AOG related to a tutoring dialogue, in accordance with an embodiment of the present teaching. Instead of being subject-matter specific, this general structure as shown in Fig. 6A may be used for teaching any subject matter.
  • the exemplary structure comprises different stages involved in a tutoring dialogue represented as different nodes in the S-AOG.
  • a node 600 is for a dialogue related to a greeting
  • node 615 for teaching the subject matter intended
  • node 620 for testing a student user on the subject matter taught
  • Different nodes may be connected in a way encompassing different flows among underlying sub-dialogues but specific flow in each dialogue may be dynamically determined based on the situation.
  • Some branches out of a node may be related via an AND relationship and some branches out of a node may be related via an OR relationship.
  • a dialogue related to tutoring may start with the greeting dialogue 600, such as "Good morning," "Good afternoon," or "Good evening."
  • There are three branches out of the node 600: one to node 605 for a brief chitchat, one to node 610 for a review of previously learned knowledge, and one to node 615 to start teaching directly. These three branches are ORed together, i.e., a dialogue agent may follow any of the three branches.
  • from the chat node 605, there are also three branches: one to the teaching node 615, one to the testing node 620, and one to the review node 610.
  • the review node 610 also has two branches, one to the teaching node 615 and the other to the testing node 620 (the student may be tested first, before teaching, for prior knowledge or a prior level of mastery of the subject matter).
  • teaching and testing nodes are required dialogues so that the branches from nodes 605 and 610 to the teaching and testing nodes 615 and 620 are related by AND.
  • Teaching and testing may be iterated, as evidenced by the bidirectional arrows between nodes 615 and 620. Either the teaching node 615 or the testing node 620 may proceed to the evaluation node 625, as needed. That is, an evaluation may be carried out based on either a teaching result from node 615 or a testing result from node 620. Based on an evaluation result, the dialogue may proceed to one of several alternatives (related by OR): teaching 615 (to go over the concept again), testing 620 (re-testing), review 610 (to strengthen the user's understanding of some concepts), or even chat 605 (e.g., if the user is found frustrated, the dialogue system 100 may switch topics in order to continue to engage the user rather than lose the user).
  • This generic S-AOG for a tutoring related dialogue is provided as an illustration rather than a limitation. An S-AOG for tutoring may be derived according to any logic flows as needed by an application.
  • each of the nodes is itself a dialogue and, as discussed herein, may be associated with one or more T-AOGs, each representing a flow of conversation directed to the subject matter intended.
  • Fig. 6B depicts an exemplary T-AOG with dialogue content authored for S-AOG node 600 on greeting, in accordance with an embodiment of the present teaching.
  • a T-AOG may be defined as a dialogue policy for a dialogue. Following the steps defined in a T-AOG is to carry out the policy to achieve some intended purposes.
  • content in each rectangular box represents what is to be spoken by an agent and content in an ellipse represents what a user responds.
  • the agent first says one of three alternative greetings, i.e., good morning 630-1, good afternoon 630-2, and good evening 630-3.
  • a response to such a greeting from a user may differ. For example, a user may repeat what the agent said (i.e., good morning, good afternoon, or good evening). Some may repeat and then add "to you, too," 635-1. Some may say "Thank you, and you?" at 635-2. Some may say both 635-1 and 635-2. Some may simply remain silent 635-3.
  • a dialogue agent may then answer the user’s response. Each answer may correspond to the user’s response, with respect to each of the alternative responses from the user. This is illustrated by the content at 640-1, 640-2, and 640-3 in Fig. 6B.
  • the T-AOG shown in Fig. 6B may encompass multiple T-AOGs.
  • 630-1, 635-2, and 640-2 in Fig. 6B may constitute a T-AOG for greeting.
  • 630-1, 635-1, and 640-1 may correspond to another T-AOG for greeting;
  • 630-2, 635-1, 640-1 may form another one;
  • 630-1, 635-3, 640-3 form a different one;
  • 630-2, 635-3, and 640-3 may form yet another different one, etc.
  • these alternative T-AOGs all have substantially similar structure and generic content. Such a commonality may be utilized to generate a simplified T-AOG with flexible content associated with each node.
  • T-AOGs related to greetings, although with different authored content for the greetings, all have a similar structure, i.e., an initial greeting, plus a response from a user, plus a response to the user's response to the greeting.
  • the T-AOG in Fig. 6B may not correspond to the most simplified generic T-AOG for greeting.
  • AOGs may be parameterized. Such parameterization may be applied to both S-AOGs and T-AOGs in terms of both parameters associated with nodes in the AOGs as well as parameters associated with links between different nodes, according to different embodiments of the present teaching.
  • Fig. 6C shows different exemplary types of parameterization, in accordance with an embodiment of the present teaching.
  • parameterized AOGs include parameterized S-AOGs and T-AOGs.
  • each of its nodes may be parameterized with, e.g., a reward representing the benefit obtained by covering the subject matter or topic/concept associated with the node.
  • the more a student user is already familiar with the concept associated with a node in an S-AOG (e.g., has already mastered the concept), the lower the reward assigned to the node, because there is no further benefit in teaching the student the associated concept.
  • Such rewards associated with nodes may be updated dynamically during the course of a tutoring dialogue. This is illustrated in Fig. 6D, which shows the S-AOG 310 with nodes associated with relevant math concepts related to fraction. As seen, each node representing a relevant concept is parameterized with a reward, estimated to indicate whether there is a benefit to teaching the student that concept.
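  • A toy sketch of such reward parameterization (mastery values invented for illustration), where a node's teaching reward shrinks as its estimated mastery grows:

```python
# Hypothetical reward parameterization: the closer a node's estimated
# mastery is to 1, the smaller the reward of teaching it again.
def node_reward(p_mastered: float) -> float:
    return 1.0 - p_mastered

rewards = {node: node_reward(p) for node, p in
           {"add": 0.95, "subtract": 0.9, "multiply": 0.6,
            "divide": 0.5, "fraction": 0.2}.items()}
best = max(rewards, key=rewards.get)
print(rewards, "-> teach:", best)    # "fraction" carries the highest reward
```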
  • Each node in an S-AOG may have different branches and each branch leads up to another node associated with a different topic.
  • Such branches may also be associated with parameters such as probabilities to take the respective branches, as represented in Fig. 6C.
  • Parameterizing paths in an AOG is also illustrated in Fig. 6D.
  • Teaching towards fraction may require building up the knowledge starting from add and subtract and then multiplication and division.
  • the parameterized probability Pa,s may indicate the likelihood of success in teaching a student to understand the concept of "subtract" if the concept of "add" is taught first.
  • probability Ps,a may indicate the likelihood of success in teaching the student to understand add if subtract is taught first.
  • the branches from "add" to "multiplication"/"division" are parameterized with probabilities Pa,m and Pa,d, respectively.
  • the branches from "subtract" to "multiplication"/"division" are also parameterized with probabilities Ps,m and Ps,d, respectively.
  • a dialogue agent may select an optimized path by maximizing the probability of success in teaching intended concepts in an order that may work better.
  • Such probabilities may also be updated dynamically based on, e.g., observations from the dialogue. In this manner, the best course of action in teaching a student may be adapted in real time based on individual situations.
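  • The following sketch illustrates utility-driven path planning over such parameterized branches: among candidate teaching orders, pick the one whose chained success probabilities give the highest product. All probability values are invented for illustration.

```python
# Illustrative path planning: maximize the product of branch probabilities.
from itertools import permutations
from math import prod

# P[(x, y)]: probability of success teaching y right after x (made-up values).
P = {("add", "subtract"): 0.9, ("subtract", "add"): 0.7,
     ("add", "multiply"): 0.6, ("subtract", "multiply"): 0.8,
     ("add", "divide"): 0.5, ("subtract", "divide"): 0.6,
     ("multiply", "divide"): 0.85, ("divide", "multiply"): 0.65}

def path_score(order):
    """Product of success probabilities along consecutive pairs."""
    return prod(P.get(pair, 0.0) for pair in zip(order, order[1:]))

orders = permutations(["add", "subtract", "multiply", "divide"])
best = max(orders, key=path_score)
print(best, f"score = {path_score(best):.3f}")
# -> ('add', 'subtract', 'multiply', 'divide') score = 0.612
```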
  • Parameterization may also be applied to T-AOGs, as indicated in Fig. 6C.
  • a T-AOG represents a dialogue policy directed to a specific topic.
  • Each node in a T-AOG represents a specific step in a dialogue, often relating to what a dialogue agent is to say to a user, what a user is going to respond to the dialogue agent, or an evaluation of the conversation.
  • content associated with nodes in a T-AOG may be parameterized. This is illustrated in Fig. 6E, according to an embodiment of the present teaching, which shows an exemplary parameterized T-AOG corresponding to the T-AOG illustrated in Fig. 6B.
  • the initial greeting is now parameterized as "Good [ _ ]!"
  • a verbal answer 655-1 may be parameterized with different choices of content to respond to the initial greeting, as shown in braces associated with 655-1. That is, anything included in the parameterized set 655-1 may be recognized as a possible answer from a user to respond to the initial greeting from the agent.
  • the content for such a response from an agent at 660-1 may also be parameterized to be a set of all possible responses.
  • the response content 660-2 by an agent in the event of a silence from the user may also be similarly parameterized.
  • a T-AOG for such a testing may comprise the steps of presenting a problem (665), inquiring the student for an answer (667), the student providing the answer (670-1 or 675-1), responding to the user's answer (670-2 or 675-2), and then evaluating the reward associated with the S-AOG node for "add" (677). For each node of the T-AOG in Fig. 6F, the content associated therewith is parameterized.
  • the parameters involved include X, Y, Oi, where X and Y are numerical numbers and Oi refers to an object of type i.
  • the first step at 665 of the testing is to present X objects of type 1 ("o1") and Y objects of type 2 ("o2"), where X and Y are instantiated with numerical numbers (3 and 4) and "o1" and "o2" may be instantiated with types of objects (such as apple and orange). Based on such an instantiation of parameters, a specific testing question can be generated.
  • according to the T-AOG in Fig. 6F, a dialogue agent is to ask a user what is the sum of X objects of type "o1" and Y objects of type "o2."
  • the text problem may be presented with "3 apples, 4 oranges" (or even a picture of the same) and a testing question may be asked by instantiating the parameterized question to inquire about the sum X+Y, e.g., "how many fruits are there?" or "can you tell me the total number of fruits?"
  • flexible testing questions may be generated under the generic and parameterized T-AOG.
  • parameterized testing questions may also facilitate generating an expected correct answer.
  • Such a dynamically generated expected correct answer may then be used to evaluate an answer from a student user in response to the question.
  • a T-AOG may be parameterized with a simpler graphical structure yet at the same time enable a dialogue agent to flexibly configure different dialogue content in a parameterized framework to carry out the intended task.
  • a dialogue agent may also dynamically derive a basis for the evaluation for the testing.
  • the answer may be classified as a not-know answer (e.g., when the user either did not respond at all or the answer does not contain a number) at 670-1 or an answer with a number (either a correct or an incorrect number) at 675-1.
  • a response to an answer may also be parameterized. Either a not-know answer or an incorrect answer may be considered as a not-correct answer, which can be responded to at 670-2 using a parameterized response.
  • a response to a not-correct answer may be further classified into different situations and, in each situation, the response content may be parameterized with appropriate content suitable for that classification. For example, when a not-correct answer is an incorrect total, the response may be parameterized to address the incorrect answer, which can be due to a mistake or a guess. If a not-correct answer is because the user simply does not know, e.g., did not answer at all, the response may be parameterized to be directed to that situation with appropriate response alternatives. Similarly, the response to a correct answer may also be parameterized to address whether it is indeed a correct answer or is estimated to be a lucky guess.
  • the T-AOG includes a step at 677 for evaluating the current reward for mastery of the "add" concept.
  • the process may return to 665 to test the student on more problems. It may also proceed to "teaching" if the evaluation reveals that the student did not quite understand the concept, or exit if the student is viewed as having already mastered the concept. In some situations, the process may also go to exceptions if, e.g., it is detected that the student simply cannot do the work, so that the system may consider switching topics temporarily as discussed with respect to Fig. 6A.
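As an illustration of how the parameterized testing T-AOG of Fig. 6F might be instantiated and evaluated, a minimal sketch follows; the helper names and the answer-classification rule are assumptions for illustration only:

```python
import random

def make_test_question(x, y, o1, o2, collective):
    """Instantiate the parameterized testing step of Fig. 6F: present X objects
    of type o1 and Y objects of type o2, ask for the total, and derive the
    expected correct answer X + Y at the same time."""
    problem = f"{x} {o1}s, {y} {o2}s"
    question = random.choice([
        f"how many {collective} are there?",
        f"can you tell me the total number of {collective}?",
    ])
    return problem, question, x + y

def classify_answer(answer_text, expected):
    """Classify a user answer per nodes 670-1/675-1: a not-know answer vs. an
    answer with a number, which is then either correct or incorrect."""
    digits = [int(t) for t in answer_text.split() if t.isdigit()]
    if not digits:
        return "not-know"
    return "correct" if digits[0] == expected else "incorrect"

problem, question, expected = make_test_question(3, 4, "apple", "orange", "fruits")
print(problem, "-", question)                  # 3 apples, 4 oranges - how many fruits ...
print(classify_answer("I think 7", expected))  # correct
```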
  • a T-AOG corresponds to a dialogue policy dictating alternative possible flows of a conversation; an actual conversation may traverse part of the T-AOG by following a particular path.
  • Different users or the same user in different dialogue sessions may yield different paths embedded in the same T-AOG.
  • Such information may be useful in allowing the dialogue system 100 to personalize the dialogues by parameterizing links along different pathways with respect to different users and such parameterized paths may represent what works and what does not work well with respect to each user. For example, for each link between two nodes in a T-AOG, a reward of the link may be estimated with respect to the performance of each student user towards understanding of the underlying concept taught.
  • Fig. 6G illustrates a T-AOG associated with a user with different paths between different nodes parameterized with rewards updated based on dynamic information observed in conversations with the user, in accordance with an embodiment of the present teaching.
  • in this exemplary parameterized T-AOG (which is similar to that presented in Fig. 6F), after presenting, at 680, X objects of type 1 and Y objects of type 2 to a student user, the agent inquires, at 685, about the total of X+Y.
  • if the answer from the student is incorrect (690-2), there may be different ways to respond, e.g., 695-1, 695-2, and 695-3.
  • Based on past experience or the known personality of the user (again estimated in a personalized manner), there may be different reward scores associated with each possible response, R22, R23, and R24, respectively. For instance, if it is known that the user is sensitive and works better with encouragement or a positive manner, the reward associated with response 695-2 may be the highest.
  • the dialogue system 100 may then select to respond to an incorrect answer with response 695-2, which is more positive and encouraging. For example, the dialogue agent may say "Almost there ..."
  • the dialogue system 100 may dynamically configure and update, during each dialogue, the parameters to personalize the AOGs so that the dialogues may be conducted in a flexible (content is parameterized), personalized (parameters are computed based on personalized information), and, hence, a more productive manner.
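A minimal sketch of such reward-driven response selection for the incorrect-answer branch of Fig. 6G; the response texts and reward values are illustrative assumptions:

```python
# Candidate responses to an incorrect answer (nodes 695-1..695-3 in Fig. 6G),
# each parameterized with a personalized reward (R22, R23, R24)
responses = {
    "695-1": ("That is not right.", 0.2),        # R22
    "695-2": ("Almost there ...", 0.7),          # R23: encouraging tone
    "695-3": ("Let's look at it again.", 0.4),   # R24
}

def select_response(candidates):
    """Choose the response whose personalized reward is highest; for a user
    known to respond well to encouragement, 695-2 wins."""
    node = max(candidates, key=lambda n: candidates[n][1])
    return node, candidates[node][0]

print(select_response(responses))  # ('695-2', 'Almost there ...')
```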
  • information state 110 is represented based on not only AOGs but also a variety of types of information such as dialogue context 260, dialogue history 250, user profile 290, event-centric knowledge 270, and some commonsense models 280.
  • Representations of different mindsets 200-220 are determined based on dynamically updated AOGs as well as other information from 250-290.
  • AOGs are used to represent different mindsets, and their corresponding PGs are generated based on the actual traversal (nodes and paths) in the AOGs and the dynamic information collected during the dialogue.
  • values of parameters associated with nodes/links in AOGs may be dynamically estimated based on, e.g., the on-going dialogue, the dialogue history, the dialogue contexts, user profiles, events that occurred during dialogues, etc.
  • different types of information such as knowledge about events, surroundings, user’s characteristics, activities, etc., may be traced and such traced knowledge may then be used to update different parameters and ultimately the information state 110.
  • AOGs/PGs are used to represent different mindsets, including a robot agent's mindset (designed in terms of what is intended to be accomplished), a representation of a shared mindset between a robot agent and a user (derived based on the actual dialogue that occurred), and a mindset of the user (estimated based on the dialogue that occurred and the performance of the user in the dialogue).
  • AOGs and PGs are parameterized; values of the parameters associated with nodes and links may be evaluated based on, e.g., information related to the dialogue, the performance of the user, the characteristics of the user, and optionally event(s) that occurred during the dialogue. Based on such dynamic information, representations of such mindsets can be updated over time during a dialogue based on the changing situations.
  • Fig. 7A depicts a high level system diagram of a knowledge tracing unit 700 for tracing information and updating rewards associated with nodes/paths in AOGs/PGs, in accordance with an embodiment of the present teaching.
  • nodes in an AOG may be respectively parameterized with state related rewards and paths in PGs may also be respectively parameterized with path related rewards or utilities.
  • a reward/utility associated with a state or a node in an AOG may include a reward/utility representing a level of mastery of a concept associated with the node.
  • each S-AOG node associated with a concept may have one or more T-AOGs, each of which may correspond to a specific way to teach the concept.
  • a parsed path or PG is formed based on nodes and links in a T-AOG traversed during a dialogue.
  • a reward/utility associated with a path in a T-AOG or T-PG may represent a likelihood that this path will lead to a successful tutoring session or a successful mastery of the concept. Given that, the better the performance of a user assessed when traversing along a path, the higher a path reward/utility associated with the path.
  • Such path related rewards/utilities may also be determined based on performances of a plurality of users which indicate statistically what teaching style works better over a group of users. Such estimated rewards/utilities along different branching paths may be especially helpful in a dialogue session when determining which branching path to take to continue a dialogue and may guide a dialogue agent to select a path that may have a statistically better chance to lead to a better performance, i.e., quicker to achieve the level of mastery of the concept.
  • the illustrated embodiment shown in Fig. 7A is directed to tracing state and path rewards during a dialogue.
  • rewards associated with nodes and paths are determined based on different probabilities estimated based on the dynamic situations of the dialogue. For instance, for a node in an S-AOG associated with, e.g., the concept of "add," its reward is a state based reward representing whether there is a return or reward by teaching the concept "add" to a specific user. For each student user who signs up to learn math from a robot agent, the reward value for each node of an S-AOG on math concepts is adaptively computed.
  • the reward for each node in such an S-AOG may be assigned an initial reward value and the reward value may continue to change as the user goes through a dialogue dictated by an associated T-AOG (flows of conversation on the concept "Add").
  • the robot agent may ask questions to the user who then answers the questions.
  • the answers from the user may be continuously evaluated and probabilities are estimated on whether the user is learning or making progress.
  • Such probabilities estimated while traversing the T-AOG may be used to estimate the reward associated with the node in the S-AOG (i.e., the node representing the concept "Add") indicating whether the user has mastered the concept.
  • the reward value associated with a node representing the concept is updated during the dialogue. If the teaching is successful, the reward may drop to a low value indicating that there is no further value or reward to teach the student the concept because the student already mastered the concept. As seen, such state based rewards are personalized because they are computed based on the performance of each user in the dialogues.
  • a T-AOG comprises different nodes, each of which may have multiple branches representing alternative pathways. Choices of different branches lead to different traversals of the underlying T-AOG and each traversal yields a T-PG.
  • different branches may be associated with respective measurements which may indicate the likelihoods of achieving the intended goal when respective branches are selected. The higher the measurement associated with a branch, the more likely that it will lead to a path that will fulfill the intended purpose. However, optimizing a selection of a branch out of each of individual nodes may not lead to an overall optimal path.
  • optimization may be performed based on a path, i.e., the optimization is to be performed with respect to a path (of a certain length).
  • a path based optimization may be implemented as a look-ahead operation, i.e., in considering the next K choices along a path, what is the best choice at the current branch with respect to a current node.
  • Such a look-ahead operation bases the selection of a branch on a compound measurement along each possible path, determined from measurements accumulated over the links from the current node along that path.
  • the length of the look-ahead may vary and may be determined based on application needs.
  • the compound measurements associated with all the alternative paths may be referred to as path based rewards.
  • a choice of a branch from the current node may then be made by maximizing the path-based reward over all possible traversals from that current node.
  • the rewards along paths of a T-AOG may be determined based on a number of probabilities determined based on performance of a user observed during a dialogue. For example, at a current node in the T-AOG, a dialogue agent may ask a question to a student user and then receives an answer from the user in response to the question, where the answer corresponds to a branch stemming from the node for the question in the T-AOG. A measurement relating to a reward for the branch may then be estimated based on the probabilities. Such measurements, and hence the path based rewards, are personalized because they are computed based on personal information observed from a dialogue involving a specific user.
  • the measurements associated with different branches along a path in a T-AOG may be used to estimate the reward of the S-AOG node as to the level of mastery of the student.
  • the rewards, including both node based and path based rewards, may constitute "utilities" or preferences of the user and can be used by a robot agent to adaptively determine how to proceed with a dialogue in utility-driven dialogue planning. This is shown in Figs. 8A - 8B, discussed below.
  • the knowledge tracing unit 700 comprises an initial knowing probability estimator 710, a knowing positive probability estimator 720, a knowing negative probability estimator 730, a guessing probability estimator 740, a state reward estimator 760, a path reward estimator 750, and an information state updater 770.
  • Fig. 7C is a flowchart of an exemplary process of the knowledge tracing unit 700, in accordance with an embodiment of the present teaching.
  • initial knowing probabilities for nodes in relevant AOG representations may be first estimated at 705. This may include both the initial knowing probabilities for each relevant node in an S-AOG and each branch of each T-AOG associated with the S-AOG node.
  • a dialogue agent may proceed with a dialogue with a user on a certain topic represented by a relevant S-AOG node and a specific T-AOG for the S-AOG node with associated probabilities initialized.
  • the robot agent starts the dialogue by following the T-AOG.
  • the NLU engine 120 may analyze the response and produce a language understanding output.
  • the NLU engine 120 may also perform language understanding based on information besides the utterance, e.g., information from a multimodal information analyzer 702.
  • the multimodal information analyzer 702 may analyze both audio and visual information to combine cues in different modalities to facilitate the NLU engine 120 to make sense of what the user meant and output the user's response with, e.g., an assessment as to the correctness of the response based on, e.g., the T-AOG.
  • When the user's response with the assessment is received, at 715, by the knowledge tracing unit 700 to trace knowledge based on what occurred in the dialogue, different modules may be invoked to estimate respective probabilities based on the received inputs. For instance, if the user's response corresponds to a correct answer, the knowing positive probability estimator 720 may be invoked to determine the probabilities associated with positively knowing the correct answer; the knowing negative probability estimator 730 may be invoked to estimate the probabilities associated with not knowing the answer; the guessing probability estimator 740 may be invoked to determine the probabilities evaluating the likelihood that the user just made a guess.
  • if the user's response corresponds to an incorrect answer, the knowing positive probability estimator 720 may determine the probabilities associated with positively knowing the answer yet making a mistake; the knowing negative probability estimator 730 may estimate the probabilities associated with not knowing and the user answering wrong; the guessing probability estimator 740 may determine the probabilities that the answer is just a guess. These steps are performed at 725, 735, and 745, respectively.
  • as the dialogue progresses, a parsed graph is formed which continues to grow with the progression of the dialogue. One example is shown in Fig. 5A.
  • the probability that the user knows the underlying concept may be adaptively updated based on the estimated probabilities.
  • the probability of knowing the concept at time t+1 may be updated based on observations, where P(Lt) is the probability of initial knowledge (of knowing the concept) at time t, P(S) is the probability of a slip (answering incorrectly despite knowing), and P(G) is the probability of guessing (answering correctly without knowing).
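The probabilities named here, P(Lt), P(S), and P(G), match the form of a Bayesian Knowledge Tracing (BKT) update. A minimal sketch of one such update follows, assuming the classic BKT formulation with a learning-transition probability P(T) (an assumption not named in the text above):

```python
def bkt_update(p_know, correct, p_slip, p_guess, p_transit):
    """One BKT-style update of P(L_t) -> P(L_{t+1}).

    p_know   : P(L_t), prior probability the user knows the concept
    correct  : whether the observed answer was correct
    p_slip   : P(S), probability of answering wrong despite knowing
    p_guess  : P(G), probability of answering right without knowing
    p_transit: P(T), probability of learning the concept this step (assumed)
    """
    if correct:
        # Posterior P(L_t | correct answer)
        num = p_know * (1.0 - p_slip)
        den = num + (1.0 - p_know) * p_guess
    else:
        # Posterior P(L_t | incorrect answer)
        num = p_know * p_slip
        den = num + (1.0 - p_know) * (1.0 - p_guess)
    posterior = num / den
    # Account for the chance the user learns the concept during this step
    return posterior + (1.0 - posterior) * p_transit

# e.g., prior 0.4, correct answer, slip 0.1, guess 0.2, transition 0.15
p_next = bkt_update(0.4, True, 0.1, 0.2, 0.15)
```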
  • Such prior knowledge probability associated with a node in an S-AOG may then be used to compute, at 755 in Fig. 7C by the state reward estimator 760, the state based (node based) reward associated with the node, representing the user's mastery of the related skills associated with the concept node.
  • a path based reward may be computed, at 765 in Fig. 7C by the path reward estimator 750, with respect to each of the paths.
  • the information state updater 770 may then proceed to update, at 775, the parameterized AOGs in the information state 110.
  • the updated parameterized AOGs can then be used to control the dialogue based on utilities (preferences) of the user.
  • different parameters used to parameterize AOGs may be learned based on observations and/or computed probabilities.
  • unsupervised learning approaches may be employed to learn such model parameters. This includes, e.g., knowledge tracing parameters and/or utility/reward parameters. Such learning may be performed either online or offline. Below, an exemplary learning scheme is provided:
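A minimal sketch of one such unsupervised fit, assuming a BKT-style observation likelihood and a simple grid search over parameters; the present teaching's own scheme may differ:

```python
import itertools
import math

def sequence_loglik(answers, p_l0, p_slip, p_guess, p_transit):
    """Log-likelihood of a sequence of correct/incorrect answers under a
    BKT-style two-state model (forward pass over the hidden mastery state)."""
    p_know = p_l0
    ll = 0.0
    for correct in answers:
        p_correct = p_know * (1 - p_slip) + (1 - p_know) * p_guess
        ll += math.log(p_correct if correct else 1 - p_correct)
        # Posterior update followed by learning transition (as in bkt_update)
        post = (p_know * (1 - p_slip) / p_correct if correct
                else p_know * p_slip / (1 - p_correct))
        p_know = post + (1 - post) * p_transit
    return ll

def fit_by_grid_search(sequences, grid=(0.05, 0.15, 0.25, 0.35)):
    """Crude unsupervised fit: pick the parameter tuple maximizing total
    log-likelihood over all observed answer sequences."""
    best, best_ll = None, -math.inf
    for params in itertools.product(grid, repeat=4):
        ll = sum(sequence_loglik(seq, *params) for seq in sequences)
        if ll > best_ll:
            best, best_ll = params, ll
    return best  # (p_l0, p_slip, p_guess, p_transit)

print(fit_by_grid_search([[True, False, True, True], [False, True, True, True]]))
```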
  • Figs. 8A - 8B depict utility-driven dialogue planning based on dynamically computed AOG parameters, in accordance with embodiments of the present teaching.
  • Utility-driven dialogue planning may include dialogue node planning and dialogue path planning. The former may refer to selecting a node in an S-AOG to proceed with a dialogue session. The latter may refer to selecting a path in a T-AOG to conduct a dialogue.
  • Fig. 8A shows an example of utility-driven tutoring planning with respect to parameterized S-AOGs, in accordance with an embodiment of the present teaching.
  • Fig. 8B shows an example of utility-driven path planning in a parameterized T-AOG, in accordance with an embodiment of the present teaching.
  • an exemplary S-AOG 310 is for teaching various math concepts and each node corresponds to one concept. It is shown in Fig. 8A with different nodes having reward related parameters associated therewith, and some nodes therein may be parameterized with conditions formulated based on rewards of connected nodes. As seen in Fig. 8A, node 310-4 is for teaching the concept "Add," node 310-5 is for teaching the concept "Subtract," ..., etc. Each node is parameterized with, e.g., a reward indicating the reward of teaching the concept in connection with, e.g., a current mastery level of the concept. The rewards associated with some nodes in S-AOG 310 are expressed as functions of reward parameters from their connected nodes.
  • an exemplary condition for teaching the concept of "Division" 310-3 may be that its reward level has to be high enough (i.e., a user has not yet mastered the concept of "Division") and the reward Ra for "Add" (310-4) and the reward Rs for "Subtract" (310-5) have to be low enough (i.e., the user has already mastered the prerequisite concepts of "Add" and "Subtract").
  • the mathematical formulation of the function Fd may be devised according to application needs to satisfy such conditions.
  • Node based planning may be set up so that a dialogue (T-AOG) associated with a node conditioned on some reward criterion in an S-AOG may not be scheduled until the reward condition associated with the node is met.
  • the only nodes that are not conditioned, and thus can be scheduled, are 310-4 and 310-5.
  • the rewards associated therewith (Ra and Rs) may be continuously updated and propagated to nodes 310-2 and 310-3 so that Rm and Rd get updated as well in accordance with Fm and Fd.
  • eventually, the rewards Ra and Rs become low enough that no dialogue associated with nodes 310-4 and 310-5 needs to be scheduled.
  • the low Ra and Rs may be plugged into Fm and Fd so that the conditions associated with nodes 310-2 and 310-3 may now be met, making 310-2 and 310-3 active; Rm and Rd may now become high enough that these nodes are ready to be chosen to carry out dialogues on the topics of multiplication and division.
  • T-AOGs associated therewith may be used to initiate the dialogues for teaching the respective concepts.
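As an illustration of this reward-conditioned gating, a minimal sketch follows; the form of Fd and the thresholds are assumptions, since the text only requires that the condition combine the node's own reward with the prerequisite rewards Ra and Rs:

```python
def division_condition(r_d, r_a, r_s, high=0.7, low=0.3):
    """Hypothetical Fd: node 310-3 ("Division") becomes schedulable only when
    its own reward Rd is still high (concept not yet mastered) and the
    prerequisite rewards Ra ("Add") and Rs ("Subtract") are low (prerequisites
    already mastered). Thresholds are illustrative."""
    return r_d >= high and r_a <= low and r_s <= low

def schedulable_nodes(rewards, conditions):
    """Return the S-AOG nodes whose reward conditions are currently met."""
    return [node for node, cond in conditions.items() if cond(rewards)]

rewards = {"add": 0.1, "subtract": 0.2, "division": 0.9}
conditions = {
    "division": lambda r: division_condition(r["division"], r["add"], r["subtract"]),
}
print(schedulable_nodes(rewards, conditions))  # ['division']
```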
  • the state based rewards associated with nodes in an S-AOG may be utilized to dynamically control how to traverse among the nodes in the S-AOG in a personalized way, e.g., adaptively based on a situation relevant to each individual. That is, in actual dialogues with different users, the traversal can be controlled adaptively in a personalized manner based on observations of the actual dialogue situations. For example, in Fig. 8A, node 310-4 on "Add" is the darkest, signifying, e.g., the lowest reward value, which may indicate that a user has already mastered the concept of "Add."
  • Node 310-5 on "Subtract" has a reward value in-between, indicating, e.g., that the user has not yet mastered the concept but is close.
  • Nodes 310-1, 310-2, and 310-3 are bright, indicating, e.g., high reward values representing that the user has not yet mastered the corresponding concepts.
  • Path related or path based rewards associated with paths in T-AOGs may also be dynamically computed based on observations of actual dialogues and may also be used for adapting how to traverse a T-AOG (how to select branches) during a dialogue.
  • Fig. 8B illustrates an example of utility driven path planning with respect to T-AOGs, in accordance with an embodiment of the present teaching.
  • a robot agent needs to determine how to respond.
  • the dialogue has traversed a parse graph pg1:t with traversed states s1, s2, ..., st.
  • a look-ahead operation may be performed based on the path based rewards along alternative paths. For example, to look ahead one step, the rewards associated with alternative branches stemming from st (one step further) may be considered and a branch that represents the best path based reward may be selected. To look ahead two steps, rewards associated with each of the first set of alternative branches stemming from st as well as rewards associated with every one of the secondary alternative branches (stemming from each of the first set of alternative branches) are considered, and a branch that leads to the best path based reward is selected as the next step. A deeper look-ahead can also be implemented based on the same principle. The example illustrated in Fig. 8B is a scheme in which two step look-ahead is implemented, i.e., at time t, the scope of the look-ahead includes multiple paths at t+1 as well as each of multiple paths at t+2 stemming from each path at t+1. A branch is then selected via look-ahead to optimize the path based reward.
  • the path based rewards associated with branches may first be initialized and then updated during a dialogue.
  • initial path based rewards may be computed based on prior dialogues of the user.
  • such initial path based rewards may also be computed based on prior dialogues of multiple users who are similarly situated.
  • Each path based reward may then be dynamically updated with time during a dialogue based on how each branch choice leads up to a satisfaction as to the intended purpose of a dialogue.
  • a look-ahead optimization scheme may be driven by the utilities (or preferences) of each user as to how to proceed with a conversation. Thus, it enables adaptive path planning.
  • a* is the optimally selected path given multiple branch selections a, the current state st, and the parse graph pg1:t; EU is the expected utility of a branch choice a; and R(st+1, a) represents the reward of choice a at state st+1.
  • the optimization is recursive which allows look-ahead at any depth.
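A minimal sketch of such recursive look-ahead, assuming a* maximizes the expected utility EU(a) with rewards accumulated along the best successor branches; the dictionary encoding of the graph is an illustrative assumption:

```python
def expected_utility(state, action, rewards, transitions, depth):
    """EU of choosing branch `action` at `state`, looking ahead `depth` steps.
    rewards[(state, action)] plays the role of R(s_{t+1}, a); transitions maps
    (state, action) -> successor state. Both structures are illustrative."""
    eu = rewards[(state, action)]
    if depth > 1:
        nxt = transitions[(state, action)]
        future = [expected_utility(nxt, a, rewards, transitions, depth - 1)
                  for (s, a) in transitions if s == nxt]
        if future:
            eu += max(future)  # follow the best branch at the next node
    return eu

def best_branch(state, rewards, transitions, depth=2):
    """a* = argmax_a EU(a); depth=2 mirrors the two-step look-ahead of Fig. 8B."""
    choices = [a for (s, a) in transitions if s == state]
    return max(choices,
               key=lambda a: expected_utility(state, a, rewards, transitions, depth))

transitions = {("s_t", "a1"): "s1", ("s_t", "a2"): "s2",
               ("s1", "a3"): "s3", ("s2", "a4"): "s4"}
rewards = {("s_t", "a1"): 0.2, ("s_t", "a2"): 0.5,
           ("s1", "a3"): 0.9, ("s2", "a4"): 0.1}
print(best_branch("s_t", rewards, transitions))  # 'a1': 0.2 + 0.9 beats 'a2': 0.5 + 0.1
```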
  • the dialogue system 100 in accordance with the present teaching is capable of dynamically controlling conversations with a user based on past accumulated knowledge about the user as well as on-the-fly observations of the user in connection with the intended purposes of the underlying dialogues.
  • Fig. 8C illustrates the use of utility-driven dialogue management of a dialogue with a student user based on a combination of node and path planning, in accordance with an embodiment of the present teaching. That is, in a dialogue with a student user, a dialogue agent conducts the dialogue with the user via utility-driven dialogue management based on dynamic node and path selection based on parameterized AOGs.
  • S-AOG 310 comprises different nodes for respective concepts to be taught with annotated rewards and/or conditions.
  • the rewards associated with the nodes may be determined previously based on knowledge about the user. For example, as illustrated, four nodes (310-2, 310-3, 310-4, and 310-5) may have lower rewards (represented as darker nodes), e.g., indicating that the student user already mastered the concepts of add, subtract, multiply, and divide.
  • There is one node with a high reward for teaching, i.e., a dialogue may be scheduled for it.
  • Selecting one of the S-AOG nodes to proceed with a dialogue is therefore reward-driven or utility-driven node planning.
  • Node 310-1 is shown to be associated with one or more T-AOGs 320, each corresponding to a dialogue policy governing a dialogue to teach the student the concept of “Fraction.”
  • One of the T-AOGs, e.g., 320-1, may be selected to govern a dialogue session. T-AOG 320-1 comprises various steps such as 330, 340, 350, 360, 370, 380, .... T-AOG 320-1 may be parameterized with, e.g., path based rewards.
  • the path based rewards may be used to dynamically conduct path planning to optimize the likelihood of achieving the goal of teaching the student to master the concept of "Fraction."
  • the highlighted nodes in 320-1 correspond to a path selected based on path planning, forming a parse graph, representing the dynamic traverse based on the actual dialogue. This is to illustrate that knowledge tracing during a dialogue enables the dialogue system 100 to continuously update the parameters in the parameterized AOGs to reflect the utilities/preferences learned from the dialogues and such learned utilities/preferences in turn enables the dialogue system 100 to adapt its path planning, making the dialogues more effective, engaging, and flexible.
  • a T-AOG may be created with parameterized content associated with different nodes.
  • the parameterized content associated with each node represents what is expected to be said/heard during a dialogue.
  • the more alternative content is included in the parameterized content associated with each node, the more capably a parameterized T-AOG supports flexible dialogues.
  • Such alternative content for each node may be authored either manually, semi-automatically, or automatically.
  • Such authored alternative content may also be used as training data to facilitate effective recognition to improve the adaptive capability of the dialogue management.
  • Figs. 9A - 9B show a scheme of enhancing spoken language understanding in human machine dialogues by automated enrichment of parameterized content for AOGs, in accordance with an embodiment of the present teaching.
  • FIG. 9A presents a T-AOG as seen in Fig. 6F except that each node in Fig. 9A is now associated with one or more alternative content sets that a dialogue manager can use in conducting a dialogue.
  • alternative parameterized dialogue content sets may also be used to train ASR and/or NLU models to understand the utterances to be spoken.
  • a training data set associated with a node represents the parameterized content of the node.
  • node 665 is associated with two training data sets, [TN] and [To], wherein the former is for numbers (X and Y can be the data items included therein) and the latter is for objects o1 or o2 (e.g., apple, orange, pear, etc.);
  • node 667 is associated with training data set [Ti] for inquiry sentences;
  • node 670-1 is associated with training data set [TNKA] for not-know answers from a user;
  • node 675-1 is associated with training data set [TN] for a user’s answer with a number;
  • node 670-2 is associated with training data set [TRNCA] for responses to a not-correct answer (which is either alternative responses to not-know answer or alternative responses to an incorrect answer);
  • node 675-2 is associated with training data set [TRCA] for responses to a correct answer;
  • Fig. 9B illustrates exemplary training data sets associated with different nodes of T-AOG in Fig. 9A, in accordance with an embodiment of the present teaching.
  • [TN] may be a set of any single or multiple-digit numbers
  • [To] may include names of alternative objects
  • [Ti] may include alternative ways to ask for a total of two numbers
  • [TNKA] may include alternative ways to say "I don't know" by a user
  • [TRC] may include alternative ways to respond to a correct answer
  • [TRNK] may include alternative ways to respond to a not-known answer from a user
  • [TRM] may include alternative ways to respond to a mistake answer from a user
  • [TRG] may include alternative ways to respond to a guess answer from a user
  • [TRI] may include alternative ways to respond to an incorrect answer from a user which can include alternative responses to a mistake answer [TRM] or alternative responses to a guessed answer [TRG]
  • [TRNCA] may include alternative ways to respond to a not-correct answer, which may include the alternative responses to a not-known answer [TRNK] or the alternative responses to an incorrect answer [TRI]
  • the training data sets for responses may be used for generating robot agent’s responses to a user’s answer. This may be in addition to being used for ASR/NLU to understand an utterance.
  • Such enriched training data sets associated with different nodes in a T-AOG may significantly improve both the capability of a robot agent to understand different ways to express the same answers from a user and the flexibility to generate a response to a user in different situations.
  • the enriched training data sets may be automatically generated in a bootstrapped manner.
  • Fig. 9C illustrates exemplary types of language variations (alternatives) to generate enriched training data for nodes with parameterized content, in accordance with an embodiment of the present teaching.
  • Language variations may be due to different spoken languages, alternative expressions, ..., different accents, or even combinations with specified acoustic characteristics (e.g., pitch, volume, speed, etc.).
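As an illustration of bootstrapping such variations, a minimal sketch that expands a node's authored content with templated alternatives; the template list is a hypothetical stand-in for real paraphrase or translation models:

```python
# Hypothetical paraphrase templates keyed by the kind of node content
PARAPHRASE_TEMPLATES = {
    "ask_total": [
        "how many {noun} are there?",
        "can you tell me the total number of {noun}?",
        "what do {x} {noun} plus {y} {noun} make?",
    ],
}

def enrich_node_content(kind, slots):
    """Expand one authored utterance into a set of alternative utterances
    by instantiating every template of the given kind with the slots."""
    return [t.format(**slots) for t in PARAPHRASE_TEMPLATES[kind]]

# e.g., enrich the inquiry node of the "Add" T-AOG (node 667 in Fig. 9A)
variants = enrich_node_content("ask_total", {"noun": "fruits", "x": 3, "y": 4})
print(variants)
```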
  • Fig. 10A depicts an exemplary high level system diagram 1000 for automatically generating enriched AOG content and use thereof for training ASR/NLU models to achieve enhanced performance, in accordance with an embodiment of the present teaching.
  • the system 1000 is provided for generating enriched training data sets for different nodes in a T-AOG and then training ASR and NLU models based on such enriched training data sets to obtain enhanced ASR and NLU models.
  • the system 1000 comprises a parameterized AOG retriever 1010, a parameterized AOG training data generator 1020, an ASR model training engine 1035, and an NLU model training engine 1040.
  • AOGs are used to represent different aspects in managing a machine human dialogue.
  • Each S-AOG may comprise a set of nodes each relating to a specific subject matter on a topic.
  • Each S-AOG node of a specific topic may be associated with one or more T-AOGs, each of which corresponds to a dialogue policy on the topic and includes back and forth conversation between a user and a robot agent on the topic.
  • a T-AOG may also have multiple nodes, each of which may represent one or more alternative content items uttered by a participating party in the dialogue (agent or user).
  • a T-AOG with multiple nodes may be created with parameterized content associated with each of the nodes.
  • the parameterized content of a node may be initially populated with some authored content generated when the T-AOG is created.
  • Such initial parameterized content may be expanded or enriched.
  • the goal is to enrich the parameterized content associated with each node to include other alternatives so that the conversation may be conducted in a more flexible and enriched manner.
  • the initial parameterized content associated with a T-AOG node may be used as the basis to generate a set of enriched parameterized content, which is then used as training data to train ASR and/or NLU models to enhance the capability of the robot agent to carry on a more flexible dialogue.
  • Fig. 10B is a flowchart of an exemplary process for obtaining enhanced ASR/NLU models based on automatically generated enriched AOG content, in accordance with an embodiment of the present teaching.
  • the parameterized AOG retriever 1010 first selects, from storage 1005, topic based templates (AOGs), e.g., an S-AOG at 1055, and then retrieves accordingly, at 1060, a T-AOG associated with each node in the selected S-AOG with initial sets of parameterized content associated therewith. Such retrieved parameterized content of the T-AOG nodes is then sent to the parameterized AOG training data generator 1020 for generating enriched parameterized content.
  • the parameterized AOG training data generator 1020 accesses, at 1065, the authored content associated with each T-AOG node and uses it as the basis of generating the enriched content.
  • the parameterized AOG training data generator 1020 accesses the models in 1015 for known language variations and generates, at 1070, enriched training data based on initial authored content for each T-AOG node.
  • Such generated enriched training data are then used to create enriched parameterized AOGs which are stored in the enriched parameterized AOGs storage 1025.
  • the enriched parameterized content sets serve as the enriched training data and are saved in an enriched training data storage 1030.
  • the ASR model training engine 1035 trains, at 1075, ASR models 1045 using the training data 1030 bootstrapped from the initial authored content based on language variation models 1015.
  • language variations may be modeled in terms of both speech style (language, accent, etc.) and speech content (different ways of saying the same thing).
  • the ASR models 1045 obtained based on the enriched training data can then be used by an ASR to recognize a user’s utterances of speech content in speech styles that are captured within the enriched parameterized content sets.
  • the derived ASR models are then stored, at 1080, in storage 1045 for future use by the ASR engine 130 (Fig. 1A).
  • the NLU model training engine 1040 trains, at 1085, NLU models 1050 using the training data 1030 bootstrapped from the initial authored content based on language variation models 1015.
  • language variations may be modeled in terms of both speech style (language, accent, etc.) and speech content (different ways of saying the same thing).
  • the NLU models 1050 obtained based on the enriched training data 1030 can then be used by the NLU engine 140 (Fig. 1A) to understand the meaning of a user’s speech based on ASR result from the ASR engine 130 (Fig. 1A).
  • the derived NLU models are then stored, at 1090, in storage 1050 for future use by the NLU engine 140.
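As an illustration of this training step, a minimal sketch that fits a bag-of-words intent classifier over enriched utterance sets; scikit-learn and the tiny label set are assumed choices, as the text does not name a model family:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Enriched training data: (utterance, node label) pairs bootstrapped from the
# authored content of T-AOG nodes (labels are illustrative)
training_data = [
    ("i don't know", "TNKA"),
    ("no idea", "TNKA"),
    ("seven", "TN"),
    ("the answer is seven", "TN"),
    ("how many fruits are there", "TI"),
]

texts, labels = zip(*training_data)
nlu_model = make_pipeline(CountVectorizer(), MultinomialNB())
nlu_model.fit(texts, labels)

# Map a new user utterance to the T-AOG node category it answers
print(nlu_model.predict(["i do not know"]))  # ['TNKA']
```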
  • By developing enriched parameterized content sets associated with the nodes of a T-AOG, an augmented dialogue policy is obtained.
  • Because enriched parameterized content sets can be automatically generated (without manual human effort) and used to further enhance machine based speech recognition and understanding, they lead to more effective and efficient human machine dialogues.
  • In addition to enhancing human machine dialogue by automatically generating an augmented dialogue policy, the process may be further enhanced by exploiting multimodal contextual information in spoken language understanding.
  • Fig. 11 A depicts an exemplary high level system diagram for context aware spoken language understanding based on surround knowledge tracked during a dialogue, in accordance with an embodiment of the present teaching.
  • spoken language understanding includes both automated speech recognition (which recognizes the words uttered) and natural language understanding (which understands the semantics of the speech based on the words uttered).
  • spoken language understanding may utilize speech contextual information to understand the semantics of an utterance.
  • Such traditional contextual information utilizes linguistic context, e.g., words/phrases uttered before or after. For example, in the sentence "Bob bought a bike and he went out to have a ride," "he" means Bob (semantics), and such semantic ambiguity is resolved based on the linguistic context.
  • sensors of different types may be deployed at a dialogue scene to continuously monitor the surrounding, gather relevant sensor data, extract features, estimate characteristics related to different events, activities, determine spatial relationships among objects, and store such knowledge learned via observations.
  • Such continuously acquired multimodal contextual information may then be stored in different databases in the information state 110 (e.g., the dialogue history database 250, rich media dialogue context database 260, event-centric knowledge database 270) and exploited during ASR and/or NLU processes to improve the human machine interactions/dialogues.
  • a spoken language understanding unit 1100 includes the ASR engine 130 and the NLU engine 140.
  • the ASR engine 130 operates to perform speech recognition to generate textual words based on, e.g., vocabulary 1120, the ASR models 1045, and various types of contextual information, including the dialogue history 250, rich media dialogue context 260, surround knowledge 270, ..., and user profile 290.
  • the ASR engine 130 takes audio input (utterances of users) and performs audio processing to recognize the words uttered. In recognizing the words spoken, contextual information acquired in different modalities may also be explored to assist the recognition.
  • contextual information about the user and his/her surroundings may be relied on, e.g., whether the user is smiling, looks sad or frustrated, is near a desk, pointing at a blackboard, or sitting on a chair, what colors the user's clothes are, etc.
  • the conventional linguistic context information may also be used to update the dialogue history 250 and/or the user profile 290.
  • the ASR engine 130 may analyze an audio input and determine, e.g., probabilities of different sequences of words, which may be modeled by the ASR models 1045.
  • the recognition may be performed by detecting phonemes (or some other linguistic units in different languages) and the sequence thereof based on vocabulary 1120 of an appropriate language.
  • the ASR models 1045 may be obtained via machine learning based on training data, which may be textual, acoustic, or both.
  • the present teaching discloses automatically generating enriched training data (by bootstrapping) from initial content associated with AOGs based on certain language variation models. Using such enriched training data to train the ASR models 1045 yields models that can facilitate speech recognition of a wider range of content with efficiency.
  • ambiguity may arise as to phonemes. While acoustic information may reach its limit to disambiguate, visual information may be used to help to ascertain, e.g., via recognizing visemes (which correspond to phonemes) based on visual observation of the mouth movement of the speaker. Based on phonemes (or visemes or both), the ASR engine 130 may then recognize words being spoken in accordance with the vocabulary 1120 that may specify not only words but also compositions of phonemes for each word. In some situations, some words may have the same pronunciation. During human machine dialogue, it may be helpful to ascertain the exact word uttered. To disambiguate in this situation, visual information may be used to achieve that.
  • For instance, "he" and "she" have similar pronunciations.
  • visual information may be exploited to see if any visual cues may be used to disambiguate. For example, a speaker may point at a person when referring to“he” or “she” in the speech and such information may be used to estimate which person the speaker is referring to.
  • the ASR engine 130 outputs one or more sequences of words estimated to be uttered by a speaker with each sequence associated with, e.g., some probabilities each of which represents a likelihood for a sequence of words recognized. Such outputs may be fed to the NLU engine 140 where the semantic meaning of a sequence of words is to be estimated. In some embodiments, multiple output sequences (candidates) may be processed by the NLU engine 140 and a most probable understanding of the speech may be selected. In some embodiments, the ASR engine 130 may select a single sequence of words that has the highest probability to be sent to the NLU engine 140 for further processing to determine the semantics. The semantics of a sequence of words may be determined based on language models 1130, the NLU models 1050, and multimodal information from different sources such as the dialogue history 250, the dialogue context 260, the event knowledge 270, and the profile of the speaker from 290.
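As an illustration of carrying multiple ASR candidates into the NLU stage, a minimal sketch that rescores an n-best list with a context score; the scoring function and values are hypothetical:

```python
def pick_best_reading(nbest, context_score):
    """nbest: list of (word_sequence, asr_probability) from the ASR engine.
    context_score: hypothetical function scoring how well a sequence fits the
    tracked multimodal context (dialogue history, scene, profile). Returns the
    candidate maximizing ASR probability x context fit."""
    return max(nbest, key=lambda cand: cand[1] * context_score(cand[0]))

nbest = [("he went out", 0.55), ("she went out", 0.45)]
# e.g., vision saw the speaker point at a man, favoring "he"
context = lambda words: 0.9 if " he " in f" {words} " else 0.4
best_words, _ = pick_best_reading(nbest, context)
print(best_words)  # he went out
```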
  • the NLU models 1050 may be trained based on authored content associated with different AOGs.
  • the authored content (text, acoustic, or both) associated with AOGs may be automatically enriched based on, e.g., language variation models, and such enriched textual training data may then be used to train the NLU models 1050 to derive models that may support a wider range of speech content.
  • the present teaching provides enhanced natural language understanding based on different types of contextual information acquired in multimodal domains. The aim is to enhance the ability of the NLU engine 140 to resolve ambiguities in different situations based on multimodal information that is relevant.
  • information stored in the dialogue history 250, rich media dialogue context 260, event knowledge 270, and profile of the speaker 290 may be used in language understanding.
  • visual information captured from the dialogue scene may be analyzed to identify cues that can be used to disambiguate.
  • the user may stand in front of an object and point at it, e.g., the user may be standing next to, e.g., a desk and pointing at, e.g., a computer on the desk.
  • a representation may be generated in the rich media dialogue context (260) that reveals that the user is pointing at a computer on a desk close to the user.
  • the NLU engine 140 may estimate that "this" in the speech means the computer on the desk and that the user wants to know what is displayed on the computer screen.
  • the rich media dialogue context may enable the NLU engine 140 to explore the information stored in the event knowledge 270 on event(s) related to the user around the time frame of his/her previous birthday. For instance, it may be recorded that last year the user went to Boston for the birthday. Such information may assist the NLU engine 140 to disambiguate the meaning of “the same thing I did last year.”
  • the event from last year may be recorded as an event or in a log in, e.g., the user file storage 290.
  • the NLU engine 140 may then reach an understanding that the user is referring to visiting Boston again this year for his/her birthday. Such understanding may then help the dialogue manager 150 (see Fig. 1) to determine how to respond to the user in the dialogue.
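A minimal sketch of this kind of event-grounded disambiguation, resolving a phrase like "the same thing I did last year" against stored event knowledge; the record layout and lookup are illustrative assumptions:

```python
from datetime import date

# Hypothetical event-centric knowledge store (database 270)
events = [
    {"user": "u1", "date": date(2019, 10, 2), "label": "birthday",
     "activity": "trip to Boston"},
    {"user": "u1", "date": date(2019, 7, 4), "label": "holiday",
     "activity": "barbecue"},
]

def resolve_same_as_last_year(user, label, today):
    """Find what the user did for the same labeled occasion one year ago, so
    'the same thing I did last year' can be grounded to a concrete activity."""
    for e in events:
        if (e["user"] == user and e["label"] == label
                and e["date"].year == today.year - 1):
            return e["activity"]
    return None

print(resolve_same_as_last_year("u1", "birthday", date(2020, 10, 2)))  # trip to Boston
```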
  • the spoken language understanding unit 1100 may enhance its ability in spoken language understanding in both ASR (recognizing words uttered) and NLU (the semantics of what was uttered).
  • the surround knowledge 270 in Fig. 11A may broadly include information of different aspects.
  • Fig. 11B illustrates exemplary types of surround knowledge to be tracked to facilitate context aware spoken language understanding, in accordance with an embodiment of the present teaching.
  • surround knowledge may include, but is not limited to, observations of the environment (e.g., whether it is a sunny day), events that occurred (in the past and in the current dialogue), objects observed in the dialogue scene and the object-to-object spatial relationships, activities observed (acoustic or visual), and/or statements made by users.
  • Such observations may be made via sensors in multiple modalities and can then be used to estimate or infer preferences or profiles of users, which can then be used by the dialogue manager 150 to determine responses according to both the inferred profiles of the users as well as the surrounding situations.
  • Fig. 12A illustrates an example of tracking a personal profile based on conversation or statements made during dialogues, in accordance with an embodiment of the present teaching.
  • In this example, a robot agent 1210 (a talking-duck agent) asks the user 1220 where he was born. The user answers that he was born in Chicago.
  • the speech based information (audio) is tracked and analyzed to extract useful profiling information.
  • the user's profile is updated with an additional graph (or sub-graph) 1230 that links a node 1230-1 representing "I" (or the user) with another node 1230-2 representing "Chicago," with the link 1230-3 annotated as "born in."
  • a user profile may be continuously updated based on the tracked audio information.
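A minimal sketch of turning such an utterance into a profile-graph update; the regular-expression matching is a hypothetical stand-in for real language understanding:

```python
import re

profile_graph = []  # list of (subject, relation, object) triples

def update_profile_from_utterance(user, utterance):
    """Very simple illustration: detect 'born in <place>' statements and add a
    (user, "born in", place) edge to the profile graph, as in graph 1230."""
    m = re.search(r"born in ([A-Z][a-zA-Z ]+)", utterance)
    if m:
        profile_graph.append((user, "born in", m.group(1).strip()))

update_profile_from_utterance("I", "I was born in Chicago")
print(profile_graph)  # [('I', 'born in', 'Chicago')]
```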
  • Fig. 12B illustrates an example of tracking personal profile during a dialogue based on visual observations, in accordance with an embodiment of the present teaching.
  • a visual sensor captures a dialogue scene as shown in 1240, which includes a boy (presumably the user engaged in the dialogue) and the objects/fixtures in the dialogue scene, e.g., a small bookcase, a lamp on the bookcase, another lamp on the floor, a window, and a hanging picture on the wall with the words "Michael Jordan."
  • Such a visual scene may be collected and analyzed so that relevant objects may be extracted, spatial relationships inferred, ..., and observations made, e.g., that the boy is gazing at the poster of Michael Jordan.
  • an estimated conclusion may be that the boy likes Michael Jordan.
  • Another estimated conclusion may be that the boy and the poster co-exist in the dialogue scene.
  • Other different factors may also impact the estimation of what the visual representation may mean. For example, if the scene is the boy’s own room, then the fact that he has such a poster on the wall may give the conclusion that he likes Michael Jordan a much higher probability. If the boy is in a scene of elsewhere such as his friend’s room or in school, then the conclusion of co-existence of the boy and the poster may weigh higher.
  • a graph representing the relationship between the observed boy and the poster may be created with a node 1250-1 representing the boy linked with another node 1250-2 representing Michael Jordan with a link 1250-3 representing the relationship between the two.
  • the link represents "likes."
  • the link may also be a "co-exist" relationship if its probability is higher.
  • link 1250-3 may also be annotated with different relationships (e.g., "likes" and "co-exist"), each of which may be associated with a probability.
  • the monitored visual information may be used to continuously update the surround information.
  • Such observation may be used by the dialogue manager 150 to determine how to carry on a conversation.
  • the dialogue manager 150 may decide to ask the user "Do you like Michael Jordan?" to improve the engagement of the user.
  • FIG. 12C shows an exemplary partial personal profile updated during a dialogue based on multimodal input information acquired from the dialogue scene, in accordance with an embodiment of the present teaching.
  • the integrated graph representation 1260 is a combination of the graph 1230 in Fig. 12A, which is derived based on tracked audio information, and the graph 1250 in Fig. 12B, which is derived based on tracked visual information.
  • users may be classified into different groups based on tracked information and characteristics associated with each group may also be continually established based on tracked information from users in such groups. As will be discussed below, such group characteristics may be used by a dialogue manager to adaptively adjust dialogue strategy to make the conversation more effective.
  • Fig. 12D shows an exemplary personal knowledge representation constructed based on a conversation, in accordance with an embodiment of the present teaching.
  • information acquired from such conversations may be used to continuously build profile knowledge about the user, which may then be relied on by the dialogue manager 150 to guide the dialogue.
  • For instance, in a user profile, the birthday of the user may be established. If, in conversations, the user mentions certain trips on his birthdays in different years, such knowledge may be represented in a graph as shown in Fig. 12D. In this example, on the birthday in 2016, the user took a cruise to Alaska and on the same day in 2018, the user flew to Las Vegas for the birthday.
  • Such knowledge may be explicitly conveyed by the user or may be inferred by the system from the conversation, e.g., the user mentioned the dates for the trip without expressly indicating that those trips were for the birthday celebration. With the knowledge of the user’s birthday, the system may infer that those trips were taken to celebrate the user’s birthday.
  • such knowledge may be represented as a graph 1270, in which the user is represented as an entity "I" 1270-1 that is linked to different pieces of knowledge about the user. For instance, via a link "Birthday" 1270-2, it points to a representation of the birthday 1270-3 with the user's specific date of birth, "10/2/1988." Based on the two trips mentioned by the user, two destination representations 1270-5 and 1270-7 are provided, with 1270-7 representing Alaska and 1270-5 representing Las Vegas. To link the user to these two destinations, there are two links 1270-4 and 1270-8, each with an annotation of the corresponding travel period on the link.
  • additional representation may be offered to link the user’s birthday to these two trips.
  • two additional links 1270-6 and 1270-9 may be provided that link from birthday representation 1270-3 to the two destination representations 1270-5 and 1270-7.
  • the year of the trip is indicated with an annotation of the computed age of the user at which the trip was taken.
  • From this representation, one can identify the user's birthday and some events associated with the user's birthday, i.e., the user took two trips on his/her birthday, one to Alaska on a cruise in 2016 and the other to Las Vegas by plane in 2018, when the user was 28 and 30 years old (computed from the birthday and the years of the trips), respectively.
  • such a representation of tracked knowledge about the user may later be utilized by the robot agent to determine what to say in certain situations. For example, if the robot agent is talking to the user on a day and notes that it is the user's birthday, the robot agent may greet the user with "Happy Birthday!" In addition, to further engage the user, the robot agent may say "You went to Las Vegas last year on your birthday. Where do you plan to go this year?" Such personalized and context aware conversation may enhance the engagement of the user and improve the affinity between the user and the robot agent.
  • Fig. 13 A illustrates exemplary user groups formed based on observations about users to facilitate adaptive dialogue management, in accordance with an embodiment of the present teaching.
  • the user groups in Fig. 13 A are for illustration rather than limitation and any other groupings may also be added and possible.
  • users may be classified into different groups such as social groups (e.g., Facebook chat or interest groups), ethnic groups (e.g., Asian groups or Hispanic groups), age groups which may include youth groups (e.g., toddler groups or teens group) and adult groups (e.g., senior group and working professional groups), gender groups, ... , and possibly various professional groups.
  • Each group of users may share some common characteristics which may be used to construct group profiles.
  • Such group profiles with characteristics shared by the group members may be relevant to dialogue planning or control.
  • users in some sub-group under, e.g., the Asian groups may share some common characteristics such as accent in speaking English.
  • Such common characteristics of a group of users may be used in, e.g., dialogue planning.
  • understanding that a user belongs to a particular ethnic group with a commonly known accent may facilitate the dialogue manager 150 to plan the tutoring session with certain measures designed to address accent-related issues to develop correct pronunciation.
  • the dialogue manager 150 may explore some known common popular interests of teens when the dialogue manager 150 desires to better engage a teen user in a dialogue.
  • the dialogue system 100 may also accumulate knowledge or profile of individual users based on on-the-fly observations from dialogues or information about each user from other sources.
  • Figs. 12A-12C show some examples of building some aspects of a user's profile based on multimodal information collected during a dialogue.
  • Fig. 13B illustrates an exemplary content/construct of an individualized user profile, in accordance with an embodiment of the present teaching.
  • a personal profile of a user may include demographic information of the user (e.g., ethnicity, age, gender, etc.), known preferences, languages (e.g., native language, second language, etc.), learning schedule (e.g., tutoring sessions on different topics), ...
  • Some of the profile information may be declared (e.g., demographics) and some may be observed or learned via communications. For instance, with respect to information related to the user’s second language, e.g., English, information about his/her proficiency level, the accent, ..., etc. may be recorded and can be used by, e.g., the dialogue manager 150, to plan or control a dialogue. Details related to some of the personal characteristics may also be collected and recorded.
  • Also shown in Fig. 13B are detailed feature descriptions of a user's accent in a second language.
  • the accent may be described in terms of both acoustic features (e.g., acoustic characterization of the user’s pronunciation) and viseme features (e.g., visual characterization of user’s mouth movement when speaking).
  • Fig. 13C shows an exemplary visualization of accent profile distribution related to different ethnic groups, in accordance with an embodiment of the present teaching.
  • Accent may be characterized using a set of features. For example, visemes may be used to represent accent-related features, i.e., characterizing accent using visual features related to mouth movements. Accent may also be characterized based on acoustic features.
  • A person's accent may be represented by a vector with instantiated values of such features, and such a vector may be projected into a high dimensional space in a coordinate system as a point. What is illustrated in Fig. 13C are 3D projections of accent vectors of different ethnic groups in a 3D coordinate system.
  • each accent feature vector of a specific user is projected as a point and different circles in Fig. 13C represent boundaries of projected accent feature vectors of users in corresponding different ethnic groups.
  • Each circle includes one or more points therein and the variations in accent of different members in the same group correspond to the spread or shape of the circle.
  • a group’s accent profile may be derived based on a plurality of projection points (member accent vectors) representing the accent profiles of the group members.
  • a group accent profile vector may be derived by, e.g., averaging the member accent vectors in each dimension or, e.g., using a centroid of each group distribution as the group profile.
  • each circle in Fig. 13C has a centroid (center of the circle) representing the group accent profile, projected from, e.g., a group accent feature vector.
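  • A minimal sketch of deriving such a group accent profile is given below; the feature values and dimensionality are made up for illustration, and averaging member vectors is only one of the options mentioned above (using the centroid of the group distribution being another).

```python
import numpy as np

# Hypothetical accent feature vectors (one row per group member); columns
# could mix acoustic and viseme feature dimensions.
member_vectors = np.array([
    [0.62, 0.10, 0.33],
    [0.58, 0.14, 0.31],
    [0.66, 0.08, 0.37],
])

# Group accent profile as the per-dimension mean (centroid) of member vectors.
group_centroid = member_vectors.mean(axis=0)

# The spread of the group (cf. the circles in Fig. 13C) can be summarized by
# the largest member distance to the centroid.
group_spread = np.linalg.norm(member_vectors - group_centroid, axis=1).max()
print(group_centroid, group_spread)
```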
  • When a user is engaged in a human machine dialogue, the user may be observed and multimodal information captured from the dialogue scene to estimate different types of information to be used for an adaptive dialogue strategy. Based on the observed multimodal information, a user profile may be dynamically established and used by a robot agent to adaptively control how to converse with the user.
  • One exemplary type of information to be estimated is the user's accent. Accent information may be used to improve both dialogue quality and the effectiveness in tutoring a language.
  • Based on estimated accent information, a robot agent can adjust its language understanding capacity to ascertain the meaning of the user's utterances. If a robot agent is for teaching a user to learn a certain language, e.g., English, such estimated accent information may be explored to determine a dynamic teaching plan in an adaptive manner.
  • The group profiles as illustrated in Fig. 13C may be used to determine to which group a new user may belong.
  • point 1350 may represent an accent vector of a new user.
  • The distances between point 1350 and the distributions of different ethnic groups may be used to estimate the ethnicity of the new user.
  • the distance between point 1350 and the centroid of group A1 is 1360; the distance between point 1350 and the centroid of group A2 is 1370; the distance between point 1350 and the centroid of group A3 is 1380; and the distance between point 1350 and the centroid of group A4 is 1390.
  • the distance may also be assessed between point 1350 and a closest point of each distribution.
  • The new user may be estimated to belong to the ethnic group with the shortest distance 1370 between point 1350 and the centroid of that group, i.e., group A2 (e.g., the Japanese group).
  • With the new member admitted to the group, the group profile (centroid) may be updated as well.
  • The estimated ethnicity and the representative accent profile (i.e., the centroid of A2) of the ethnic group may then be used for adaptive tutoring planning, as sketched in the example below.
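  • The following sketch illustrates, under assumed feature values, the nearest-centroid classification just described: a new user's accent vector (point 1350) is assigned to the group with the shortest centroid distance, and that group's centroid is then updated to absorb the new member. The centroid coordinates and group size are hypothetical.

```python
import numpy as np

def classify_accent(new_vector, group_centroids):
    """Assign an accent vector to the group with the nearest centroid."""
    distances = {g: float(np.linalg.norm(new_vector - c))
                 for g, c in group_centroids.items()}
    return min(distances, key=distances.get), distances

# Hypothetical centroids for groups A1..A4 (cf. Fig. 13C).
centroids = {
    "A1": np.array([0.20, 0.70, 0.10]),
    "A2": np.array([0.60, 0.10, 0.30]),
    "A3": np.array([0.40, 0.40, 0.50]),
    "A4": np.array([0.80, 0.20, 0.90]),
}
point_1350 = np.array([0.58, 0.13, 0.34])  # new user's accent vector

group, dists = classify_accent(point_1350, centroids)
print(group)  # "A2": the shortest centroid distance wins

# Update the winning group's centroid as a running mean over n members.
n = 25  # hypothetical current size of group A2
centroids[group] = (centroids[group] * n + point_1350) / (n + 1)
```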
  • Figs. 17B - 17E show how known accent information (e.g., visemes) of a user may be used to adaptively conduct a dialogue to tutor a student to learn a language.
  • Fig. 14A depicts an exemplary high level diagram of a system for tracking individual speech related characteristics to update a user profile, in accordance with an embodiment of the present teaching.
  • a user profile may be dynamically updated based on information observed during human machine dialogues to characterize the underlying user and can then be used to guide a robot agent to adapt the dialogues with the user in accordance with the observed characteristics of the user.
  • the exemplary system diagram 1400 is directed to establishing and updating the accent information of a user with respect to a certain language (e.g., English).
  • the system 1400 receives audio/visual information captured when a user 1402 is speaking, analyzes the received audio/visual information to extract both acoustic and visual features (viseme features) related to speech, classifies the features to estimate the characteristics of the user’s speech, and updates the user profile stored in 290 accordingly.
  • the system 1400 comprises an acoustic feature extractor 1410, a viseme feature extractor 1420, an acoustic feature updater 1440, and a feature based accent group classifier 1450.
  • Fig. 14B is a flowchart of an exemplary process for system 1400 for tracking individual speech related characteristics and user profiles thereof, in accordance with an embodiment of the present teaching.
  • audio signals corresponding to the user’s utterance as well as visual signals capturing the mouth movement of the user during the speech are acquired.
  • the audio signals are analyzed, at 1415 by the acoustic feature extractor 1410, to estimate acoustic features associated with the speech of the user 1402.
  • The acoustic features are identified based on, e.g., acoustic accent profiles 1470, which characterize different accents based on acoustic characteristics derived from speech signals.
  • The received visual signals are analyzed, at 1425 by the viseme feature extractor 1420, to estimate or determine the viseme features related to the mouth movements while the user speaks.
  • The viseme features are identified based on, e.g., the visual images of the mouth movements when the user is speaking, in accordance with various language-based viseme models 1420.
  • To classify the user's accent into an appropriate accent group (see Figs. 13A - 13C), the feature based accent group classifier 1450 receives the acoustic features from the acoustic feature extractor 1410 and the viseme features from the viseme feature extractor 1420 and classifies, at 1445, the user's accent into an accent group in accordance with the language-based accent group profiles stored in 1460. Once classified, the user profile in 290 associated with the user 1402 is updated accordingly to incorporate the estimated accent classification. As discussed herein, such estimated accent information may later be used by, e.g., a robot agent, to adaptively plan a dialogue. [00180] In addition to linguistic characteristics of a user, multimodal information (250 - 290) may also be used to track events related to a dialogue.
  • An event may refer to something that occurred in the past or that is observed during a dialogue, either done or mentioned by someone in the dialogue. For example, a user in a dialogue scene may walk towards an object in the scene, or a user may mention that he is flying to Las Vegas in June to celebrate his birthday.
  • Fig. 15A provides an exemplary abstract structure representing an event, in accordance with an embodiment of the present teaching.
  • an event includes an act performed with respect to an object.
  • the act involved in an event can be described by a verb and an object can be anything that the act is directed to.
  • An act may be performed by an entity in a dialogue scene and an object may be anything in a scene that can be acted on. For instance, as illustrated in Fig. 15A, an event can be that someone walks towards an object such as a blackboard, a table, ..., or a computer; someone points at an object, such as a blackboard, a table, or a computer on a table; ...; or someone lifts up an object, etc.
  • Figs. 15B - 15C show an example of tracking event-centric knowledge during a dialogue as dialogue context based on observations from a dialogue scene, in accordance with an embodiment of the present teaching.
  • Knowledge about activities or events, whether or not they occurred during a dialogue, may enable a robot agent to enhance its performance.
  • Event-centric knowledge may be used to assist a robot agent to understand the meaning or intent of a speech from a human user. For instance, a user engaged in a dialogue with a robot agent may walk towards a blackboard, point at it, and ask, "What is this?" This is illustrated in Fig. 15B.
  • a dialogue scene can be tracked and information is analyzed to detect various objects present in a dialogue scene, the spatial relationships among such objects, the act(s) performed by a user and the impact of such acts on the objects in the scene including the dynamic change in spatial relations of the objects due to such acts, etc.
  • As illustrated, the user in the scene is observed, via visual sensor(s), to walk towards a blackboard, raise his hand, and point at the blackboard when the user utters "What is this?"
  • Such observations may be used to dynamically construct an event (that the user walks towards and points at a blackboard in the scene) with an exemplary representation as shown in Fig. 15C to describe the event visually observed in the dialogue scene.
  • the event includes two acts, one (Act 1) being walking towards a blackboard and the other (Act 2) being pointing at the blackboard.
  • Each act involved may be annotated with a time T representing a time when the act is carried out (see T1 for Act 1 and T2 for Act 2).
  • The sequence of acts in an event may be associated with a timeline associated with different acts. For instance, a user may first walk towards (Act 1 at T1) a blackboard and then point at (Act 2 at T2) the blackboard at the time the user utters "What is this?" The timing of each act may be compared with the timing of an utterance and time correspondences may also assist to understand the meaning of the utterance.
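  • As a hedged sketch of how such an event representation might be encoded (the class and field names are invented for illustration, not part of the present teaching), the acts of Fig. 15C can be held as time-stamped entries that are easy to order and to align against the timing of an utterance:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Act:
    """One act within an event: a verb directed at an object at time t."""
    verb: str    # e.g., "walk towards", "point at"
    obj: str     # e.g., "blackboard"
    t: float     # seconds into the dialogue when the act is observed

@dataclass
class Event:
    """An event as a set of acts by an actor, cf. Figs. 15A and 15C."""
    actor: str
    acts: List[Act] = field(default_factory=list)

    def ordered_acts(self) -> List[Act]:
        return sorted(self.acts, key=lambda a: a.t)

# The blackboard example: Act 1 at T1, Act 2 at T2.
event = Event(actor="user", acts=[
    Act("walk towards", "blackboard", t=12.0),  # T1
    Act("point at", "blackboard", t=15.5),      # T2
])
print([a.verb for a in event.ordered_acts()])
```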
  • Information gathered in multiple modalities facilitates tracking/learning knowledge about what is happening in a dialogue scene, which can be described in a dynamically constructed event representation.
  • Such dynamically constructed event knowledge may play an essential role to help a robot agent to estimate the semantics (by NLU) of a sequence of words recognized (by ASR) when traditional language models and context cannot effectively resolve certain ambiguities.
  • In this scenario, the robot agent can at least conclude that the user refers to the blackboard by "this," meaning either what is on the blackboard or what a blackboard is.
  • In some situations, the robot agent may be able to further narrow down the intent of the user. For example, if a user is pointing at a computer screen and asks "What is this?" and the robot agent just previously displayed a picture to the user, then the robot agent may determine an appropriate response to the user's question, e.g., asking "Do you mean the picture I just displayed on the computer?" So, rich media contextual information, including visually observed events, acts, etc., may assist a robot agent to devise adaptive dialogue strategies in order to continuously engage a user in a human machine dialogue session.
  • Figs. 15D - 15E show another example of tracking multimodal information to identify event-centric knowledge to facilitate spoken language understanding in human machine dialogue, in accordance with an embodiment of the present teaching.
  • For example, a user in a dialogue may say "I like to try this."
  • While ASR may process the utterance and recognize the words "I," "like," "to," "try," and "this," without more non-traditional contextual information a robot agent is not able to determine what the user refers to by "this."
  • the visual information simultaneously monitored at the dialogue scene may be analyzed and an event detected while the user utters the words may be used by the robot agent to estimate the meaning of the utterance.
  • The user uttering these words may be observed, via visual means, to reach out (1st act) to a laptop on a desk, open (2nd act) the laptop, and start to type (3rd act) on the keyboard.
  • Such event knowledge learned from visual information may be used to build an event representation as shown in Fig. 15E, where the event is represented to include three acts (reach out, open, and type), each of which is associated with a timing (T1, T2, and T3, respectively) and an object to which each act is directed.
  • the sequence of acts in an event may be identified according to the order of the times associated therewith.
  • Based on such an event representation, the robot agent may infer that by "this" the user likely is referring to something that the user likes to do on the laptop and, in response, the robot agent may devise a response strategy by asking the user "What would you like to try on the computer?" to stay relevant to what the user is doing/thinking and, hence, to achieve better engagement of the user.
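  • One deliberately simple heuristic consistent with this discussion (a sketch, not the disclosed method itself) is to resolve a deictic "this" to the object of the act closest in time to, and not later than, the utterance:

```python
# Acts observed around the utterance, as (time, verb, object) tuples.
acts = [
    (20.0, "reach out to", "laptop"),  # T1
    (22.5, "open", "laptop"),          # T2
    (25.0, "type on", "keyboard"),     # T3
]
utterance_time = 25.8  # when "I like to try this." is heard

def resolve_this(acts, t_utter, window=5.0):
    """Guess the referent of 'this' as the object of the most recent act
    occurring no later than the utterance, within a time window."""
    candidates = [a for a in acts if 0.0 <= t_utter - a[0] <= window]
    if not candidates:
        return None
    _, _, obj = max(candidates, key=lambda a: a[0])
    return obj

print(resolve_this(acts, utterance_time))  # -> "keyboard" (part of the laptop)
```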
  • As discussed herein, the robot agent conducts a dialogue aiming at achieving personalized, context aware human machine dialogues. It is personalized because the robot agent according to the present teaching utilizes user profile information and dynamic personal information updates (examples shown in Figs. 12A - 12C) to understand and respond to (described below) a user. It is context aware because it utilizes event knowledge dynamically estimated via information acquired in different modalities, either in real-time (examples shown in Figs. 15A - 15E) or previously established, to facilitate a robot agent to understand a dialogue context.
  • FIG. 16A depicts an exemplary high level system diagram for personalized context aware dialogue management (PCADM), in accordance with an embodiment of the present teaching.
  • The information state 110 is centered around the PCADM and includes personalized information as well as rich media context information established/updated based on sensor information across different modalities.
  • the PCADM comprises multimodal information processing components, knowledge tracking/update components, components for estimating/updating minds of different parties, the dialogue manager 150, and components responsible for generating deliverable responses (determined by the dialogue manager 150), all in a personalized and context aware manner based on various information dynamically updated in the information state 110.
  • the multimodal information processing components include, e.g., the SLU engine 1100 (spoken language understanding, including both ASR 130 and NLU engine 140, see Fig. 11), a visual information recognizer 1600, and an ego motion detector 1610.
  • the knowledge update components may include, e.g., a surround knowledge tracker 1620, a user profile update engine 1630, and a personalized common sense updater 1670.
  • Components for estimating/updating the minds of different parties include, e.g., an agent mind update engine 1660, a shared mind monitoring engine 1640, and a user's mind estimation engine 1650.
  • the components for generating deliverable responses include, e.g., the NLG engine 160 and the TTS engine 170 (see Fig. 1).
  • the dialogue manager 150 manages a dialogue based on speech understanding of a user’s utterance from the SLU engine 1100, a relevant dialogue tree (e.g., a specific T-AOG governing an underlying dialogue), as well as different types of data from the information state 110 (e.g., user profile, dialogue history, rich media context, different minds estimated, event knowledge, ... , common sense), which enables the dialogue manager 150 to determine a response to the user’s utterance in a personalized and context aware manner.
  • the information state 110 is dynamically established/updated by various components, such as 1620, 1630, 1640, 1650, 1660, and 1670.
  • Upon receiving signals related to the user's speech (which may include both audio and visual signals capturing, e.g., mouth movement), the SLU engine 1100 performs spoken language understanding.
  • information from the information state 110 may be explored during both speech recognition (for ascertaining words spoken) and speech understanding (for knowing the meaning of the utterance).
  • event knowledge observed with respect to the user in the dialogue scene may also be used to resolve certain ambiguities.
  • Such event knowledge may be derived by analyzing visual input by the visual information recognizer 1600 and the surround knowledge tracker 1620, and the event representation may then be stored in the information state 110 and accessed by the SLU engine 1100 in understanding the semantics of the user's utterances.
  • Visual and other types of observation may also be monitored and analyzed to derive, either alone or in combination with audio information, different contextual information. For example, birds’ chirping (sound) and green trees (visual) may be combined to infer that it is an outdoor scene. Ego movement of the user detected via, e.g., haptic information may be combined with visual information to infer changes in spatial relationship between the user and objects present in the scene. Facial expression of the user may also be recognized from visual information and may be used to estimate the emotion or an intent of the user. Such estimation may also be stored in the information state 110 and then used, e.g., by the user’s mind estimation engine 1650 to estimate the user’s mind or by the dialogue manager 150 to determine what to do in order to continue to engage the user in the dialogue.
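  • A toy rule-based fusion of such multimodal cues might look as follows; the detection labels, confidences, and rules are invented for illustration, and a deployed system would presumably use learned models rather than hand-written rules.

```python
# Hypothetical per-modality detections with confidences.
observations = {
    "audio":  [("birds_chirping", 0.8)],
    "visual": [("green_trees", 0.9), ("smiling_face", 0.7)],
    "haptic": [("ego_motion_forward", 0.6)],
}

# If every cue of a rule is detected, record the inferred context update.
rules = [
    ({"birds_chirping", "green_trees"}, "scene=outdoor"),
    ({"smiling_face"}, "user_emotion=positive"),
    ({"ego_motion_forward", "green_trees"}, "user_approaching_object"),
]

detected = {label
            for cues in observations.values()
            for label, conf in cues if conf > 0.5}
context_updates = [inference for cues, inference in rules if cues <= detected]
print(context_updates)  # e.g., ['scene=outdoor', 'user_emotion=positive', ...]
```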
  • A dialogue between a robot agent and a user may be driven by AOGs (or the agent's mind), representing an intended topic with an anticipated conversation flow with certain authored dialogue content.
  • During the dialogue, a certain dialogue path in the AOGs may be identified, by the shared mind monitoring engine 1640, and information relevant to a shared mindset may be estimated and used to update the recorded shared mindset representation in the information state 110.
  • the shared mind monitoring engine 1640 may also utilize rich media context developed based on multimodal sensory input.
  • the user’s mind estimation engine 1650 may further estimate the user’s mind and then update the recorded user’s mind in the information state 110.
  • the dialogue manager 150 determines a personalized and context aware response to the user based on the updated information state 110.
  • the response may be determined based on an understanding of what the user said (semantics), the user’s emotional or mindset state, the user’s intent, the user’s preferences, the dialogue policy as dictated by relevant AOGs, and an estimated level of engagement of the user.
  • The robot agent may also further personalize the response by generating personalized content for the response in a way that is adapted to a particular user. For instance, as discussed with respect to Figs. 9A - 9B, a response may be from a categorized subject (e.g., a response to an incorrect answer) which may have parameterized content. Given that, a specific piece of content associated with the parameterized content may be selected for a certain user based on knowledge about the user, e.g., the user's preferences or the emotional state of the user. This is achieved by the NLG engine 160. More details related to the NLG engine 160 and the TTS engine 170 are provided with reference to Figs. 16C - 16D.
  • Fig. 16B is a flowchart of an exemplary process for personalized context aware dialogue management, in accordance with an embodiment of the present teaching.
  • Different components in Fig. 16A analyze, at 1615, information in various domains. For instance, acoustic signals related to a user's utterance and/or environmental sound may be received; audio signals may be analyzed by the SLU engine 1100 to understand what the user said. Visual signals may also be received that capture the user's physical appearance and movement (e.g., mouth and body movement) and may be analyzed by the visual info recognizer 1600 to detect, e.g., different objects, facial features of the user, body movements, etc.
  • Such multimodal data analysis results from the SLU engine 1100 and the visual info recognizer 1600 may then be utilized by other components to derive higher levels of understanding of the surroundings, preferences, and/or emotions.
  • the surround knowledge tracker 1620 may track, at 1625, e.g., the dynamic spatial relationships among different objects, assess emotion or intent of the user, etc.
  • Such tracked surrounding situation may then be used by the user profile update engine 1630 to assess, e.g., characteristics of the user such as observed preferences, and to update, at 1635, the user profile in the information state 110 based on the observations and analysis results.
  • the rich media context stored in the information state 110 can be updated at 1645.
  • The updated user profile and rich media context may then be utilized by the SLU engine 1100 to perform, at 1655, personalized context aware spoken language understanding, which includes, but is not limited to, recognizing the words uttered by the user (e.g., based on accent information related to the user) and determining the semantics of the words uttered (e.g., based on visual or other cues revealed in other modalities).
  • The dialogue manager 150 then determines, at 1665, a response to the user that is considered appropriate given the context and the known characteristics of the user. For example, the dialogue manager 150 may determine that, as the user answered a question incorrectly, a response pointing that out is to be delivered to the user.
  • the NLG engine 160 may select one of the multiple alternative responses in the parameterized content (see Figs. 9A - 9B) associated with a node with a designated purpose to generate, at 1675, a personalized response specific for the user.
  • the selection may be done based on the personal information stored in the information state 110 (e.g., the user has a sensitive personality, has previously answered similar questions incorrectly, and currently appears to be frustrated). For example, given that a user is known to be sensitive and easily frustrated and has repeatedly made mistakes, the NLG engine 160 may generate a response that is intended to be gentle to avoid further frustrating the user.
  • The response may also be rendered, by the TTS engine 170, in a personalized and context aware manner. For instance, if the user is known to have a southern accent and is currently in a noisy environment (e.g., as specified in the information state 110), the TTS engine 170 may render, at 1685, the response with a southern accent and at a higher volume.
  • the process returns to step 1605 to handle the next round of dialogue in a personalized and context aware manner.
  • Fig. 16C depicts exemplary high level system diagrams of the NLG engine 160 and the TTS engine 170 to produce a context aware and personalized audio response, in accordance with an embodiment of the present teaching.
  • both the NLG engine 160 and the TTS engine 170 may adapt a response in accordance with tracked information stored in the information state 110.
  • the NLG engine 160 may generate a response in accordance with a text response from the dialogue manager 150 with modification or adjustment determined based on information related to the user and in accordance with the known dialogue context.
  • the TTS engine 170 may also further adapt the (already adapted) response in its rendered delivery form in a personalized and context aware manner.
  • the NLG engine 160 comprises a response initializer 1602, a response enhancer 1608, a response adapter 1612, and an adaptive response generator 1616.
  • Fig. 16D is a flowchart of an exemplary process of the NLG engine 160, in accordance with an embodiment of the present teaching.
  • The response initializer 1602 first receives, at 1632, a text response from the dialogue manager 150 and then initializes, at 1634, a response by, e.g., selecting an appropriate response from, e.g., parameterized content associated with a certain node in the dialogue policy. For instance, assume that a dialogue policy, as shown in Fig. 9A, directed to teaching a student the concept of "Add," is used to dictate a tutoring dialogue with the user.
  • the user is presented with a problem on adding X and Y.
  • The dialogue manager 150 follows the path in the dialogue policy in Fig. 9A to node 675-2 and determines that a response is to come from the parameterized content associated with node 675-2.
  • the correct answer from the user may be because the user understood the problem and answered correctly or due to a lucky guess and the dialogue manager 150 may use the parameterized content at node 675-2 as a pool from which the response is to be selected.
  • the parameterized content set for the current situation is [TRCA].
  • [TRCA] is defined to be a combination of [TRC] and [TRG], where set [TRC] is for a response to a correct answer based on correct understanding of what is taught, while set [TRG] is for a response to a correct answer provided based on a guess.
  • A determination of whether the user's answer is a guess or an actual correct answer may be based on the probabilities P(Lt+1 | obs = correct) and P(G) estimated for such possibilities as discussed herein.
  • Based on such estimated probabilities, the response initializer 1602 may determine which one of the two parameterized content sets serves as a pool from which a response may be further selected.
  • That is, a response may be selected from a pool chosen based on the probabilities as described above (see the sketch below). For instance, if it is more likely that the correct answer is provided because the user indeed has mastered the concept being taught, then the selection pool for a response is [TRC]. Once the pool is chosen, the response is to be selected from the parameterized content set [TRC].
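  • A minimal sketch of this pool selection is shown below; the pool contents and probability values are placeholders standing in for the [TRC]/[TRG] sets and the estimates of P(Lt+1 | obs = correct) and P(G) discussed above.

```python
import random

# Stand-ins for the parameterized content sets at node 675-2.
TRC = ["Great job!", "Fantastic!", "Well done. You have mastered the concept."]
TRG = ["Correct! Can you explain how you got that answer?",
       "That's right. Let's try a similar one to be sure."]

def pick_pool(p_mastered: float, p_guess: float):
    """Choose the response pool by whichever explanation of the correct
    answer is more probable: mastery ([TRC]) or a lucky guess ([TRG])."""
    return TRC if p_mastered >= p_guess else TRG

# Hypothetical probability estimates.
response = random.choice(pick_pool(p_mastered=0.75, p_guess=0.25))
print(response)
```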
  • A response may be selected or generated in a way that is grammatical (based on syntax and semantic models 1604 and 1606), knowledgeable (based on topic knowledge models 1614 and common sense models 280), humanistic (based on the user profile 290, the agent's mind 200, and the user's mind 220 estimated based on the dialogue situations), and intelligent (based on topic control in accordance with the STC-AOGs 230, the dialogue context 260, the dialogue history 250, and the event-centric knowledge 270 established based on observations of relevant dialogues).
  • the output of the response initializer 1602 may then be sent to the response enhancer 1608, which may further narrow down the choices based on, e.g., professional knowledge on the subject matter (represented by topic knowledge models 1614), common sense knowledge (represented by common sense models 280), ... , etc.
  • Relevant topic based knowledge and common sense models are retrieved at 1636 and used to enhance the response selection/generation at 1638.
  • Topic knowledge models 1614 may include knowledge graphs that represent human knowledge in certain subject matters that can be used to control response generation.
  • The commonsense models 280 may include any representations, such as subject-predicate-object triples, to model human common sense. These models may be used to ensure that the NLG engine 160 will produce sentences that make sense and are consistent with known facts.
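  • For illustration only, a naive consistency filter over such triples might reject a candidate sentence that directly contradicts a stored fact; the store, the can/cannot opposition, and the helper are all assumptions of this sketch, not the disclosed models.

```python
# A toy commonsense store of subject-predicate-object triples.
commonsense = {
    ("bird", "can", "fly"),
    ("penguin", "is_a", "bird"),
    ("penguin", "cannot", "fly"),
}

def contradicts(candidate: tuple, store: set) -> bool:
    """Flag a candidate triple whose direct negation is a stored fact."""
    subj, pred, obj = candidate
    opposite = {"can": "cannot", "cannot": "can"}.get(pred)
    return opposite is not None and (subj, opposite, obj) in store

print(contradicts(("penguin", "can", "fly"), commonsense))  # True -> filter out
```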
  • a selected response may then be sent to the response adapter 1612 where the response is adapted in a context aware and personalized manner.
  • the response adapter may operate based on multiple considerations by accessing at 1642 various types of information from, e.g., the dialogue history (250), the dialogue context (260), the event knowledge (270) previously acquired, any user preferences (290), and the estimated mindsets of the agent and the user (200 and 220) and adapting, at 1646, the response accordingly.
  • a gentle response to an incorrect answer may be generated or selected (from an existing content set).
  • For instance, recalling the user's past birthday trips, the robot agent may ask the user "Where do you plan to travel to this year on your birthday?" to better engage the user.
  • Similarly, the robot agent may ask "Do you like Legos?" to develop a more interesting and engaging conversation.
  • In general, a response adapted to the user's familiarity or liking is more likely to better engage the user.
  • the robot agent continuously updates the estimation on the agent’s mindset 200, the shared mindset 210, and the user’s mindset 220 based on what is observed.
  • A response to a user's utterance may also be adapted based on the estimated minds. For instance, if the user answers multiple questions on a math concept (e.g., fraction) correctly, the estimation on the user's mind may indicate that the user has mastered the concept. In this case, among alternative responses for responding to the last correct answer (e.g., "Great job!" "Fantastic!" ..., or "Well done. You have mastered the concept. Would you like to move to a new concept?"), the robot agent may select one that acknowledges the user's achievement and offers to move on.
  • the adapted response from the response adapter 1612 is then sent to the adaptive response generator 1616 to generate, at 1648, a context aware and personalized response.
  • the adaptive response generator 1616 generates the response in accordance with the syntax models 1604 and the semantic models 1606.
  • Such generated adaptive response is then sent, at 1652, to the TTS engine 170 to generate a rendition that is adapted to the user’s preferences.
  • Fig. 16E is a flowchart of an exemplary process for the TTS engine 170, in accordance with an embodiment of the present teaching.
  • When the adaptive response processor 1618 receives, at 1654, the adaptive response in text form from the NLG engine 160, it processes, at 1656, the received response.
  • the text response needs to be rendered into acoustic form via, e.g., text to speech conversion or TTS.
  • the adaptive TTS property analyzer 1622 retrieves, at 1658, relevant information from the user profile 290.
  • Relevant information may include, e.g., age group and/or ethnic group of the user, known accent of the user, gender, etc., which may be indicative of some preferred way to convert the text response to an audio (speech) form.
  • For instance, based on the representative accent of the user's ethnic group (e.g., the characteristic acoustic features of the centroid of a distribution representing the average accent of the ethnic group), the text response may be converted into a speech signal with the characteristic acoustic features of the ethnic group.
  • If no preference exists, the adaptive TTS property analyzer 1622 invokes the text to speech converter 1624 to convert, at 1668, the adaptive text response into speech form based on, e.g., a standard TTS configuration stored in, e.g., the TTS property configurations 1626. If any preference exists, the adaptive TTS property analyzer 1622 analyzes, at 1664, the information from the user profile to identify specific preferences and retrieves, at 1666 from the TTS property configurations 1626, specific TTS conversion configuration parameters for converting the text response to a speech form that exhibits specific speech characteristics preferred by the user.
  • the text to speech converter 1624 converts, at 1668, the received adaptive text response to an audio signal representing the speech form of the adaptive text response in a style dictated by the user’s preference.
  • the generated audio signal for the speech form of the adaptive text response is then sent, at 1672, to the robot agent for rendering (or responding to the user).
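  • The sketch below illustrates one plausible mapping from profile and context entries to TTS conversion parameters, in the spirit of the adaptive TTS property analyzer 1622 and the TTS property configurations 1626; all keys, values, and the helper function are hypothetical, not part of the disclosed configuration.

```python
# Hypothetical user profile entries and scene context.
user_profile = {"accent": "southern", "age_group": "senior"}
scene_context = {"noise_level": "high"}

# Hypothetical stand-in for the TTS property configurations 1626.
tts_property_configurations = {
    "southern": {"voice": "en-US-southern", "rate": 0.95},
    "default":  {"voice": "en-US-standard", "rate": 1.00},
}

def select_tts_config(profile, context):
    """Pick TTS parameters from user preferences and the scene context."""
    config = dict(tts_property_configurations.get(
        profile.get("accent"), tts_property_configurations["default"]))
    # Raise the volume in a noisy environment.
    config["volume_db"] = 6.0 if context.get("noise_level") == "high" else 0.0
    # Slow down slightly for senior users.
    if profile.get("age_group") == "senior":
        config["rate"] *= 0.9
    return config

print(select_tts_config(user_profile, scene_context))
```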
  • Just as spoken language understanding in a dialogue may be personalized in a context aware manner, the dialogue itself or the way to conduct the dialogue (back and forth communications between a machine and a human) may be adaptively configured based on the dynamics of the dialogue, and the communication can be delivered with stylistic choices determined in a personalized and content sensitive manner. For example, a robot agent for tutoring a student in a second language (e.g., English) can conduct the course based on what is known about the user.
  • If a student user belongs to an ethnic group generally known to have a particular accent profile, the tutoring may be conducted with that in mind, and a course plan may be generated with respect to that particular accent profile in a targeted way to overcome that accent and to make sure that the student will develop correct pronunciation.
  • a response to a dialogue user is devised in a personalized and context aware manner by tracing different types of information related to the surrounding of the dialogue based on multimodal information.
  • Another consideration in deriving a response in a dialogue is the assessment made by a robot agent for the user’s performance. Such an assessment may not only guide how to carry on a dialogue with the user but also be used to adaptively adjust the teaching plan in the course of the dialogue.
  • Fig. 17A depicts an exemplary high level diagram of a machine tutoring system 1700 for adaptive personalized tutoring via dynamic feedback, in accordance with an embodiment of the present teaching.
  • The illustrated machine tutoring system 1700 comprises a teacher 1710, which is backed up by a teaching plan execution unit 1770, a communication understanding unit 1720 for understanding what a student user 1702 said, a grader unit 1730 for assessing the performance of the student user 1702, and a teaching plan adjustment unit 1750.
  • the teaching plan for a student user may be devised based on knowledge about the user from the user profile 290. For example, if it is known that the student belongs to an ethnic group, which may have a corresponding accent profile with respect to a language to be tutored, a teaching plan may be devised for teaching the student of the language based on the known accent profile.
  • The four units in the system 1700 form a feedback loop, making it possible to continuously adapt the teaching plan during the course of the tutoring.
  • an initial teaching plan 1760 is determined based on curriculum 1740 and the user profile 290
  • the student’s communication is analyzed by the communication understanding unit 1720 and the result is sent to the grader unit 1730 to assess the performance of the user.
  • Such an assessment may be made with respect to the curriculum 1740 which specifies the expected performance, which may be evaluated in accordance with intended achievements specified in the curriculum 1740. For instance, in tutoring a student English, there may be a series of tutoring sessions, some directed to pronunciation, some to vocabulary, some to grammar, some to reading, and some to compositions.
  • Each tutoring session may be directed to a specific goal in light of the series of goals relevant to mastering English at a certain level.
  • the content may be directed to reading different words targeting at different phonemes.
  • The acoustic signals of the user's speaking the words as well as the visual information on mouth movement while speaking such words may be recorded, and the acoustic properties and visemes of the user's reading such words may be analyzed and compared with the standard corresponding acoustic and viseme profiles as part of the assessment.
  • the assessment result is then used by the teaching plan adjustment unit 1750 to determine whether the initial teaching plan needs to be adjusted and if so, how to change the teaching plan given the student’s performance and the intended goal of the curriculum 1740.
  • the adjustment may be based on the deviation between observed acoustic/viseme characteristics and the standard acoustic/viseme profiles.
  • The adjustment to the teaching plan is then used to update the teaching plan 1760 so that revised teaching activities may be carried out by the teaching plan execution unit 1770.
  • the observed performance and assessment thereof may also be sent to a user profile updater 1780 to continuously update the user profile 290.
  • Different units in Fig. 17A as described herein may be a part of the personalized and context aware dialogue management framework shown in Figs. 16A - 16B. For instance, the user profile updater 1780 may correspond to or be a part of the user profile update engine 1630; the communication understanding unit 1720 may correspond to or be a part of the SLU engine 1100; and the teaching plan execution unit 1770 may be a part of the dialogue manager 150.
  • The purpose of Fig. 17A is to show the feedback nature of the closed loop system 1700 in terms of adapting a teaching plan dynamically during the course of a dialogue session.
  • One example of adjusting a teaching plan based on user’s profile is to have an individualized teaching plan to teach each student in learning the correct pronunciation in a language.
  • pronunciation can be measured both acoustically (phonemes) and visually (visemes).
  • accent profiles of different ethnic groups and individual users may be established via audio and visual information.
  • the accent profile may be established based on audio/visual information on how the user read certain language materials.
  • an accent profile for the group may be devised based on the accent profiles of its members.
  • Fig. 17B illustrates exemplary approaches that a robot agent teacher may adopt to tutor a student, in accordance with an embodiment of the present teaching.
  • a robot teacher can act like a human to devise different ways to tutor a student in learning a language.
  • For example, a teacher may tutor a student via acoustic, visual, or textual means.
  • A tutor may illustrate to a student the correct and incorrect ways to pronounce a word or a phoneme and may demonstrate acoustically a comparison between a correct and an incorrect way to produce a sound.
  • A tutor may also illustrate to a student visually in terms of lip/mouth movement while pronouncing a sound. In some situations, a tutor may also show a student visually the vocal tract animation while producing a sound so that the student may follow it in order to pronounce correctly.
  • A tutor may also provide textual passages that a student can read in order to follow the instructions for pronouncing words correctly.
  • a robot tutor may rely on dynamic observations of a student’s performance, either acoustically or visually, in order to selectively choose a certain way to tutor the student.
  • Such a selection of teaching approaches may include both the materials to be taught, based on a student's progress, and the way to teach the student.
  • The grader unit 1730 may dynamically assess a student's performance based on observations related to multiple aspects, e.g., whether the student answers correctly or not, whether the student's pronunciation presents any accent, and whether the student's visemes conform with the required visemes.
  • Fig. 17C provides exemplary aspects of performance of a user that the grader unit 1730 may evaluate during a tutoring session, in accordance with an embodiment of the present teaching.
  • The grader unit 1730 may be designed to evaluate a student in terms of different aspects of language learning, e.g., linguistic features such as syntactic performance (whether the student uses correct syntax) and semantics (whether the student understands the semantics of words and sentences), pronunciation related features such as how the student pronounces and what the observed visemes are, the student's reading fluency, the student's overall comprehension, the student's use of language, and various other observations of the student that may be relevant to a determination as to what is an appropriate way to teach the student, such as the student's gender, age group, or whether the language being taught appears to be his/her first or second language.
  • Such evaluation results may facilitate the teaching plan adjustment unit 1750 in its decision as to how to adjust the teaching plan, including the materials to be taught as well as the way (acoustic, visual, or textual) to teach the student. One example is shown in Figs. 17D - 17F related to teaching the English language.
  • Fig. 17D provides exemplary projected accent profile distribution 1330 of Americans in spoken English as well as exemplary acoustic waveforms for different phonemes and visual visemes corresponding to such phonemes for a centroid point of the distribution 1330.
  • In Fig. 17D, on the left is the distribution 1330 of accent profiles of a group of Americans projected in a coordinate system, and on the right are, e.g., corresponding acoustic waves of different phonemes in American English and their corresponding visemes derived from a centroid (e.g., average) of the distribution 1330.
  • A goal is for a student to achieve an accent profile preferably falling within the scope of 1330.
  • a tutoring plan is to help a student to achieve, in speaking English, the acoustic characteristics similar to the illustrated acoustic waveforms and corresponding mouth shapes/movements as shown for different phonemes.
  • Such help may be delivered via sound, i.e., the robot agent says a phoneme or a word to the user and asks the student to mimic the sound.
  • the robot agent may show a student visually the shape and movement of the mouth when reading a phoneme or a word.
  • Both acoustic and visual means deployed to teach a student to reach a correct pronunciation may be based on the standard accent profile of the underlying language, e.g., the accent profile of the centroid (average pronunciation) of the distribution 1330 for American English.
  • Fig. 17E shows an example of the deviation of the accent profile of one ethnic group (A4) from that of a language to be taught (A3), in accordance with an embodiment of the present teaching.
  • exemplary visemes from two ethnic groups are shown, e.g., two exemplary visemes 1780-1 and 1790-1 from a standard accent profile corresponding to the centroid of group A3 and two exemplary corresponding visemes 1780-2 and 1790-2 observed from a user from the ethnic group A4 1340.
  • a teaching plan may be devised to incorporate measures/steps to correct the accent by teaching a student how to shape the mouth in order to pronounce correctly.
  • Differences in the speech signals of various phonemes may also be used to determine whether accent correction is needed and, if so, what is to be incorporated in the teaching plan to make it happen. Therefore, such observed differences in either phonemes or visemes provide the basis to develop an adaptive teaching plan.
  • Fig. 17F shows an example of tutoring content incorporated in a teaching plan adaptively developed based on observed viseme features of a user with respect to standard visemes of an underlying spoken language, in accordance with an embodiment of the present teaching.
  • In this example, teaching content 1790-3 is developed with an interface in which a user can see his/her own viseme first and is prompted to, e.g., click on a button for "Create Corrective Shape" to display the correct viseme (mouth shape) of the phoneme.
  • The robot agent may also provide some accompanying verbal instructions in order to coach the student to correctly pronounce the phoneme while watching a standard viseme.
  • Such instructions may also be adaptively created based on, e.g., the difference between 1790-1 and 1790-2. For example, if the viseme of the user appears too wide instead of being more round as in the standard viseme, the instruction may be designed to tell the user that his/her mouth needs to form a round shape.
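  • A rough sketch of deriving such corrective hints from viseme differences is given below; the single mouth aspect-ratio feature, the tolerance, and the phoneme values are all invented for illustration, whereas a real profile would presumably carry many acoustic and viseme features per phoneme.

```python
# Hypothetical viseme feature: mouth aspect ratio (width/height) per phoneme,
# for the standard profile (centroid of A3) and an observed user (group A4).
standard_visemes = {"ow": 1.1, "iy": 2.4, "ae": 1.8}
observed_visemes = {"ow": 1.9, "iy": 2.5, "ae": 1.7}

def corrective_instructions(standard, observed, tolerance=0.3):
    """Flag phonemes whose observed mouth shape deviates from the standard
    and draft a simple corrective hint for each."""
    hints = []
    for phoneme, ref in standard.items():
        diff = observed.get(phoneme, ref) - ref
        if abs(diff) > tolerance:
            shape = "rounder (less wide)" if diff > 0 else "wider"
            hints.append(f"/{phoneme}/: your mouth should be {shape}.")
    return hints

for hint in corrective_instructions(standard_visemes, observed_visemes):
    print(hint)  # e.g., "/ow/: your mouth should be rounder (less wide)."
```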
  • Fig. 17G is a flowchart of an exemplary process for adaptively creating personalized tutoring plans via dynamic information tracking and performance feedback, in accordance with an embodiment of the present teaching.
  • the user’s multimodal input is received first at 1705.
  • a current teaching plan is accessed, at 1715, which is developed based on a curriculum.
  • the user’s performance is assessed at 1725 with respect to the intended goals of the curriculum and discrepancy between the user’s performance and the intended goals of the curriculum is identified at 1735.
  • Such discrepancy may be identified in different modalities, e.g., in acoustic features and in visual features and can then be used to modify, at 1745, the current teaching plan to derive a modified adaptive teaching plan.
  • In addition, the observations made from the user and the assessment of the user's performance may be used to update the user profile at 1755.
  • the robot agent proceeds with the dialogue, at 1765, in accordance with the modified teaching plan.
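  • The feedback loop of Fig. 17G can be summarized by the toy round below: assess performance against curriculum goals (step 1725), identify the discrepancy (step 1735), and adjust the plan (step 1745). The per-phoneme scores, goal thresholds, and all helper functions are assumptions of this sketch rather than the disclosed implementation.

```python
def assess(turn_input, curriculum):
    """Toy grader: report the observed score for each target phoneme."""
    goals = curriculum["phoneme_goals"]
    return {p: turn_input["phoneme_scores"].get(p, 0.0) for p in goals}

def identify_discrepancy(performance, curriculum):
    """List the phonemes still below the curriculum goal."""
    goals = curriculum["phoneme_goals"]
    return [p for p, score in performance.items() if score < goals[p]]

def adjust_plan(plan, weak_phonemes):
    """Front-load drills for phonemes below goal."""
    drills = [f"drill:{p}" for p in weak_phonemes]
    return drills + [activity for activity in plan if activity not in drills]

curriculum = {"phoneme_goals": {"ow": 0.8, "iy": 0.8}}
plan = ["reading_passage_1", "vocabulary_set_3"]
turn_input = {"phoneme_scores": {"ow": 0.55, "iy": 0.90}}

performance = assess(turn_input, curriculum)
plan = adjust_plan(plan, identify_discrepancy(performance, curriculum))
print(plan)  # drills for /ow/ are inserted ahead of the original activities
```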
  • Fig. 18 is an illustrative diagram of an exemplary mobile device architecture that may be used to realize a specialized system implementing the present teaching in accordance with various embodiments.
  • The user device on which the present teaching is implemented corresponds to a mobile device 1800, including, but not limited to, a smart phone, a tablet, a music player, a handheld gaming console, a global positioning system (GPS) receiver, and a wearable computing device (e.g., eyeglasses, wrist watch, etc.), or in any other form factor.
  • Mobile device 1800 may include one or more central processing units ("CPUs") 1840, one or more graphic processing units ("GPUs") 1830, a display 1820, a memory 1860, a communication platform 1810, such as a wireless communication module, storage 1890, and one or more input/output (I/O) devices 1840. Any other suitable component, including but not limited to a system bus or a controller (not shown), may also be included in the mobile device 1800. As shown in Fig. 18, a mobile operating system 1870 (e.g., iOS, Android, Windows Phone, etc.) and one or more applications 1880 may be loaded into memory 1860 from storage 1890 in order to be executed by the CPU 1840.
  • The applications 1880 may include a browser or any other suitable mobile apps for managing a conversation system on the mobile device 1800. User interactions may be achieved via the I/O devices 1840 and provided to the automated dialogue companion via a network.
  • computer hardware platforms may be used as the hardware platform(s) for one or more of the elements described herein.
  • the hardware elements, operating systems and programming languages of such computers are conventional in nature, and it is presumed that those skilled in the art are adequately familiar therewith to adapt those technologies to appropriate settings as described herein.
  • a computer with user interface elements may be used to implement a personal computer (PC) or other type of workstation or terminal device, although a computer may also act as a server if appropriately programmed. It is believed that those skilled in the art are familiar with the structure, programming and general operation of such computer equipment and as a result the drawings should be self-explanatory.
  • Fig. 19 is an illustrative diagram of an exemplary computing device architecture that may be used to realize a specialized system implementing the present teaching in accordance with various embodiments.
  • a specialized system incorporating the present teaching has a functional block diagram illustration of a hardware platform, which includes user interface elements.
  • the computer may be a general purpose computer or a special purpose computer. Both can be used to implement a specialized system for the present teaching.
  • This computer 1900 may be used to implement any component of the conversation or dialogue management system, as described herein.
  • For example, the conversation management system may be implemented on a computer such as computer 1900, via its hardware, software program, firmware, or a combination thereof.
  • Computer 1900 includes COM ports 1950 connected to and from a network connected thereto to facilitate data communications.
  • Computer 1900 also includes a central processing unit (CPU) 1920, in the form of one or more processors, for executing program instructions.
  • the exemplary computer platform includes an internal communication bus 1910, program storage and data storage of different forms (e.g., disk 1970, read only memory (ROM) 1930, or random access memory (RAM) 1940), for various data files to be processed and/or communicated by computer 1900, as well as possibly program instructions to be executed by CPU 1920.
  • Computer 1900 also includes an I/O component 1960, supporting input/output flows between the computer and other components therein such as user interface elements 1980.
  • Computer 1900 may also receive programming and data via network communications.
  • aspects of the methods of dialogue management and/or other processes may be embodied in programming.
  • Program aspects of the technology may be thought of as "products" or "articles of manufacture" typically in the form of executable code and/or associated data that is carried on or embodied in a type of machine readable medium.
  • Tangible non-transitory“storage” type media include any or all of the memory or other storage for the computers, processors or the like, or associated modules thereof, such as various semiconductor memories, tape drives, disk drives and the like, which may provide storage at any time for the software programming.
  • All or portions of the software may at times be communicated through a network such as the Internet or various other telecommunication networks. Such communications, for example, may enable loading of the software from one computer or processor into another, for example, in connection with conversation management.
  • another type of media that may bear the software elements includes optical, electrical and electromagnetic waves, such as used across physical interfaces between local devices, through wired and optical landline networks and over various air-links.
  • the physical elements that carry such waves, such as wired or wireless links, optical links or the like also may be considered as media bearing the software.
  • Terms such as computer or machine "readable medium" refer to any medium that participates in providing instructions to a processor for execution.
  • a machine-readable medium may take many forms, including but not limited to, a tangible storage medium, a carrier wave medium or physical transmission medium.
  • Non-volatile storage media include, for example, optical or magnetic disks, such as any of the storage devices in any computer(s) or the like, which may be used to implement the system or any of its components as shown in the drawings.
  • Volatile storage media include dynamic memory, such as a main memory of such a computer platform.
  • Tangible transmission media include coaxial cables; copper wire and fiber optics, including the wires that form a bus within a computer system.
  • Carrier-wave transmission media may take the form of electric or electromagnetic signals, or acoustic or light waves such as those generated during radio frequency (RF) and infrared (IR) data communications.
  • Common forms of computer-readable media therefore include, for example: a floppy disk, a flexible disk, hard disk, magnetic tape, any other magnetic medium, a CD-ROM, DVD or DVD-ROM, any other optical medium, punch cards, paper tape, any other physical storage medium with patterns of holes, a RAM, a PROM and an EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave transporting data or instructions, cables or links transporting such a carrier wave, or any other medium from which a computer may read programming code and/or data.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)

Abstract

The present invention relates to a method, system, medium, and embodiments for human machine dialogue. An utterance is received from a user engaged in the human machine dialogue on a topic in a dialogue scene. Multimodal surround information related to the human machine dialogue is obtained and analyzed in order to track the multimodal context of the human machine dialogue. The operation for spoken language understanding of the utterance is performed, in a context aware manner based on the tracked multimodal context, in order to determine the semantics of the utterance.
PCT/US2020/036748 2019-06-17 2020-06-09 Système et procédé pour un dialogue homme-machine personnalisé et multimodal tenant compte du contexte WO2020256993A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202080054000.3A CN114270337A (zh) 2019-06-17 2020-06-09 用于个性化和多模态的上下文感知的人机对话的系统和方法

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201962862296P 2019-06-17 2019-06-17
US62/862,296 2019-06-17

Publications (1)

Publication Number Publication Date
WO2020256993A1 true WO2020256993A1 (fr) 2020-12-24

Family

ID=74037544

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2020/036748 WO2020256993A1 (fr) 2019-06-17 2020-06-09 Système et procédé pour un dialogue homme-machine personnalisé et multimodal tenant compte du contexte

Country Status (2)

Country Link
CN (1) CN114270337A (fr)
WO (1) WO2020256993A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112989822A (zh) * 2021-04-16 2021-06-18 北京世纪好未来教育科技有限公司 识别对话中句子类别的方法、装置、电子设备和存储介质

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115470381A (zh) * 2022-08-16 2022-12-13 北京百度网讯科技有限公司 Information interaction method, apparatus, device, and medium
CN115860013B (zh) * 2023-03-03 2023-06-02 深圳市人马互动科技有限公司 Dialogue message processing method, apparatus, system, device, and medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130029308A1 (en) * 2009-07-08 2013-01-31 Graesser Arthur C Methods and computer-program products for teaching a topic to a user
US20160259775A1 (en) * 2015-03-08 2016-09-08 Speaktoit, Inc. Context-based natural language processing
US20160378861A1 (en) * 2012-09-28 2016-12-29 Sri International Real-time human-machine collaboration using big data driven augmented reality technologies

Also Published As

Publication number Publication date
CN114270337A (zh) 2022-04-01

Similar Documents

Publication Publication Date Title
US11017779B2 (en) System and method for speech understanding via integrated audio and visual based speech recognition
US11024294B2 (en) System and method for dialogue management
US20190371318A1 (en) System and method for adaptive detection of spoken language via multiple speech models
US20190206402A1 (en) System and Method for Artificial Intelligence Driven Automated Companion
US11183187B2 (en) Dialog method, dialog system, dialog apparatus and program that gives impression that dialog system understands content of dialog
US11017551B2 (en) System and method for identifying a point of interest based on intersecting visual trajectories
US11003860B2 (en) System and method for learning preferences in dialogue personalization
Griol et al. An architecture to develop multimodal educative applications with chatbots
US11200902B2 (en) System and method for disambiguating a source of sound based on detected lip movement
WO2020256993A1 (fr) System and method for personalized, multimodal, context-aware human-machine dialogue
US11308312B2 (en) System and method for reconstructing unoccupied 3D space
US10785489B2 (en) System and method for visual rendering based on sparse samples with predicted motion
US20190251350A1 (en) System and method for inferring scenes based on visual context-free grammar model
US20190251716A1 (en) System and method for visual scene construction based on user communication
KR20220128897A System and method for evaluating conversational ability using an artificial intelligence avatar
Gonzalez et al. AI in informal science education: bringing Turing back to life to perform the Turing test
Irfan et al. Dynamic emotional language adaptation in multiparty interactions with agents
WO2020256992A1 (fr) System and method for intelligent dialogue based on knowledge tracking
Zikky et al. Utilizing Virtual Humans as Campus Virtual Receptionists
KR20190106011A Dialogue system and method, and computer program stored on a medium for executing the method
Grazioso et al. "What's that called?": a multimodal fusion approach for cultural heritage virtual experiences
Pammi Synthesis of listener vocalizations: towards interactive speech synthesis
Carrión On the development of Adaptive and Portable Spoken Dialogue Systems: Emotion Recognition, Language Adaptation and Field Evaluation
Pammi Synthesis of listener vocalizations
Caon Automatic speech recognition, with large vocabulary, robustness, independence of speaker and multilingual processing

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

32PN EP: public notification in the EP bulletin as the address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 09/05/2022)

122 EP: PCT application non-entry in European phase

Ref document number: 20827229

Country of ref document: EP

Kind code of ref document: A1