WO2021003471A1

WO2021003471A1 - System and method for adaptive dialogue management across real and augmented reality

Info

Publication number: WO2021003471A1
Application number: PCT/US2020/040830
Authority: WO
Inventors: Victor Zhang
Original assignee: DMAI, Inc.
Priority date: 2019-07-03
Filing date: 2020-07-03
Publication date: 2021-01-07
Also published as: CN114287030A

Abstract

The present teaching relates to method, system, medium, and implementations for managing a user machine dialogue. Information is received related to a user machine dialogue in a dialogue scene that involves a user and managed by a dialogue manager in accordance with an initial dialogue strategy. Based on the information, the initial dialogue strategy is adapted to generate an updated dialogue strategy, based on which it is determined whether the user machine dialogue is to continue in an augmented dialogue reality having a virtual scene rendered in the dialogue scene. If so, a virtual agent manager is activated to create the augmented dialogue reality and manage the user machine dialogue therein.

Description

SYSTEM AND METHOD FOR ADAPTIVE DIALOGUE MANAGEMENT ACROSS REAL AND AUGMENTED REALITY

CROSS-REFERENCE TO RELATED APPLICATIONS

The present application claims priority from: U.S. Provisional Patent Application

62/870162, filed July 3, 2019, U.S. Provisional Patent Application 62/870168, filed July 3, 2019, U.S. Provisional Patent Application 62/870201, filed July 3, 2019, U.S. Provisional Patent Application 62/870174, filed July 3, 2019, U.S. Provisional Patent Application 62/870211, filed July 3, 2019, U.S. Provisional Patent Application 62/870217, filed July 3, 2019, U.S. Provisional Patent Application 62/870224, filed July 3, 2019, each of which are hereby incorporated by reference in their entireties.

BACKGROUND

1 Technical Field

[0001] The present teaching generally relates to computer. More specifically, the present teaching relates to computerized intelligent agent.

2. Technical Background

[0002] With advancement of artificial intelligence technologies and the explosion of Internet-based communications due to the ubiquitous Internet’s connectivity, computer aided dialogue systems have become increasingly popular. For example, more and more call centers deploy automated dialogue robot to handle customer calls. Various hotels installed various kiosks that can answer questions from tourists or guests. Online bookings (whether travel accommodations or theater tickets, etc.) are also more frequently done by chatbots. In recent years, automated human machine communications in other areas are also becoming increasingly popular.

[0003] Such traditional computer aided dialogue systems are usually pre-programed with certain questions and answers based on commonly known patterns of conversations in different domains. Unfortunately, a human conversant can be unpredictable and sometimes does not follow a pre-planned dialogue pattern. In addition, in certain situations, a human conversant may digress during the process and continuing a fixed conversation pattern likely will cause irritation or loss of interests. When this happens, such traditional machine dialogue systems will not be able to continue to engage a human conversant so that the human machine dialogue either has to be aborted to hand to a human operator or the human conversant simply leaves the dialogue, which is undesirable.

[0004] In addition, traditional machine based dialogue systems usually are not designed to consider the emotional factors of a human, let alone taking into consideration as to how to address such emotional factors when conversing with a human. For example, a traditional machine dialogue system usually does not initiate the conversation unless a human activates the system or asks some questions. Even if a traditional dialogue system does initiate a

conversation, it has a fixed way to start a conversation and does not change from human to human or adjusted based on observations. As such, although they are programmed to faithfully follow the pre-designed dialogue pattern, they are usually not able to react to the dynamics of the conversation and adapt in order to keep the conversation going in a way that can engage the human. In many situations, when a human involved in a dialogue is clearly annoyed or frustrated, a traditional machine dialogue systems is usually completely insensitive to it and continue to press the conversation in the same manner, causing annoyance and destroying the interest of the human. This not only makes the conversation end unpleasantly (the machine is still unaware of that) but also turns the person away from conversing with any machine based dialogue system in the future.

[0005] In some application, conducting a human machine dialogue session based on what is observed from the human is crucially important in order to determine how to proceed effectively. One example is an education related dialogue. When a chatbot is used for teaching a child to read, whether the child is perceptive to the way he/she is being taught has to be monitored and taken into account continuously in order to be effective. But traditional systems do not address such issues. Another limitation of the traditional dialogue systems is that they are not aware of the dialogue context in different dimensions. For example, a traditional dialogue system is not equipped with the ability to observe the context of a conversation and improvise as to dialogue strategy in order to engage a user and improve the user experience.

[0006] Thus, there is a need for methods and systems that address such limitations.

SUMMARY

[0007] The teachings disclosed herein relate to methods, systems, and programming for advertising. More particularly, the present teaching relates to methods, systems, and programming related to exploring sources of advertisement and utilization thereof.

[0008] In one example, a method, implemented on a machine having at least one processor, storage, and a communication platform capable of connecting to a network for managing a user machine dialogue. Information is received related to a user machine dialogue in a dialogue scene that involves a user and managed by a dialogue manager in accordance with an initial dialogue strategy. Based on the information, the initial dialogue strategy is adapted to generate an updated dialogue strategy, based on which it is determined whether the user machine dialogue is to continue in an augmented dialogue reality having a virtual scene rendered in the dialogue scene. If so, a virtual agent manager is activated to create the augmented dialogue reality and manage the user machine dialogue therein.

[0009] In a different example, a system for managing a user machine dialogue, which includes a dialogue manager and an information state updater. The dialogue manager is configured for receiving information related to a user machine dialogue in a dialogue scene involving a user, wherein the user machine dialogue is managed in accordance with an initial dialogue strategy. The information updater is configured for updating an information state based on the information to facilitate adaptation of the initial dialogue strategy stored in the information state to generate an updated dialogue strategy. Based on the information state, the dialogue manager is further configured for determining, based on the updated dialogue strategy, whether the user machine dialogue is to continue in an augmented dialogue reality having a virtual scene rendered in the dialogue scene and if so, activating a virtual agent manager to create the augmented dialogue reality and manage the user machine dialogue therein.

[0010] Other concepts relate to software for implementing the present teaching. A software product, in accord with this concept, includes at least one machine-readable non- transitory medium and information carried by the medium. The information carried by the medium may be executable program code data, parameters in association with the executable program code, and/or information related to a user, a request, content, or other additional information.

[0011] In one example, a machine-readable, non-transitory and tangible medium having data recorded thereon for user machine dialogue, wherein the medium, when read by the machine, causes the machine to perform a series of steps to implement a method of managing a user machine dialogue.

[0012] Additional advantages and novel features will be set forth in part in the description which follows, and in part will become apparent to those skilled in the art upon examination of the following and the accompanying drawings or may be learned by production or operation of the examples. The advantages of the present teachings may be realized and attained by practice or use of various aspects of the methodologies, instrumentalities and combinations set forth in the detailed examples discussed below.

BRIEF DESCRIPTION OF THE DRAWINGS

[0013] The methods, systems and/or programming described herein are further described in terms of exemplary embodiments. These exemplary embodiments are described in detail with reference to the drawings. These embodiments are non-limiting exemplary embodiments, in which like reference numerals represent similar structures throughout the several views of the drawings, and wherein:

[0014] Fig. 1 depicts a networked environment for facilitating a dialogue between a user operating a user device and an agent device in conjunction with a user interaction engine, in accordance with an embodiment of the present teaching;

[0015] Figs. 2A-2B depict connections among a user device, an agent device, and a user interaction engine during a dialogue, in accordance with an embodiment of the present teaching;

[0016] Fig. 3 A illustrates an exemplary structure of an agent device with exemplary types of agent body, in accordance with an embodiment of the present teaching; [0017] Fig. 3B illustrates an exemplary agent device, in accordance with an embodiment of the present teaching;

[0018] Fig. 4A depicts an exemplary high level system diagram for an overall system for the automated companion, in accordance with various embodiments of the present teaching;

[0019] Fig. 4B illustrates a part of a dialogue tree of an on-going dialogue with paths taken based on interactions between the automated companion and a user, in accordance with an embodiment of the present teaching;

[0020] Fig. 5 illustrates exemplary multiple layer processing and communications among different processing layers of an automated dialogue companion, in accordance with an embodiment of the present teaching;

[0021] Fig. 6A depicts an exemplary configuration of a dialogue system centered around an information state capturing dynamic information observed during a dialogue, in accordance with an embodiment of the present teaching;

[0022] Fig. 6B is a flowchart of an exemplary process of a dialogue system using an information state capturing dynamic information observed during a dialogue, in accordance with an embodiment of the present teaching;

[0023] Figs. 7A depicts an exemplary construction of an information state, in accordance with an embodiment of the present teaching;

[0024] Fig. 7B illustrates how different minds are connected in a dialogue with a robot tutor teaching a user adding fractions, in accordance with an embodiment of the present teaching; [0025] Fig. 7C shows an exemplary relationship among an agent’s mind, a shared mind, and a user’s mind represented in an information state, in accordance with an embodiment of the present teaching;

[0026] Fig. 8A illustrates an exemplary dialogue scene where actual and virtual realities can be combined to achieve adaptive dialogue strategy, in accordance with an embodiment of the present teaching;

[0027] Fig. 8B illustrates exemplary aspects of operation for achieving adaptive dialogue strategy, in accordance with an embodiment of the present teaching;

[0028] Fig. 9A depicts an exemplary high level system diagram of a system for adaptive environment modeling and robot/sensor collaboration, in accordance with an embodiment of the present teaching;

[0029] Fig. 9B is a flowchart of an exemplary process of a system for adaptive environment modeling and robot/sensor collaboration, in accordance with an embodiment of the present teaching;

[0030] Fig. 10A describes an exemplary hierarchical dialogue agent collaboration in an augmented dialogue reality, in accordance with an embodiment of the present teaching;

[0031] Fig. 10B shows exemplary composition of a virtual agent to be deployed in an augmented dialogue reality, in accordance with an embodiment of the present teaching;

[0032] Fig. 11 A shows an exemplary dialogue scene with a robot agent and a user, in accordance with an embodiment of the present teaching;

[0033] Fig. 11B shows an augmented dialogue reality scene where a robot agent creates a virtual avatar aiming to carry out a designated dialogue with a user, in accordance with an embodiment of the present teaching; [0034] Fig. l lC shows an augmented dialogue reality scene where a robot agent creates a virtual avatar for carrying on a designated dialogue with a user and a virtual companion for providing companionship to the user, in accordance with an embodiment of the present teaching;

[0035] Fig. 11D shows an augmented dialogue reality scene where a robot agent creates a virtual avatar for carrying on a designated dialogue, which further creates a virtual companion for providing companionship to the user, in accordance with an embodiment of the present teaching;

[0036] Fig. 1 IE shows a robot agent sequentially creating different virtual avatars each being responsible for a designated task, in accordance with an embodiment of the present teaching;

[0037] Fig. 12A depicts an exemplary high level system diagram of a virtual agent manager in connection with a dialogue manager to manage a dialogue in an augmented dialogue reality, in accordance with an embodiment of the present teaching;

[0038] Fig. 12B is a flowchart of an exemplary process of collaboratively managing a dialogue between a dialogue manager and a virtual agent manager, in accordance with an embodiment of the present teaching;

[0039] Fig. 12C is a flowchart of an exemplary process of a virtual agent manager, in accordance with an embodiment of the present teaching;

[0040] Fig. 13 A illustrates exemplary types of constraints observed by an augmented reality launcher, in accordance with an embodiment of the present teaching;

[0041] Figs.13B -13E depict examples of rendering a virtual scene; [0042] Fig. 13F illustrates different constraints to be observed in rendering a virtual agent, in accordance with an embodiment of the present teaching;

[0043] Fig. 13G shows different constraints to be observed in rendering virtual objects, in accordance with an embodiment of the present teaching;

[0044] Fig. 14 depicts an exemplary high level system diagram of the augmented reality launcher for rendering a virtual scene in an actual dialogue scene, in accordance with an embodiment of the present teaching;

[0045] Fig. 15 is a flowchart of an exemplary process of an augmented reality launcher for combining a virtual scene with an actual dialogue scene, in accordance with an embodiment of the present teaching;

[0046] Fig. 16 is an illustrative diagram of an exemplary mobile device architecture that may be used to realize a specialized system implementing the present teaching in accordance with various embodiments; and

[0047] Fig. 17 is an illustrative diagram of an exemplary computing device architecture that may be used to realize a specialized system implementing the present teaching in accordance with various embodiments.

DETAILED DESCRIPTION

[0048] In the following detailed description, numerous specific details are set forth by way of examples in order to facilitate a thorough understanding of the relevant teachings. However, it should be apparent to those skilled in the art that the present teachings may be practiced without such details. In other instances, well known methods, procedures, components, and/or circuitry have been described at a relatively high-level, without detail, in order to avoid unnecessarily obscuring aspects of the present teachings.

[0049] The present teaching aims to address the deficiencies of the traditional human machine dialogue systems and to provide methods and systems that enables a more effective and realistic human to machine dialogue framework. The present teaching incorporates artificial intelligence in an automated companion with an agent device interfacing with a human ad working in conjunction with the backbone support from a user interaction engine so that the automated companion can conduct a dialogue based on continuously monitored multimodal data indicative of the surrounding of the dialogue, adaptively estimating the mindset/emotion/intent of the participants of the dialogue, and adaptively adjust the conversation strategy based on the dynamically changing information/estimates/contextual information.

[0050] The automated companion according to the present teaching is capable of personalizing a dialogue by adapting in multiple fronts, including but is not limited to, the subject matter of the conversation, the hardware/software/components or actual or virtual dialogue environments used to carry out the conversation, and the expression/behavior/gesture used to deliver responses to a human conversant. The adaptive dialogue strategy is to make the conversation more engaging, realistic, and productive by flexibly changing the conversation strategy based on observations on how receptive the human conversant is to the dialogue. The adaptive dialogue system according to the present teaching can be configured to be driven by a goal driven strategy by intelligently and dynamically configuring hardware/software components, dialogue setting (virtual or actual), or dialogue policy that are considered most appropriate to achieve an intended goal. Such adaptations/optimizations are carried out based on learning, including from prior conversations as well as from an on-going conversation such as observations of a human conversant on behavior/reactions/emotions exhibited during the conversation. Such a goal driven strategy may be exploited to remain the human conversant engaged in the conversation even when, in some instances, the conversation may appear to be deviating from a path initially designed for the intended goal.

[0051] More specifically, the present teaching discloses a user interaction engine providing backbone support to an agent device to facilitate more realistic and more engaging dialogues with a human conversant. Fig. 1 depicts a networked environment 100 for facilitating a dialogue between a user operating a user device and an agent device in conjunction with a user interaction engine, in accordance with an embodiment of the present teaching. In Fig. 1, the exemplary networked environment 100 includes one or more user devices 110, such as user devices 110-a, 110-b, 110-c, and 110-d, one or more agent devices 160, such as agent devices 160- a, ... 160-b, a user interaction engine 140, and a user information database 130, each of which may communicate with one another via network 120. In some embodiments, network 120 may correspond to a single network or a combination of different networks. For example, network 120 may be a local area network (“LAN”), a wide area network (“WAN”), a public network, a proprietary network, a proprietary network, a Public Telephone Switched Network (“PSTN”), the Internet, an intranet, a Bluetooth network, a wireless network, a virtual network, and/or any combination thereof. In one embodiment, network 120 may also include various network access points. For example, environment 100 may include wired or wireless access points such as, without limitation, base stations or Internet exchange points 120-a, ... , 120-b. Base stations 120-a and 120-b may facilitate, for example, communications to/from user devices 110 and/or agent devices 160 with one or more other components in the networked framework 100 across different types of network. [0052] A user device, e.g., 1 10-a, may be of different types to facilitate a user operating the user device to connect to network 120 and transmit/receive signals. Such a user device 110 may correspond to any suitable type of electronic/computing device including, but not limited to, a desktop computer (110-d), a mobile device (110-a), a device incorporated in a transportation vehicle (110-b), ... , a mobile computer (110-c), or a stationary device/computer (110-d). A mobile device may include, but is not limited to, a mobile phone, a smart phone, a personal display device, a personal digital assistant (“PDAs”), a gaming console/device, a wearable device such as a watch, a Fitbit, a pin/broach, a headphone, etc. A transportation vehicle embedded with a device may include a car, a truck, a motorcycle, a boat, a ship, a train, or an airplane. A mobile computer may include a laptop, an Ultrabook device, a handheld device, etc. A stationary device/computer may include a television, a set top box, a smart household device (e.g., a refrigerator, a microwave, a washer or a dryer, an electronic assistant, etc.), and/or a smart accessory (e.g., a light bulb, a light switch, an electrical picture frame, etc.).

[0053] An agent device, e.g., any of 160-a, ... , 160-b, may correspond one of different types of devices that may communicate with a user device and/or the user interaction engine 140. Each agent device, as described in greater detail below, may be viewed as an automated companion device that interfaces with a user with, e.g., the backbone support from the user interaction engine 140. An agent device as described herein may correspond to a robot which can be a game device, a toy device, a designated agent device such as a traveling agent or weather agent, etc. The agent device as disclosed herein is capable of facilitating and/or assisting in interactions with a user operating user device. In doing so, an agent device may be configured as a robot capable of controlling some of its parts, via the backend support from the application server 130, for, e.g., making certain physical movement (such as head), exhibiting certain facial expression (such as curved eyes for a smile), or saying things in a certain voice or tone (such as exciting tones) to display certain emotions.

[0054] When a user device (e.g., user device 110-a) is connected to an agent device, e.g., 160-a (e.g., via either a contact or contactless connection), a client running on a user device, e.g., 110-a, may communicate with the automated companion (either the agent device or the user interaction engine or both) to enable an interactive dialogue between the user operating the user device and the agent device. The client may act independently in some tasks or may be controlled remotely by the agent device or the user interaction engine 140. For example, to respond to a questions from a user, the agent device or the user interaction engine 140 may control the client running on the user device to render the speech of the response to the user. During a conversation, an agent device may include one or more input mechanisms (e.g., cameras, microphones, touch screens, buttons, etc.) that allow the agent device to capture inputs related to the user or the local environment associated with the conversation. Such inputs may assist the automated companion to develop an understanding of the atmosphere surrounding the conversation (e.g., movements of the user, sound of the environment) and the mindset of the human conversant (e.g., user picks up a ball which may indicates that the user is bored) in order to enable the automated companion to react accordingly and conduct the conversation in a manner that will keep the user interested and engaging.

[0055] In the illustrated embodiments, the user interaction engine 140 may be a backend server, which may be centralized or distributed. It is connected to the agent devices and/or user devices. It may be configured to provide backbone support to agent devices 160 and guide the agent devices to conduct conversations in a personalized and customized manner. In some embodiments, the user interaction engine 140 may receive information from connected devices (either agent devices or user devices), analyze such information, and control the flow of the conversations by sending instructions to agent devices and/or user devices. In some embodiments, the user interaction engine 140 may also communicate directly with user devices, e.g., providing dynamic data, e.g., control signals for a client running on a user device to render certain responses.

[0056] Generally speaking, the user interaction engine 140 may control the state and the flow of conversations between users and agent devices. The flow of each of the conversations may be controlled based on different types of information associated with the conversation, e.g., information about the user engaged in the conversation (e.g., from the user information database 130), the conversation history, surround information of the conversations, and/or the real time user feedbacks. In some embodiments, the user interaction engine 140 may be configured to obtain various sensory inputs such as, and without limitation, audio inputs, image inputs, haptic inputs, and/or contextual inputs, process these inputs, formulate an understanding of the human conversant, accordingly generate a response based on such understanding, and control the agent device and/or the user device to carry out the conversation based on the response. As an illustrative example, the user interaction engine 140 may receive audio data representing an utterance from a user operating user device , and generate a response (e.g., text) which may then be delivered to the user in the form of a computer generated utterance as a response to the user. As yet another example, the user interaction engine 140 may also, in response to the utterance, generate one or more instructions that control an agent device to perform a particular action or set of actions.

[0057] As illustrated, during a human machine dialogue, a user, as the human conversant in the dialogue, may communicate across the network 120 with an agent device or the user interaction engine 140. Such communication may involve data in multiple modalities such as audio, video, text, etc. Via a user device, a user can send data (e.g., a request, audio signal representing an utterance of the user, or a video of the scene surrounding the user) and/or receive data (e.g., text or audio response from an agent device). In some embodiments, user data in multiple modalities, upon being received by an agent device or the user interaction engine 140, may be analyzed to understand the human user’s speech or gesture so that the user’s emotion or intent may be estimated and used to determine a response to the user.

[0058] Fig. 2A depicts specific connections among a user device 110-a, an agent device 160-a, and the user interaction engine 140 during a dialogue, in accordance with an embodiment of the present teaching. As seen, connections between any two of the parties may all be bi-directional, as discussed herein. The agent device 160-a may interface with the user via the user device 110-a to conduct a dialogue in a bi-directional communications. On one hand, the agent device 160-a may be controlled by the user interaction engine 140 to utter a response to the user operating the user device 110-a. On the other hand, inputs from the user site, including, e.g., both the user’s utterance/action and information about the surrounding of the user, are provided to the agent device via the connections. The agent device 160-a may be configured to process such input and dynamically adjust its response to the user. For example, the agent device may be instructed by the user interaction engine 140 to render a tree on the user device. Knowing that the surrounding environment of the user (based on visual information from the user device) shows green trees and lawns, the agent device may customize the tree to be rendered as a lush green tree. If the scene from the user site shows that it is a winter weather, the agent device may control to render the tree on the user device with parameters for a tree that has no leaves. As another example, if the agent device is instructed to render a duck on the user device, the agent device may retrieve information from the user information database 130 on color preference and generate parameters for customizing the duck in a user’s preferred color before sending the instruction for the rendering to the user device.

[0059] In some embodiments, such inputs from the user’s site and processing results thereof may also be transmitted to the user interaction engine 140 for facilitating the user interaction engine 140 to better understand the specific situation associated with the dialogue so that the user interaction engine 140 may determine the state of the dialogue, emotion/mindset of the user, and to generate a response that is based on the specific situation of the dialogue and the intended purpose of the dialogue (e.g., for teaching a child the English vocabulary). For example, if information received from the user device indicates that the user appears to be bored and become impatient, the user interaction engine 140 may determine to change the state of dialogue to a topic that is of interest of the user (e.g., based on the information from the user information database 130) in order to continue to engage the user in the conversation.

[0060] In some embodiments, a client running on the user device may be configured to be able to process raw inputs of different modalities acquired from the user site and send the processed information (e.g., relevant features of the raw inputs) to the agent device or the user interaction engine for further processing. This will reduce the amount of data transmitted over the network and enhance the communication efficiency. Similarly, in some embodiments, the agent device may also be configured to be able to process information from the user device and extract useful information for, e.g., customization purposes. Although the user interaction engine 140 may control the state and flow control of the dialogue, making the user interaction engine 140 light weight improves the user interaction engine 140 scale better.

[0061] Fig. 2B depicts the same setting as what is presented in Fig. 2A with additional details on the user device 110-a. As shown, during a dialogue between the user and the agent 210, the user device 110-a may continually collect multi-modal sensor data related to the user and his/her surroundings, which may be analyzed to detect any information related to the dialogue and used to intelligently control the dialogue in an adaptive manner. This may further enhance the user experience or engagement. Fig. 2B illustrates exemplary sensors such as video sensor 230, audio sensor 240, ... , or haptic sensor 250. The user device may also send textual data as part of the multi-model sensor data. Together, these sensors provide contextual information surrounding the dialogue and can be used for the user interaction system 140 to understand the situation in order to manage the dialogue. In some embodiment, the multi-modal sensor data may first be processed on the user device and important features in different modalities may be extracted and sent to the user interaction system 140 so that dialogue may be controlled with an understanding of the context. In some embodiments, the raw multi-modal sensor data may be sent directly to the user interaction system 140 for processing.

[0062] As seen in Figs. 2A-2B, the agent device may correspond to a robot that has different parts, including its head 210 and its body 220. Although the agent device as illustrated in Figs. 2A-2B appears to be a person robot, it may also be constructed in other forms as well, such as a duck, a bear, a rabbit, etc. Fig. 3A illustrates an exemplary structure of an agent device with exemplary types of agent body, in accordance with an embodiment of the present teaching. As presented, an agent device may include a head and a body with the head attached to the body. In some embodiments, the head of an agent device may have additional parts such as face, nose and mouth, some of which may be controlled to, e.g., make movement or expression. In some embodiments, the face on an agent device may correspond to a display screen on which a face can be rendered and the face may be of a person or of an animal. Such displayed face may also be controlled to express emotion. [0063] The body part of an agent device may also correspond to different forms such as a duck, a bear, a rabbit, etc. The body of the agent device may be stationary, movable, or semi-movable. An agent device with stationary body may correspond to a device that can sit on a surface such as a table to conduct face to face conversation with a human user sitting next to the table. An agent device with movable body may correspond to a device that can move around on a surface such as table surface or floor. Such a movable body may include parts that can be kinematically controlled to make physical moves. For example, an agent body may include feet which can be controlled to move in space when needed. In some embodiments, the body of an agent device may be semi-movable, i.e., some parts are movable and some are not. For example, a tail on the body of an agent device with a duck appearance may be movable but the duck cannot move in space. A bear body agent device may also have arms that may be movable but the bear can only sit on a surface.

[0064] Fig. 3B illustrates an exemplary agent device or automated companion 160- a, in accordance with an embodiment of the present teaching. The automated companion 160-a is a device that interacts with people using speech and/or facial expression or physical gestures. For example, the automated companion 160-a corresponds to an animatronic peripheral device with different parts, including head portion 310, eye portion (cameras) 320, a mouth portion with laser 325 and a microphone 330, a speaker 340, neck portion with servos 350, one or more magnet or other components that can be used for contactless detection of presence 360, and a body portion corresponding to, e.g., a charge base 370. In operation, the automated companion 160-a may be connected to a user device which may include a mobile multi -function device (110-a) via network connections. Once connected, the automated companion 160-a and the user device interact with each other via, e.g., speech, motion, gestures, and/or via pointing with a laser pointer. [0065] Other exemplary functionalities of the automated companion 160-a may include reactive expressions in response to a user’s response via, e.g., an interactive video cartoon character (e.g., avatar) displayed on, e.g., a screen as part of a face on the automated companion. The automated companion may use a camera (320) to observe the user’s presence, facial expressions, direction of gaze, surroundings, etc. An animatronic embodiment may“look” by pointing its head (310) containing a camera (320),“listen” using its microphone (340),“point” by directing its head (310) that can move via servos (350). In some embodiments, the head of the agent device may also be controlled remotely by a, e.g., the user interaction system 140 or by a client in a user device (110-a), via a laser (325). The exemplary automated companion 160-a as shown in Fig. 3B may also be controlled to“speak” via a speaker (330).

[0066] Fig. 4A depicts an exemplary high level system diagram for an overall system for the automated companion, according to various embodiments of the present teaching. In this illustrated embodiment, the overall system may encompass components/function modules residing in a user device, an agent device, and the user interaction engine 140. The overall system as depicted herein comprises a plurality of layers of processing and hierarchies that together carries out human-machine interactions in an intelligent manner. In the illustrated embodiment, there are 5 layers, including layer 1 for front end application as well as front end multi-modal data processing, layer 2 for characterizations of the dialog setting, layer 3 is where the dialog management module resides, layer 4 for estimated mindset of different parties (human, agent, device, etc.), layer 5 for so called utility. Different layers may correspond different levels of processing, ranging from raw data acquisition and processing at layer 1 to layer 5 on processing changing utilities of participants of dialogues. [0067] The term“utility” is hereby defined as preferences of a party identified based on states detected associated with dialogue histories. Utility may be associated with a party in a dialogue, whether the party is a human, the automated companion, or other intelligent devices. A utility for a particular party may represent different states of a world, whether physical, virtual, or even mental. For example, a state may be represented as a particular path along which a dialog walks through in a complex map of the world. At different instances, a current state evolves into a next state based on the interaction between multiple parties. States may also be party dependent, i.e., when different parties participate in an interaction, the states arising from such interaction may vary. A utility associated with a party may be organized as a hierarchy of preferences and such a hierarchy of preferences may evolve over time based on the party’s choices made and likings exhibited during conversations. Such preferences, which may be represented as an ordered sequence of choices made out of different options, is what is referred to as utility. The present teaching discloses method and system by which an intelligent automated companion is capable of learning, through a dialogue with a human conversant, the user’s utility.

[0068] Within the overall system for supporting the automated companion, front end applications as well as front end multi-modal data processing in layer 1 may reside in a user device and/or an agent device. For example, the camera, microphone, keyboard, display, Tenderer, speakers, chat-bubble, and user interface elements may be components or functional modules of the user device. For instance, there may be an application or client running on the user device which may include the functionalities before an external application interface (API) as shown in Fig. 4A. In some embodiments, the functionalities beyond the external API may be considered as the backend system or reside in the user interaction engine 140. The application running on the user device may take multi-model data (audio, images, video, text) from the sensors or circuitry of the user device, process the multi-modal data to generate text or other types of signals (object such as detected user face, speech understanding result) representing features of the raw multi-modal data, and send to layer 2 of the system.

[0069] In layer 1, multi-modal data may be acquired via sensors such as camera, microphone, keyboard, display, speakers, chat bubble, Tenderer, or other user interface elements. Such multi-modal data may be analyzed to estimated or infer various features that may be used to infer higher level characteristics such as expression, characters, gesture, emotion, action, attention, intent, etc. Such higher level characteristics may be obtained by processing units at layer 2 and the used by components of higher layers, via the internal API as shown in Fig. 4A, to e.g., intelligently infer or estimate additional information related to the dialogue at higher conceptual levels. For example, the estimated emotion, attention, or other characteristics of a participant of a dialogue obtained at layer 2 may be used to estimate the mindset of the participant. In some embodiments, such mindset may also be estimated at layer 4 based on additional information, e.g., recorded surrounding environment or other auxiliary information in such surrounding environment such as sound.

[0070] The estimated mindsets of parties, whether related to humans or the automated companion (machine), may be relied on by the dialogue management at layer 3, to determine, e.g., how to carry on a conversation with a human conversant. How each dialogue progresses often represent a human user’s preferences. Such preferences may be captured dynamically during the dialogue at utilities (layer 5). As shown in Fig. 4A, utilities at layer 5 represent evolving states that are indicative of parties’ evolving preferences, which can also be used by the dialogue management at layer 3 to decide the appropriate or intelligent way to carry on the interaction. [0071] Sharing of information among different layers may be accomplished via

APIs. In some embodiments as illustrated in Fig. 4A, information sharing between layer 1 and rest of the layers is via an external API while sharing information among layers 2-5 is via an internal API. It is understood that this merely a design choice and other implementations are also possible to realize the present teaching presented herein. In some embodiments, through the internal API, various layers (2-5) may access information created by or stored at other layers to support the processing. Such information may include common configuration to be applied to a dialogue (e.g., character of the agent device is an avatar, voice preferred, or a virtual environment to be created for the dialogue, etc.), a current state of the dialogue, a current dialogue history, known user preferences, estimated user intent/emotion/mindset, etc. In some embodiments, some information that may be shared via the internal API may be accessed from an external database. For example, certain configurations related to a desired character for the agent device (a duck) may be accessed from, e.g., an open source database, that provide parameters (e.g., parameters to visually render the duck and/or parameters needed to render the speech from the duck).

[0072] Fig. 4B illustrates a part of a dialogue tree of an on-going dialogue with paths taken based on interactions between the automated companion and a user, according to an embodiment of the present teaching. In this illustrated example, the dialogue management at layer 3 (of the automated companion) may predict multiple paths with which a dialogue, or more generally an interaction, with a user may proceed. In this example, each node may represent a point of the current state of the dialogue and each branch from a node may represent possible responses from a user. As shown in this example, at node 1, the automated companion may face with three separate paths which may be taken depending on a response detected from a user. If the user responds with an affirmative response, dialogue tree 400 may proceed from node 1 to node 2. At node 2, a response may be generated for the automated companion in response to the affirmative response from the user and may then be rendered to the user, which may include audio, visual, textual, haptic, or any combination thereof.

[0073] If, at node 1, the user responses negatively, the path is for this stage is from node 1 to node 10. If the user responds, at node 1, with a“so-so” response (e.g., not negative but also not positive), dialogue tree 400 may proceed to node 3, at which a response from the automated companion may be rendered and there may be three separate possible responses from the user,“No response,”“Positive Response,” and“Negative response,” corresponding to nodes 5, 6, and 7, respectively. Depending on the user’s actual response with respect to the automated companion’s response rendered at node 3, the dialogue management at layer 3 may then follow the dialogue accordingly. For instance, if the user responds at node 3 with a positive response, the automated companion moves to respond to the user at node 6. Similarly, depending on the user’s reaction to the automated companion’s response at node 6, the user may further respond with an answer that is correct. In this case, the dialogue state moves from node 6 to node 8, etc. In this illustrated example, the dialogue state during this period moved from node 1, to node 3, to node 6, and to node 8. The traverse through nodes 1, 3, 6, and 8 forms a path consistent with the underlying conversation between the automated companion and a user. As seen in Fig. 4B, the path representing the dialogue is represented by the solid lines connecting nodes 1, 3, 6, and 8, whereas the paths skipped during a dialogue is represented by the dashed lines.

[0074] Fig. 5 illustrates exemplary communications among different processing layers of an automated dialogue companion centered around a dialogue manager 510, according to various embodiments of the present teaching. The dialogue manager 510 in Fig. 5 corresponds to a functional component of the dialogue management at layer 3. A dialog manager is an important part of the automated companion and it manages dialogues. Traditionally, a dialogue manager takes in as input a user’s utterances and determine how to respond to the user. This is performed without taking into account the user’s preferences, user’s mindset/emotions/intent, or surrounding environment of the dialogue, i.e., given any weights to the different available states of the relevant world. The lack of an understanding of the surrounding world often limits the perceived authenticity of or engagement in the conversations between a human user and an intelligent agents.

[0075] In some embodiments of the present teaching, the utility of parties of a conversation relevant to an on-going dialogue is exploited to allow a more personalized, flexible, and engaging conversion to be carried out. It facilitates an intelligent agent acting in different roles to become more effective in different tasks, e.g., scheduling appointments, booking travel, ordering equipment and supplies, and researching online on various topics. When an intelligent agent is aware of a user’s dynamic mindset, emotions, intent, and/or utility, it enables the agent to engage a human conversant in the dialogue in a more targeted and effective way. For example, when an education agent teaches a child, the preferences of the child (e.g., color he loves), the emotion observed (e.g., sometimes the child does not feel like continue the lesson), the intent (e.g., the child is reaching out to a ball on the floor instead of focusing on the lesson) may all permit the education agent to flexibly adjust the focus subject to toys and possibly the manner by which to continue the conversation with the child so that the child may be given a break in order to achieve the overall goal of educating the child.

[0076] As another example, the present teaching may be used to enhance a customer service agent in its service by asking questions that are more appropriate given what is observed in real-time from the user and hence achieving improved user experience. This is rooted in the essential aspects of the present teaching as disclosed herein by developing the means and methods to learn and adapt preferences or mindsets of parties participating in a dialogue so that the dialogue can be conducted in a more engaging manner.

[0077] Dialogue manager (DM) 510 is a core component of the automated companion. As shown in Fig. 5, DM 510 (layer 3) takes input from different layers, including input from layer 2 as well as input from higher levels of abstraction such as layer 4 for estimating mindsets of parties involved in a dialogue and layer 5 that learns utilities/preferences based on dialogues and assessed performances thereof. As illustrated, at layer 1, multi-modal information is acquired from sensors in different modalities which is processed to, e.g., obtain features that characterize the data. This may include signal processing in visual, acoustic, and textual modalities.

[0078] Such multi-modal information may be acquired by sensors deployed on a user device, e.g., 110-a during the dialogue. The acquired multi-modal information may be related to the user operating the user device 110-a and/or the surrounding of the dialogue scene. In some embodiments, the multi-model information may also be acquired by an agent device, e.g., 160-a, during the dialogue. In some embodiments, sensors on both the user device and the agent device may acquire relevant information. In some embodiments, the acquired multi-model information is processed at Layer 1, as shown in Fig. 5, which may include both a user device and an agent device. Depending on the situation and configuration, Layer 1 processing on each device may differ. For instance, if a user device 110-a is used to acquire surround information of a dialogue, including both information about the user and the environment around the user, raw input data (e.g., text, visual, or audio) may be processed on the user device and then the processed features may then be sent to Layer 2 for further analysis (at a higher level of abstraction). If some of the multi-modal information about the user and the dialogue environment is acquired by an agent device, the processing of such acquired raw data may also be processed by the agent device (not shown in Fig. 5) and then features extracted from such raw data may then be sent from the agent device to Layer 2 (which may be located in the user interaction engine 140).

[0079] Layer 1 also handles information rendering of a response from the automated dialogue companion to a user. In some embodiments, the rendering is performed by an agent device, e.g., 160-a and examples of such rendering include speech, expression which may be facial or physical acts performed. For instance, an agent device may render a text string received from the user interaction engine 140 (as a response to the user) to speech so that the agent device may utter the response to the user. In some embodiments, the text string may be sent to the agent device with additional rendering instructions such as volume, tone, pitch, etc. which may be used to convert the text string into a sound wave corresponding to an utterance of the content in a certain manner. In some embodiments, a response to be delivered to a user may also include animation, e.g., utter a response with an attitude which may be delivered via, e.g., a facial expression or a physical act such as raising one arm, etc. In some embodiments, the agent may be implemented as an application on a user device. In this situation, rendering of a response from the automated dialogue companion is implemented via the user device, e.g., 110-a (not shown in Fig.

5)·

[0080] Processed features of the multi-modal data may be further processed at layer

2 to achieve language understanding and/or multi-modal data understanding including visual, textual, and any combination thereof. Some of such understanding may be directed to a single modality, such as speech understanding, and some may be directed to an understanding of the surrounding of the user engaging in a dialogue based on integrated information. Such understanding may be physical (e.g., recognize certain objects in the scene), perceivable (e.g., recognize what the user said, or certain significant sound, etc.), or mental (e.g., certain emotion such as stress of the user estimated based on, e.g., the tune of the speech, a facial expression, or a gesture of the user).

[0081] The multimodal data understanding generated at layer 2 may be used by

DM 510 to determine how to respond. To enhance engagement and user experience, the DM 510 may also determine a response based on the estimated mindsets of the user and of the agent from layer 4 as well as the utilities of the user engaged in the dialogue from layer 5. The mindsets of the parties involved in a dialogue may be estimated based on information from Layer 2 (e.g., estimated emotion of a user) and the progress of the dialogue. In some embodiments, the mindsets of a user and of an agent may be estimated dynamically during the course of a dialogue and such estimated mindsets may then be used to learn, together with other data, utilities of users. The learned utilities represent preferences of users in different dialogue scenarios and are estimated based on historic dialogues and the outcomes thereof.

[0082] In each dialogue of a certain topic, the dialogue manager 510 bases its control of the dialogue on relevant dialogue tree(s) that may or may not be associated with the topic (e.g., may inject small talks to enhance engagement). To generate a response to a user in a dialogue, the dialogue manager 510 may also consider additional information such as a state of the user, the surrounding of the dialogue scene, the emotion of the user, the estimated mindsets of the user and the agent, and the known preferences of the user (utilities).

[0083] An output of DM 510 corresponds to an accordingly determined response to the user. To deliver a response to the user, the DM 510 may also formulate a way that the response is to be delivered. The form in which the response is to be delivered may be determined based on information from multiple sources, e.g., the user’s emotion (e.g., if the user is a child who is not happy, the response may be rendered in a gentle voice), the user’s utility (e.g., the user may prefer speech in certain accent similar to his parents’), or the surrounding environment that the user is in (e.g., noisy place so that the response needs to be delivered in a high volume). DM 510 may output the response determined together with such delivery parameters.

[0084] In some embodiments, the delivery of such determined response is achieved by generating the deliverable form(s) of each response in accordance with various parameters associated with the response. In a general case, a response is delivered in the form of speech in some natural language. A response may also be delivered in speech coupled with a particular nonverbal expression as a part of the delivered response, such as a nod, a shake of the head, a blink of the eyes, or a shrug. There may be other forms of deliverable form of a response that is acoustic but not verbal, e.g., a whistle.

[0085] To deliver a response, a deliverable form of the response may be generated via, e.g., verbal response generation and/or behavior response generation, as depicted in Fig. 5. Such a responses in its determined deliverable form(s) may then be used by a Tenderer to actual render the response in its intended form(s). For a deliverable form in a natural language, the text of the response may be used to synthesize a speech signal via, e.g., text to speech techniques, in accordance with the delivery parameters (e.g., volume, accent, style, etc.). For any response or part thereof, that is to be delivered in a non-verbal form(s), e.g., with a certain expression, the intended non-verbal expression may be translated into, e.g., via animation, control signals that can be used to control certain parts of the agent device (physical representation of the automated companion) to perform certain mechanical movement to deliver the non-verbal expression of the response, e.g., nodding head, shrug shoulders, or whistle. In some embodiments, to deliver a response, certain software components may be invoked to render a different facial expression of the agent device. Such rendition(s) of the response may also be simultaneously carried out by the agent (e.g., speak a response with a joking voice and with a big smile on the face of the agent).

[0086] Fig. 6A depicts an exemplary configuration of a dialogue system 600 centered around an information state 610 capturing dynamic information observed during the dialogue, in accordance with an embodiment of the present teaching. The dialogue system 600 comprises multimodal information processor 620, an automatic speech recognition (ASR) engine 630, a natural language understanding (NLU) engine 640, a dialogue manager (DM) 650, a natural language generation (NLG) engine 660, a text-to-speech (TTS) engine 670. The system 600 interfaces with a user 680 to conduct a dialogue.

[0087] During the dialogue, multimodal information is collected from the environment (including from the user 680), which captures the surrounding information of the conversation environment, the speech from the user 680, expressions, either facial or physical, of the user, etc. Such collected multimodal information is analyzed by the multimodal information processor 620 to extract relevant characterizing features in different modalities in order to estimate different characteristics of the user, the environment, etc. For instance, the speech signal may be analyzed to determine speech related features such as talking speed, pitch, or even accent. The visual signal related to the user may also be analyzed to determine, e.g., facial features or physical gestures, etc. in order to determine expressions of the user. Combining the acoustic features and visual features, the multimodal information analyzer 620 may also be able to estimate the emotional state of the user, e.g., high pitch and fast talking plus an angry facial expression may indicate that the user is upset. In some embodiments, the observed user activities may also be analyzed to indicate, e.g., the user is pointing or walking towards a specific object. Such information may provide useful context in understand the intent of the user or what the user is referring to in his/her speech. The multimodal information processor 620 may continuously analyze the multimodal information and store such analyzed information in the information state 610, which is then used by different components in system 100 to facilitate decision makings.

[0088] In operation, the speech information of the user 680 is sent to the ASR engine 630 to perform speech recognition. The speech recognition may include discern the language spoken and the words being uttered by the user 680. To understand the semantics of what the user said, the result from the ASR engine 630 is further processed by the NLU engine 640. Such understanding may rely on not only the words being spoken but also other information such as the gesture of the user 680 and/or other contextual information such as what was said previously. Based on the understanding of the user’s utterance, the dialogue manager 650 (same as the dialogue manager 510 in Fig. 5) determines how to respond to the user and such determined response may then be generated by the NLG engine 660 and further transformed from the text form to speech signals via the TTS engine 670. The output of the TTS engine 670 may then be delivered to the user 680 as a response to the user’s utterance. The process continues via such back and forth responses to carry on the conversation with the user 680.

[0089] As seen in Fig. 6A, components in system 600 are connected to the information state 610, which, as discussed herein, captures the dynamics around the dialogue and provides relevant and rich contextual information that can be used to facilitate speech recognition (ASR), language understanding (NLU), to determine an appropriate response (DM), to generate the response (NLG), and to transform the generated textual response to a speech form (TTS). As discussed herein, the information state 610 may represent the dynamics relevant to a dialogue obtained based on multimodal information, either related to the user 680 or to the surrounding of the dialogue.

[0090] Upon receiving the multimodal information from the dialogue scene (either about the user or about the dialogue surrounding), the multimodal information processor 670 analyzes the information and characterizes the dialogue surroundings at different levels, e.g., acoustic characteristics (e.g., pitch, speed, accent of the user), visual characteristics (e.g., facial expressions of the user, objects in the environment), physical characteristics (e.g., user’s hand waving or pointing at an object in the environment), estimated emotion and/or the state of mind of the user, and/or preferences or intent of the user. Such information may then be stored in the information state 610.

[0091] The rich media contextual information stored in the information state 610 may facilitate to facilitate different components to play their respective roles so that the dialogue may be conducted in a way that is more engaging and more effective with respect to the goals intended, e.g., understanding the utterance of the user 680 in light of what was observed in the dialogue scene, assessing the performance of the user 680, and/or estimating the utilities associated with the user in light of the intended goal of the dialogues, determining how to respond to the utterance of the user 680 based on assessed performance and utilities of the user, and delivering the response in a manner that is consider most appropriate based on what is known about the user, etc.

[0092] For instance, with accent information captured in the information state about a user, represented in both acoustic form (e.g., special way to speak certain phonemes) and visual form (e.g., special visemes of a user), the ASR engine 630 may utilize that information to figure out the words a user said. Similarly, NLU engine 640 may also utilize the rich contextual information to figure out the semantics of what a user means. For instance, if a user points to a computer placing on a desk (visual information) and said,“I like this,” the NLU engine 640 may combine the output of the ASR engine 630 (i.e.,“I like this”) and the visual information that the user is pointing at a computer in the room to understand that by“this” the user means the computer. As another example, if 180 user repeatedly made mistakes in a tutoring session and, at the same time, the user appears to be quite annoyed assessed based on the tone of the speech and on the facial expression (determined based on multimodal information), instead of keeping pressing on the tutoring content, the DM may determine to change the topic temporarily based on known preferences of the user (e.g., likes to talk about Lego games) in order to continue to engage the user. The decision of distracting the user temporarily may be determined based on, e.g., utilities previously observed with respect to the user as to what worked (e.g., temporarily distracting the user based on some preferred topics worked) and what would not work (e.g., continue to pressure the user to do better).

[0093] Fig. 6B is a flowchart of an exemplary process of the dialogue system 600 with the information state 610 capturing dynamic information observed during the dialogue, in accordance with an embodiment of the present teaching. As seen in Fig. 6B, the process is an iterated process. At 605, multimodal information is received, which is then analyzed by the multiple information processor 670 at 625. As discussed herein, the multimodal information includes information related to the user 680 and/or that related to the dialogue surroundings. Multimodal information related to the user may include the user’s utterance and/or visual observations of the user such as physical gestures and/or facial expressions. Information related to the dialogue surroundings may include information related to the environment such as objects present, the spatial/temporal relationships between the user and such observed objects (e.g., user stands in front of a desk) , and/or the dynamics between the user’s activities and the observed objects (e.g., user walked towards the desk and points at a computer on the desk). An understanding of the multimodal information captured from the dialogue scene may then be used to facilitate other tasks in the dialogue system 600.

[0094] Based on the information stored in the information state 610 (representing the past state) as well as the analysis result from the multimodal information processor 670 (on present state), the ASR engine 620 and the NLU engine 630 perform, at 625, respectively, speech recognition to ascertain the words spoken by the user and language understanding based on the recognized words. ASR and NLU may be performed based on the current information state 610 as well as the analysis results from the multimodal information processor 670.

[0095] Based on the multimodal information analysis and the result of language understanding, i.e., what the user said or meant, the changes of the dialogue state are traced, at 635, and such changes are used to update, at 645, the information state 610 accordingly to facilitate the subsequent processing. To carry on the dialogue, the DM 650 determines, at 655, a response based on a dialogue tree designed for the underlying dialogue, the output of the NLU engine 630 (understanding of the utterance), and the information stored in the information state 610. Once the response is determined, the response is generated, by the NLG engine 650, in, e.g., its textual form based on the information state 610. When a response is determined, there may be different ways of saying it. The NLG engine 650 may generate, at 665, a response in a style based on the user’ s preferences or whatever is known to be more appropriate to the particular user in the current dialogue. For instance, if the user answers a question incorrectly, there are different ways to point out that the answer is incorrect. For a particular user in the present dialogue, if it is known that the user is sensitive and easily gets frustrated, a gentler way to tell the user that his/her answer is not correct may be used to generate the response. For example, instead of saying“It is wrong,” the NLG engine 650 may generate a textual response of“It is not completely correct.”

[0096] The textual response, generated by the NLG engine 650, may then be rendered into a speech form, at 675 by the TTS engine 660, e.g., in an audio signal form. Although standard or commonly used TTS techniques may be used to perform TTS, the present teaching discloses that the response generated by the NLG engine 650 may be further personalized based on information stored in the information state 610. For instance, if it is known that a slower talking speed or a softer talking manner works better for the user (e.g., the student is known to have a slower processing speed on speech due to, e.g., ADHD), the generated response may be rendered, at 675 by the TTS engine 660, into a speech form accordingly, e.g., with a lower speed and pitch. Another example is to render the response with an accent consistent with the student’s known accent according to the personalized information about the user in the information state 610. The rendered response may then be delivered, at 685, to the user as a response to the user’s utterance. Upon the response to the user, the dialogue system 600 then traces the additional change of the dialogue and updates, at 695, the information state 610 accordingly.

[0097] Fig. 7A depicts an exemplary construction of the information state representation 610, in accordance with an embodiment of the present teaching. Without limitation, the information state 610 includes estimated minds or mindsets. As illustrated, the estimated minds include agent’s mind 700, user’s mind 720, and shared mine 710 in connection with other information recorded therein. The agent’s mind 700 may refer to the intended goal(s) that the dialogue agent (machine) is to achieve in a particular dialogue. The shared mind 710 may refer to the representation of the present dialogue situation which is a combination of the agent’s carrying out the intended agenda according to the agent’s mind 700 and the performance of the user. The user’s mind 720 may refer to the representation of an estimation, by the agent according to the shared mind or the performance of the user, of where the student is with respect to the intended purpose of the dialogue. For example, if an agent’s current task is teaching a student user the concept of fraction in math (which may include sub-concepts to build up the understanding of fraction), the user’s mind may include an estimated level of mastery of the user on various sub concepts. Such estimation may be derived based on the assessment of the student performance at different stages of the tutoring relevant sub-concepts.

[0098] Fig. 7B illustrates how such different minds are connected in an example of a robot tutor 705 teaching a student user 680 on concept 715 related to adding fractions, in accordance with an embodiment of the present teaching. As seen, a robot agent 705 is interacting with a student user 680 via multimodal interactions. The robot agent 705 may start with the tutoring based on initial agent’s mind 700 (e.g., the course on adding fractions which may be represented as AOGs). During the tutoring, student user 180 may answer questions from the robot tutor 705 and such answers in light of the questions form a certain path, yielding shared mind 710. Based on the user’s answers, the performance of the user is assessed and the user’s mind 720 is estimated with respect different aspects, e.g., whether the student masters the concept taught.

[0099] As seen in Fig. 7A, the estimated minds are also in connection with or include various representations of a plurality types of other information, including but not limited to, Spatial-Temporal-Causal And-Or-Graphs STC-AOGs 730, STC parsed graphs (STC-PGs) 740, dialogue history 750, dialogue context 760, event-centric knowledge 770, common sense models 780, ... , and user profiles 790. These different types of information may be of multiple modalities and constitute different aspects of the dynamics of each dialogue with respect to each user. As such, the information state 610 captures both general information of various dialogues and personalized information with respect to each user and each dialogue.

[00100] These different minds interconnect and together they facilitate different components in the dialogue system 600 to carry out respective tasks in a more adaptive, personalized, and engaging manner. Fig. 7C shows an exemplary relationship among the agent’s mind 700, the shared mind 710, and a user’s mind 720 represented in the information state 610, in accordance with an embodiment of the present teaching. As discussed herein, the share mind 710 is a representation of the present dialogue setting, obtained based on what have been said by the agent to the user and what the user have responded to the agent and it is a combination of what the agent intended (according to the agent’ s mind) and how the user performed in following the agent’ s intended agenda. Based on the shared mind 710, it can be traced as to what the agent is able to achieve and what the user is able to achieve up to that point.

[00101] Tracing such dynamic knowledge enables the system to estimate what the user has so far achieved up to that point or in a tutoring setting, which concept or sub-concepts that the student user has so far mastered, i.e., estimating the user’s mind 720. The estimated student’s mind facilitates the agent to adjust or update the dialogue strategy in order to achieve the intended goal or adjust the agent’s mind by learning how to adapt to the user to derive updated agent’s mind 700. Based on the dialogue history, the dialogue system 600 learns the preferences of the user or what works better for the user (utility) and such information is to be incorporated into the information state, which can be used by the agent to adapt the dialogue strategy based on utility-drive dialogue planning, which further lead to a further update on the shared mind based on the user’s response. The process repeats and the agent continues to adapt dialogue strategy based on the information state. [00102] Fig. 8 A illustrates an exemplary dialogue scene 800 where actual and virtual realities can be combined to achieve adaptive dialogue strategy, in accordance with an embodiment of the present teaching. In this illustration, it can be seen that the dialogue scene is a room with a number of objects, including a desk, a chair, a computer on the desk, walls, a window, and various hanging pictures and/or board on the wall. In addition to such fixtures, there is a robot agent 810 on the desk, a user 840, some sensors such as cameras 820-1 and 820-2, a speaker 830, and a drone 840. Some of the sensors may be fixed (such as the cameras 820 and the speaker 830) and some may be deployable on-the-fly based on needs. One example of such a need based deployable sensor is a drone such as 850. Such a deployable device may include both sensing capabilities and other functionalities such as the capability of producing sound to deliver, e.g., an utterance of the robot agent. Depending on the purpose of deploying such a deployable sensor, the location and orientation of the deployment may be determined based on such a purpose. For instance, as shown in Fig. 8A, the user 840 may enter the dialogue scene 800 with his head turning to the window so that neither of the robot agent 810 and the cameras 820-1 and 820-2 is capable of capturing the face of the user. When the robot agent desires to recognize the user based on the face, it may utilize resources under its control to get the face of the user by, e.g., deploying the drone 850, adjusting the aim of its camera to acquire the data needed. This is illustrated in Fig. 8A.

[00103] In the meantime, in some situations, recognizing a person based on face may also be based on the side view of the face. In this case, the robot agent may also adjust the parameters of certain camera in the dialogue scene to capture the side view of the person for recognition purpose. Cameras deployed in a dialogue scene, although may be mounted as fixture, their poses may be adjusted by applying different tilts, angles, etc. to capture visual information of different regions of the scene. The robot agent 810 may be configured to be able to adjust the parameters of different sensors to achieve that. As shown in Fig. 8A, to obtain a side view of user 840, the robot agent 810 may control the parameters of camera 820-2 to achieve that. In addition to visual information, sensors in other modalities may also be controlled by the robot agent to acquire information in need. For example, the audio sensor 830 in Fig. 8A may be controlled to gather acoustic information in the dialogue scene 800. In some embodiments, there may be multiple robots (not shown) in a dialogue scene with designated tasks. For instance, the robot agent 810 may serve as a primary robot and there may be other deployable second robots in the scene that can be deployed by the primary robot for dynamically designated tasks. For instance, the primary robot 810 may deploy a secondary robot to, e.g., walk towards a user and conduct a dialogue with the user to confirm, e.g., his identity or confirm certain information in question.

[00104] With deployable and/or adjustable multimodal sensors/robots, flexible and adaptive dialogue strategy may be achieved. In some embodiments, in addition to deployment of physical sensors/robots, virtual agents may also be generated and rendered in the dialogue scene. Fig. 8B illustrates exemplary aspects of operation for adaptive dialogue strategy, in accordance with an embodiment of the present teaching. Adaptive dialogue strategy may include spontaneous dialogue which may rely on, e.g., environmental modeling via need based flexible deployment of sensors/robots and collaborative multi-agent based dialogue. As will be discussed below, in spontaneous dialogue between a user and a robot agent, the robot agent may dynamically deploy a virtual agent when needed and so that the dialogue is to be conducted in an augmented reality dialogue scene where a virtual scene and the physical scene is combined. Each virtual agent may have a designated task (e.g., test a student on a math concept in a virtual scene) and may be configured to carry on a conversation with the user within the scope of the designated task. [00105] During a conversation between a virtual agent and the user, the virtual agent may be configured to conduct a conversation based on virtual content that is part of the virtual scene. For instance, the robot agent may generate a virtual agent and a number of virtual objects and render them in the dialogue scene as a virtual scene. The virtual agent may conduct the designated dialogue in accordance with the virtual objects rendered in the virtual scene. Such dynamic virtual content may be dynamically determined based on the main dialogue between the robot agent and the user. For example, if a robot agent is teaching a student user the math concept of adding, knowing that the student likes fruits, the robot agent may generate a virtual scene with a virtual agent who is picking fruits in an orchard and a number of apples and oranges thrown in the space. This virtual scene is generated to facilitate the conversation between the virtual agent and the student about adding the fruits picked from the orchard.

[00106] Each virtual agent may be configured with a sub-dialogue policy adaptively determined based on the progress of the dialogue. The sub-dialogue policy is generated to achieve the designated purpose of the virtual agent and used to control the activities of the virtual agent during the sub-dialogue. A virtual agent may also span off additional virtual agents, each with its own designated purpose and activities conducted to achieve the designated purpose. Each virtual agent, once the designated purpose is fulfilled, will cease to exist and the control may then be returned to the agent (either physical or virtual) that created it. In this manner, the physical robot agent and virtual agent(s) may collaborate to achieve an overall goal of the dialogue.

[00107] For environment modeling, Fig. 9A depicts an exemplary high level system diagram of a system 900 for adaptive environment modeling via robot/sensor collaboration, in accordance with an embodiment of the present teaching. The depicted system 900 may be implemented in a primary sensor such as the robot agent 810 and configured to enable the robot agent 810 to coordinate robot/sensor deployment/adjustment for environment modeling. The robot agent 810 may also be configured to be able to communicate with different deployable secondary robot/sensors (820, 830, and 850) in the dialogue scene and exercise controls thereof.

[00108] As shown in Fig. 9A, the system for environment modeling comprises a visual data analyzer 910, an object recognition unit 920, an audio data analyzer 930, a scene modeling unit 940, a data acquisition adjuster 950, a robot/sensor deployer 960, and a robot/sensor parameter adjuster 970. In operation, there may be a primary robot (such as the robot agent 810) that coordinate the deployment or adjustment of secondary robots/sensors based on what is observed and what is needed. The purpose of deploying secondary robot/sensors may be to acquire needed information in order to modeling the environment. For instance, if from the perspective of the robot agent 810, some object may be occluded by some object in the scene, the robot agent may deploy a sensor installed at a right location in the scene to capture the occluded region with a clear view for analysis. As discussed previously, if the robot agent 810 desires to recognize a user’s identify via face recognition but does not have the sensor data that captures the user’s face, the robot agent 810 may either adjust the pose of a camera at an appropriate location to acquire the needed data (e.g., adjust the parameters of camera 820-2 in Fig. 8A to capture a side view of the user 840) and/or deploy a drone (e.g., 850 in Fig. 8A) with a camera facing the user to acquire the needed facial information.

[00109] The decisions on which robot/sensor to deploy or adjust may be made based on observations made via existing robot/sensor deployed. Data from existing robot/sensor may be acquired and analyzed to understand the environment, which may be incomplete. Based on such an understanding of the surrounding, the robot agent may determine whether additional information is needed and if so, from where. Fig. 9B is a flowchart of an exemplary process of the system 900 for adaptive environment modeling via robot/sensor collaboration, in accordance with an embodiment of the present teaching. At 905, information from presently deployed sensors may be received first. Such received information may be in multiple modalities such as audio and visual data. For example, the visual data analyzer 910 may receive visual input from camera(s) deployed (from a camera either mounted in the environment or carried by a drone) and/or the audio data analyzer 930 may receive audio input from acoustic sensors deployed. For example, if the acoustic signal may relate to the noise related to door opening and the visual signal from a presently deployed camera (e.g., camera 820-1 in Fig. 8 A) may capture visual information associated with one corner of the dialogue scene.

[00110] The received multimodal information may then be analyzed, by the visual/audio data analyzers 910 and 920 at 912. Objects present in the scene and observable from the presently deployed camera may be detected, at 925, by the object recognition unit 920 based on, e.g., object detection models 915. Such information may then be used by the scene modeling unit 940, at 935, to perform scene modeling in accordance with, e.g., scene interpretation models 945. For instance, camera 820-1 may capture the scene in which a user 840 walks in a door in the scene and the audio sensor 830 may capture the sound of door opening. Analysis result of such multimodal data (e.g., detection of a door and opening as well as a person walking through the door) may be construed based on scene interpretation models 945. The robot agent 810 may acquire data from such deployed sensors through its connection with them, exercise control over any of the connected robots/sensors via deployment and/or adjusting parameters associated therewith.

[00111] Based on the present modeling of the dialogue environment based on the multimodal data, the robot agent may determine, at 937, whether additional information is needed in order to understand the environment. For example, if certain information is incomplete, e.g., one corner of the room is not observed or the face of the user 810 is not visible, the robot agent 810 may proceed to 952 to analyze what is known to identify what is missing and then determine, by the data acquisition adjuster 950 at 965, based on information stored in robot/sensor deployment configuration 955 (see Fig. 9A), to understand what is available for deployment and with what configuration parameters. With that knowledge, the data acquisition adjuster 950 may determine, at 967, whether it needs to deploy a secondary robot/sensor to acquire information from a certain space in the dialogue scene. For example, with reference to Fig. 8A, when the robot agent needs to see the face of the user walking into the dialogue scene and recognized that the user is facing the window where no camera can capture, the robot agent may then deploy a drone with a carry- on camera to a location where the drone can use its camera to capture the face of the user.

[00112] If the presently deployed robot/sensor can be adjusted to cover the desired space (answer no to the inquiry whether to deploy a robot/sensor at 967), the data acquisition adjuster 950 invokes the robot/sensor parameter adjuster 970 to compute, at 985, the needed adjustment to parameters of certain robot/sensor. If the data acquisition adjuster 950 determines, at 967, that additional robot/sensor is to be deployed, the data acquisition adjuster 950 invokes the robot/sensor deployer 960 to determine, at 975, the robot/sensor to be deployed to obtain the needed information. The determination is based on the information stored in the robot/sensor deployment configuration 955. The parameters to be used to deploy the new robot/sensor are then determined by the robot/sensor parameter adjuster 970 at 985 based on the information related to configuration of such robot/sensor. When the adjustment is made (either deployed a robot/sensor with certain parameter configuration or parameters of presently deployed robot/sensor have been adjusted), the system 900 loops back to 905 to receive multimodal information acquired from adjusted robot/sensors. The process repeats until the scene modeling is completed. In some embodiments, a primary robot may also un-deploy robot/sensors. For example, if the purpose of a robot/sensor has been fulfilled (e.g., provided needed data), the robot/sensor can be un-deployed.

[00113] As discussed herein, in addition to adjusting robot/sensor parameters and/or dynamically deploying robot/sensor based on needs, a primary robot may also be configured to generate a virtual environment including a virtual agent and/or virtual objects in the real dialogue scene to create an augmented reality dialogue scene. Fig. 10A describes an exemplary hierarchical dialogue agent collaboration scheme in a virtual scene of an augmented dialogue reality, in accordance with an embodiment of the present teaching. A virtual scene may include a virtual host agent, and possibly with some virtual objects that, together with the virtual host agent, may be rendered in accordance with a certain layout to form a virtual scene. The virtual host agent may be deployed with some objectives to carry out a designated dialogues task in accordance with a dialogue policy generated for the designated dialogue task to meet the objectives.

[00114] A virtual scene is rendered in the dialogue scene so that it creates an augmented dialogue reality. In this augmented dialogue reality, the initial dialogue manager may either hands off the dialogue management to the virtual agent in the virtual scene or may co-operate with the virtual agent with coordinated allocation of tasks. In some embodiments, a virtual agent may operate to carries out a designated dialogue task in a virtual scene with one or more virtual objects, while the dialogue manager that creates the virtual scene may operate in a separate scene (i.e., physical space) and time. In some situations, while a virtual agent is operating in the virtual scene, the dialogue manager in the original dialogue scene may be suspended. When the virtual agent fulfills its objectives, the dialogue manager may be resumed to continue the original dialogue. In some embodiment, a dialogue manager may create multiple virtual scenes with a virtual agent operating in each virtual scene.

[00115] A visual host agent may also further create additional secondary virtual scenes with secondary virtual agents, each of which may also have additional objectives with designated tasks and corresponding designated dialogue policies designed to achieve the additional objectives. In some embodiments, such a secondary virtual scene may replace the initial virtual scene with the host virtual agent so that the initial virtual scene ceases to exist. Accordingly, the initial augmented dialogue reality dialogue is changed to create a modified augmented dialogue reality. In some embodiments, the secondary virtual scene may co-exist with the initial virtual scene. In this case, the initial augmented dialogue reality is further augmented to have both the initial virtual agent and the secondary virtual agent present at the same time (but with coordinated operation as disclosed below).

[00116] Such a secondary virtual agent may have some designated objectives to be achieved via a new designated dialogue task with a new designated operational policy. Each virtual scene (with a virtual agent and/or virtual objects) may crease to exist once the designated objectives are fulfilled. In the case where a secondary virtual scene replaces the initial (or parent) virtual scene, the secondary virtual agent takes over the control in dialogue management while in operation. When it exits from the operation and ceases to exist, the initial (or parent) virtual scene (that created the secondary virtual scene) will then resume its operation to continue the dialogue. In the case where a second virtual scene co-exists with the initial virtual scene, each of the parent virtual agent and the secondary virtual agent may collaborate with separate yet complementary tasks. [00117] In some embodiments, the secondary virtual agent may cease to exist when it achieves its objectives without affecting the parent virtual agent. In some embodiments, the exit of either one of the virtual agents may cause the other to also cease to exist. In some embodiments, there may be a seniority or priority order, e.g., the initial dialogue manager that started the augmented dialogue reality may have the highest seniority/priority and each virtual scene created has a lower seniority/priority than the parent that created it. In some embodiments, the exit of one with a higher seniority/priority may cause all other virtual agents at a lower seniority/priority order cease to exist but not vice versa. For instance, if a virtual agent exits virtual scene, all secondary virtual scenes created by it may also cease to exist.

[00118] Fig. 10B shows exemplary composition of a virtual agent to be deployed in an augmented dialogue reality, in accordance with an embodiment of the present teaching. Each virtual agent may be represented by a character (which may be dynamically determine during the dialogue based on, e.g., user preference represented in the information state 610). Such an agent may operate with one or more objects appearing with the virtual agent in the virtual scene, which may form the basis of the dialogue. A virtual agent may also include a planner. The planner may comprise a subordinate dialogue manager for carrying out its designated dialogue task in accordance with a designated dialogue policy associated therewith, and a scheduler which together with the subordinate dialogue manager may control the virtual agent on the timing to enter or exit the dialogue scene. For instance, the exit condition may specify the conditions under which the virtual agent is considered to have fulfilled its objectives so that it can cease to exist.

[00119] Figs. 11 A - 1 IE provide examples of augmented dialogue reality in which an actual dialogue scene is combined with a virtual dialogue scene via collaboration among agents, some real and some virtual. Fig. 11 A shows an exemplary dialogue scene with a robot agent 810 and a user 1110, in accordance with an embodiment of the present teaching. The robot agent 810 here is capable of creating virtual scene(s) when needed in order to generating an augmented dialogue reality. Fig. 11B shows an augmented dialogue reality 1100 where a robot agent 810 interacting with a user 1110 in a dialogue scene creates a virtual scene with a virtual agent (avatar) 1120 aiming to carry out a designated dialogue tasks involving user 1110, in accordance with an embodiment of the present teaching. In this augmented dialogue reality 1100, the dialogue scene and the virtual scene is mixed to create the augmented dialogue reality 1100, in which, the virtual scene includes the virtual agent/avatar 1120, a set of virtual objects 1130, and a corresponding designated dialogue policy which may be used by the virtual agent to conduct the designated dialogue task to teach, e.g., a user to count (e.g., how many children in a certain color).

[00120] The virtual agent/avatar 1120 may be created with an objective of attracting the user to learn how to count in a fun way and it is invoked when, e.g., the dialogue manager observed that the user (a kid) is losing attention or interest. The virtual scene with the virtual agent 1120 may cease to exist when the user can count correctly according to some criterion (e.g., correct for a certain number of times). In this example, there are four colors involved in the objects and each time, the avatar 1120 may ask the user to count in accordance with one color. The exit condition may be specified associated with the avatar (e.g., 80% of time counting correctly) and once met, the virtual avatar 1120 may be considered as having fulfilled its objective so that it, together with the associated objects 1130, may cease to exist.

[00121] Fig. l lC shows an augmented dialogue reality 1150 with more than one virtual agents, each of which may have a separate designated objective, in accordance with an embodiment of the present teaching. As shown in Fig. 11C, a robot agent 810 creates an augmented dialogue reality 1150 with two virtual agents 1120 and 1140. Virtual avatar 1120 corresponds to a first virtual agent created for, e.g., carrying on a designated dialogue. Virtual agent 1140 which is a clown in this example is for providing companionship to the user, in accordance with an embodiment of the present teaching. In this example, the robot agent 810 may create both the avatar 1120 in some embodiments. The robot agent 810 may create the virtual companion 1140 upon an observation that the user 1110 is not happy or appears to be frustrated. Creating the virtual companion 1140 may be to cheer the user up. In this case, the avatar 1120 and the companion 1140 may simultaneously present in the augmented dialogue reality 1150 but may each have a different objective and a corresponding dialogue task and policy. For instance, the avatar 1120 may be responsible for presenting the virtual objects and asking questions, while the companion 1140 may be designed to speak to the user only when the avatar finished asking questions to, e.g., encourage the user to answer or give hint or do something else to help the user to answer the questions.

[00122] In some embodiments, the virtual companion 1140 may be thrown or generated by the virtual avatar 1120. In this case, the virtual companion 1140 is created by the avatar 1120, which may do so based on similar observation that the user seems to need some cheer-up while interacting with the avatar 1120. Fig. 1 ID shows the embodiment in which the virtual avatar 1120 generates the virtual companion 1140, in accordance with an embodiment of the present teaching. Fig. 11E shows another augmented dialogue reality 1170 where the robot agent 810 sequentially creating different virtual scenes, each being responsible for a designated task, in accordance with an embodiment of the present teaching. In this illustration, the robot agent 810 may create multiple augmented dialogue realities at different times. For example, subsequent to, e.g., the virtual avatar 1120 fulfilled its objective on counting and ceases to exist, the robot agent 810 creates another avatar 1180, whose objective may be to teach the user 1110 the concept of adding, which is one step further than counting as performed by the avatar 1120. The subsequent avatar 1180 may be animated to pick different fruits in an orchard and present the picked fruits to the user and ask the user to add different pieces of fruit to come up with a total. In the above examples, it show that the robot agent 810 may continually create different virtual agents at different times, each of them may be responsible to achieve certain objectives. In addition, these examples also show that a virtual agent may also create, based on needs detected from a dialogue it conducts in the augmented dialogue reality, one or more secondary virtual scenes with virtual agents to fulfill the objectives associated with such needs.

[00123] Fig. 12A depicts an exemplary high level system diagram of a virtual agent manager 1200 operating in connection with a dialogue manager 650 to manage a dialogue in an augmented dialogue reality, in accordance with an embodiment of the present teaching. In conventional dialogue systems, a dialogue is managed by a dialogue manager based on a dialogue tree that is pre-determined. According to the present teaching, the dialogue manager 650 also manages a dialogue with a user 680 based on a dialogue trees/policies 1212. Different from the conventional technologies, the dialogue manager 650 also relies on information from the information state 610 updated by an information state updater 1210 based on the dynamics of the dialogue with the user 680 and coordinates or collaborates with a virtual agent manager 1200 to adaptively create an augmented dialogue reality in order to enhance the effectiveness of the dialogue and improve user engagement.

[00124] As discussed herein with respect to Fig. 7A, the information state 610 includes different types of information characterizing the dynamics of the dialogue environment, history, events, user preferences, and the estimated minds of different parties involved in a dialogue. Such information is continually updated by the information state updater 1210 during dialogues and can be used to adaptively conduct the dialogue based on what may be working better for the user involved. With such rich information, the dialogue manager 650 may integrate information from different sources and at different levels of abstraction to make dialogue control decisions, including, e.g., whether and when to include a virtual agent in the dialogue scene, the objective(s) to be achieved by the virtual agent, etc. For instance, if the dialogue manager 650 may estimate that a young user is distracted during a dialogue and the user likes avatars so that deploying an avatar to continue the conversation with the user on a certain subject may help to focus the user, the dialogue manager 650 may request the virtual agent manager 1200 to generate a virtual scene with an avatar to carry out a conversation on the subject matter with the user.

[00125] Fig. 12B is a flowchart of an exemplary process of the dialogue manager 650 in collaboratively managing a dialogue with a virtual agent manager, in accordance with an embodiment of the present teaching. In interacting with a user 680 in a dialogue, each utterance from the user is processed to reach a spoken language understanding (SLU) result (not shown). When the dialogue manager 650 receives, at 1205, the SLU result obtained based on the user’s utterance, it accesses, at 1215, relevant information with respect to the user from the information state 610. Based on the SLU result and the information state, the dialogue manager 650 determines, at 1225, dialogue strategy which includes a determination whether a virtual scene is to be generated. If no virtual scene is to be generated, determined at 1235, the dialogue manager 650 accesses, at 1265, a relevant dialogue tree in the dialogue trees/policies 1212, determines, at 1275, a response to the user based on the dialogue tree, and then responds, at 1285, the user based on the determined response. If a virtual scene is to be generated, the dialogue manager 650 invokes, at 1245, the virtual agent manager 1200 to create a virtual scene. Such a created virtual scene is to be rendered or projected in the physical dialogue scene to form an augmented reality dialogue scene. As discussed herein, a virtual agent such as an avatar will appear in the virtual scene to perform certain activities (virtual) to interact with the user to carry on a sub-dialogue on a designated subject matter.

[00126] In some embodiments, once the virtual agent manager is invoked to handle the sub-dialogue, the dialogue manager 650 may wait until the sub-dialogue is ended when a release signal is received at 1255 indicating the return of control of the dialogue back to the dialogue manager 650. The process then proceeds to step 1215 to continue the dialogue based on an assessment of the current information state. In some embodiments, the dialogue manager 650 may start a new dialogue session determined based on, e.g., how the virtual agent wraps up the sub-dialogue. For example, if the dialogue manager 650 invokes the virtual agent manager 1200 to, e.g., teaching a user to add via a fun but virtual visual scene, how the dialogue manager 650 is to continue the dialogue when the virtual agent releases itself may depend on the status of the sub dialogue. If the sub-dialogue is successful, the exit condition of release the virtual agent based on success may be met and, in this case, the dialogue manger 650 may move on to the next topic in the dialogue. If the sub-dialogue is not successful, the virtual agent may exit when the intended goal is not met. In this case, the dialogue manager 650 may proceed with some different strategies. In either case, once the virtual agent exits and the virtual scene is released, the dialogue manager 650 may go back to 1215 to determine a strategy to continue the conversation based on, e.g., information in the current information state.

[00127] To collaborate with the virtual agent manager 1200, the dialogue manager 650 may provide different operational parameters to the virtual agent manager 1200 based on, e.g., the current state of the dialogue and the intended purpose of the dialogue. For example, the dialogue manger 650 may provide a pointer to a portion of the dialogue tree currently used, the objective to be achieved, the user’s identity, etc. In some embodiments, information modeling the current dialogue environment stored in the environment modeling database 980 (generated, e.g., by the environment modeling system 900 as disclosed in Figs. 9A-9B) may also be accessed by the virtual agent manager 1200 in order for it to generate a virtual scene in a manner consistent with the current dialogue environment.

[00128] With the information in connection with a request for generating a virtual scene from the dialogue manager 650, the virtual agent manager 1200 may access data in the information state 610 characterizing, e.g., the current dialogue and its intended goal(s), the user and preferences, the environment, the history, and the estimated states of the mind of the participants of the dialogue, etc. In this illustrated embodiment, the virtual agent manager 1200 comprises a virtual scene determiner 1220, a virtual agent determiner 1230, avirtual object selector 1240, a dynamic policy generator 1250, an augmented reality launcher 1260, and a policy enforcement controller 1270. Fig. 12C is a flowchart of an exemplary process of the virtual agent manager 1200, in accordance with an embodiment of the present teaching. Upon being invoked, the virtual scene determiner 1220 in the virtual agent manager 1200 accesses, at 1207, information from different sources for the purpose of determining what the virtual scene is to be. Such information may include data stored in the information state 610, the environment modeling information of the dialogue scene stored in database 980, etc.

[00129] As discussed herein, a virtual scene may include a virtual agent and/or some virtual objects to be thrown in the dialogue scene. For example, as shown in Figs. 1 IB - 1 IE, a virtual agent may correspond to an avatar designed to perform some acts, e.g., throwing objects in the space and asking the user to count or add. To create a virtual agent, the virtual scene determiner 1220 may invoke the virtual agent determiner 1230 to determine, at 1217, an appropriate virtual agent based on, e.g., the information state 610 such as user preferences (likes avatar and colorful objects), available models for virtual agents stored in a virtual agent database 1232. It may also invoke the virtual object selector 1240 to select, at 1227, objects to be rendered in the virtual scene based on available object models stored in a virtual object database 1222.

[00130] In some embodiments, a determination of a virtual agent may be made based on, e.g., user preferences stored in the information state 610. To control the virtual agent in the virtual scene, a policy may be dynamically generated, at 1237, by the dynamic policy generator 1250 based on, e.g., information passed from the dialogue manager 650 (e.g., a part of the dialogue tree) and various modeled sub-policies stored in a sub-policy database 1242. Such a sub-policy for a sub-dialogue may be related to the overall dialogue policy that governs the operation of the dialogue manager 650. In some embodiments, such generated sub-policy may also be used in selecting virtual objects to be used in the virtual scene. Based on the virtual agent determined (by the virtual agent determiner 1230) and the selected virtual objects (by the virtual object selector 1240), the virtual scene determiner 1220 generates, at 1247, a virtual scene based on the virtual agent and the selected virtual objects.

[00131] The dynamically generated sub-policy for the virtual agent may then be used, by the policy operation controller 1270 to govern the behavior of the virtual agent. To render the virtual scene in the dialogue scene to create an augmented reality dialogue scene, the virtual scene determiner 1220 sends the information characterizing the virtual scene to the augmented reality launcher 1260, which then launches, at 1257, the virtual scene with the virtual agent and virtual objects present therein. Once launched, an instance of the virtual scene is registered, at 1267, in a virtual scene instance registry 1252. Any instance registered in registry 1252 may correspond to a live virtual scene with associated virtual agent and objects. The registration may also include the sub-policy associated therewith. Such registered virtual scene and the information associated therewith may be used by the policy enforcement controller 1270 to control, at 1277, the virtual characters/objects related to the virtual scene in accordance with the sub-policy associated therewith. During the performance of the virtual agent based on the sub-policy, the policy enforcement controller 1270 controls the virtual agent to carry on a dialogue with the user 680. Such interaction may continue to be monitored by the information state updater 1210. In this case, although the dialogue manager may standby without taking actions when a virtual agent steps in to conduct the dialogue, the information state 610 may be continuously monitored and updated.

[00132] In operating the virtual scene and controlling performance of the virtual agent with respect to the virtual objects, the sub-policy associated therewith may specify one or more exit conditions, which may be checked, at 1280, to see if any of the exit conditions is satisfied. If an exit condition is satisfied (e.g., the user answers questions from the virtual agent correctly), the policy enforcement controller 1270 proceeds to release the virtual agent and the scene by removing the instance created for the virtual scene in the virtual scene instance registry 1252 and may then send a release signal to dialogue manager 650. At that point, the control of the dialogue may be transferred from the virtual agent manager 1200 back to the dialogue manager 650.

[00133] If none of the exit conditions is met, it may be further checked, at 1285, to see whether there is a need to create another virtual scene with a different virtual agent and associated objects and sub-policy. In some situation, this may be warranted. For instance, if a user is just not cooperating and seems to be bored, the current virtual agent may be programmed to activate a different virtual agent that may be better positioned to enhance the engagement. If this is to occur, the current virtual agent may generate, at 1287, a request with relevant information, create a new instance of the virtual agent manager 1200 at 1290, and then invokes, at 1295, the newly created new instance of the virtual agent manager 1200 which will lead to a process similar to the one depicted in Fig. 12C. Thus, virtual scene and management thereof may be recursive. If none of the exit conditions associated with the virtual scene is meet and there is no need to create another virtual scene, the virtual agent in the current virtual scene may be controlled to proceed to step 1207 to continue the process. In this way, the virtual agent may continue to carry out the designated sub-policy until either being released to return the control back to whoever created it upon any of the exit conditions is met or handing off the dialogue to another virtual agent to serve some other designated purpose.

[00134] As discussed herein, creating a virtual environment may involve creating some virtual character such as an avatar and virtual objects, which are then rendered in some identified space. The determination of a virtual scene to be generated (made by the virtual scene determiner 1220) may include the character of the virtual agent and the virtual objects to be used by the virtual agent. Once generated, information related to the virtual scene is sent to the augmented reality launcher 1260 for render the virtual agent and objects in the dialogue scene. In doing so, the augmented reality launcher 1260 may need to access various information to ensure that the rendering is done appropriately in compliant with different constraints. Fig. 13 A illustrates exemplary types of constraints that the augmented reality launcher 1260 may need to observe, in accordance with an embodiment of the present teaching. Constraints that may be used to control rendering of a virtual scene in a real scene may include physical constraints, visual constraints, and/or semantic constraints.

[00135] Physical constraints may include the requirements of rendering an object on a tabletop or render a virtual agent standing on the floor. Visual constraints may include limitations to be enforced in order for the visual scene to make sense. For example, the virtual character should be rendered in the field of view of the user and/or facing the user. One example shown in Fig. 13B illustrates the point, in which a virtual scene 1310 is rendered in a dialogue scene in such a way that is not within the field of view of a user 1320 so that the user 1320 cannot even see the rendered scene. Similarly, virtual objects to be presented to the user may be the basis of the conversation that the virtual character intends to have. For instance, objects may be rendered in a way that will serve the intended purpose, including, e.g., they need to be rendered also within the field of view of the user and depending on the purpose of presenting such virtual objects, they may need to be arranged in a manner that will serve the purpose. Another example is shown in Fig. 13C, where a virtual avatar 1120 and a number of objects (flying birds) 1340 and 1350 are rendered within the field of view of a user 1330.

[00136] Another exemplary type of constraint is semantic constraints, which are limitations to rendering given some known purpose of rendering the virtual scene. For instance, if the purpose of throwing virtual objects in the scene is for the user to learn how to count, the virtual objects should not occlude each other. If virtual objects are rendered to occlude each other, it makes it difficult for the user to count and for the robot agent to assess whether the user actually knows how to count. One example is shown in Fig. 13D, where a number of coins are rendered in a manner that they occlude each other in a virtual scene 1360, making it difficult for a person who sees the rendered coins to count. Instead, given the known purpose (semantic constraint) of rendering the coins is to enable a user to count, another way to render the coins in compliant with the semantic constraint is shown in Fig. 13E, where there is no occlusion, making it easy to count.

[00137] Therefore, in creating a virtual scene in an augmented reality dialogue scenario, rendering both a virtual character and objects may subject to different constraints and the constraints as applied to render a virtual character may differ from that applied to render virtual objects. Fig. 13F shows different constraints to be observed in rendering a virtual agent, in accordance with an embodiment of the present teaching. As discussed herein, a virtual agent needs to be rendered in general within the field of view of a user engaged in the dialogue, within a certain distance to the user (not too far, not too close), with a certain pose, i.e., at a certain location with a height/size and a certain orientation (e.g., face the user), and may not occlude virtual objects that the user needs to see.

[00138] Fig. 13G shows different constraints to be observed in rendering virtual objects, in accordance with an embodiment of the present teaching. Each virtual object may also need to be rendered within the field of view of a user engaged in the dialogue, within a certain distance to the user (not too far, not too close), with a certain pose, i.e., at a certain location with a height/size. Depending on the purpose of deploying virtual objects, they may also need to comply certain inter-object spatial relationship restrictions such as without any occlusion.

[00139] Certain dynamics during a dialogue may cause constraints change with time, e.g., when the user moves around or change pose. A virtual scene may need to be rendered continuously with respect to the changing constraints. For example, when field of view changes, a virtual scene may need to be re-rendered at a different spatial space. In some embodiments, additional virtual character may be introduced in the augmented reality scene, requiring the virtual character (such as the virtual companion 1140 as shown in Figs. 11C - 1 IE) and objects deployed earlier to be rendered differently to accommodate the additional agent.

[00140] Fig. 14 depicts an exemplary high level system diagram of the augmented reality launcher 1260 for rendering a virtual scene in an actual dialogue scene, in accordance with an embodiment of the present teaching. In this illustrated embodiment, the augmented reality launcher 1260 comprises a user pose determiner 1410, a constraint generator 1420, an agent pose determiner 1430, an object pose determiner 1440, a visual scene generator 1460, a text to speech (TTS) unit 1450, a, audio video (A/V) synchronizer 1470, an augmented reality Tenderer 1480, and a virtual scene registry unit 1490. Fig. 15 is a flowchart of an exemplary process of the augmented reality launcher 1260 for combining a virtual scene with an actual dialogue scene, in accordance with an embodiment of the present teaching.

[00141] In operation, the agent pose determiner 1430 and the object pose determiner 1440 receive, at 1500, information related to the virtual agent and virtual objects from the virtual scene determiner 1220 (see Fig. 12A). In addition, the user pose determiner 1410 accesses, at 1510, information from the information state 610 and determine, at 1515, e.g., user’s pose for the purposes of estimating a field of view to be applied to the virtual scene. Furthermore, to generate dynamic constraints to be used to render the virtual scene, the constraint generator 1420 receives, at 1520, information about the sub-policy for the virtual agent and then generate, at 1525, different types of constraints (physical, visual, and semantic) in consideration of the dialogue scene (from the information state 610), the intended purposes of the sub-dialogue, as well as the field of view (estimated based on the current pose of the user).

[00142] The constraint generator 1420 obtains information from different sources in order to generate physical constraints 1422, visual constraints 1442, and semantic constraints 1432. For instance, it may receive the estimated user pose information from 1410, information from the information state 610 about the physical dialogue scene, the current sub-dialogue policy associated with the virtual agent to be generated in order to determine, correspondingly, the constraints to be imposed on the physical location of the virtual agent/objects, the field of view in accordance with the user’s pose/appearance, and the limitations to be used in rendering objects based on the objectives of the virtual agent and the physical conditions in the space where the objects are to be rendered.

[00143] Such generated constraints are then used, at 1530, by the agent pose determiner 1430, the object pose determiner to figure out where to position and how to orient the virtual agent and the objects in light also of the user’s pose information (from the user pose determiner 1410). With the individual and relative locations/orientations of the virtual agent/objects determined, the visual scene generator 1460 generate, at 1535, the virtual scene with such virtual agent/objected therein in a manner that meets the physical/visual/semantic constraints to make sure that the virtual agent and objects are to be rendered within the field of view (estimated, e.g., by the user pose determiner 1410 based on estimated user’s pose), with an appropriate size and distance, and/or without occlusion. At the same time, as the virtual agent is to conduct a sub dialogue with a user in accordance with a designated sub-policy, i.e., carry out a conversation, the speech of the virtual agent (e.g., dictated by the sub-policy) is generated, at 1540, by the TTS unit 1450. Such speech may then be synchronized, at 1545 by the A/V synchronizer 1470, with the visual scene generated. The visual scene with synchronized visual/audio may then be sent, from the A/V synchronizer 1470, to the augmented reality Tenderer 1480, that is to render, at 1550, the virtual scene in the physical dialogue scene to form an augmented reality dialogue scene. As discussed herein, such a virtual scene is then registered, at 1555 by the virtual scene registry unit 1490, in the virtual scene instance registry 1252.

[00144] Fig. 16 is an illustrative diagram of an exemplary mobile device architecture that may be used to realize a specialized system implementing the present teaching in accordance with various embodiments. In this example, the user device on which the present teaching is implemented corresponds to a mobile device 1600, including, but is not limited to, a smart phone, a tablet, a music player, a handled gaming console, a global positioning system (GPS) receiver, and a wearable computing device (e.g., eyeglasses, wrist watch, etc.), or in any other form factor. Mobile device 1400 may include one or more central processing units (“CPUs”) 1640, one or more graphic processing units (“GPUs”) 1630, a display 1620, a memory 1660, a communication platform 1610, such as a wireless communication module, storage 1690, and one or more input/output (I/O) devices 1640. Any other suitable component, including but not limited to a system bus or a controller (not shown), may also be included in the mobile device 1600. As shown in Fig. 16 a mobile operating system 1670 (e.g., iOS, Android, Windows Phone, etc.), and one or more applications 1680 may be loaded into memory 1660 from storage 1690 in order to be executed by the CPU 1640. The applications 1680 may include a browser or any other suitable mobile apps for managing a conversation system on mobile device 1600. User interactions may be achieved via the I/O devices 1640 and provided to the automated dialogue companion via network(s) 120.

[00145] To implement various modules, units, and their functionalities described in the present disclosure, computer hardware platforms may be used as the hardware platform(s) for one or more of the elements described herein. The hardware elements, operating systems and programming languages of such computers are conventional in nature, and it is presumed that those skilled in the art are adequately familiar therewith to adapt those technologies to appropriate settings as described herein. A computer with user interface elements may be used to implement a personal computer (PC) or other type of workstation or terminal device, although a computer may also act as a server if appropriately programmed. It is believed that those skilled in the art are familiar with the structure, programming and general operation of such computer equipment and as a result the drawings should be self-explanatory. [00146] Fig. 17 is an illustrative diagram of an exemplary computing device architecture that may be used to realize a specialized system implementing the present teaching in accordance with various embodiments. Such a specialized system incorporating the present teaching has a functional block diagram illustration of a hardware platform, which includes user interface elements. The computer may be a general purpose computer or a special purpose computer. Both can be used to implement a specialized system for the present teaching. This computer 1700 may be used to implement any component of conversation or dialogue management system, as described herein. For example, conversation management system may be implemented on a computer such as computer 1700, via its hardware, software program, firmware, or a combination thereof. Although only one such computer is shown, for convenience, the computer functions relating to the conversation management system as described herein may be implemented in a distributed fashion on a number of similar platforms, to distribute the processing load.

[00147] Computer 1700, for example, includes COM ports 1750 connected to and from a network connected thereto to facilitate data communications. Computer 1700 also includes a central processing unit (CPU) 1720, in the form of one or more processors, for executing program instructions. The exemplary computer platform includes an internal communication bus 1710, program storage and data storage of different forms (e.g., disk 1770, read only memory (ROM) 1730, or random access memory (RAM) 1740), for various data files to be processed and/or communicated by computer 1700, as well as possibly program instructions to be executed by CPU 1720. Computer 1700 also includes an I/O component 1760, supporting input/output flows between the computer and other components therein such as user interface elements 1780. Computer 1700 may also receive programming and data via network communications. [00148] Hence, aspects of the methods of dialogue management and/or other processes, as outlined above, may be embodied in programming. Program aspects of the technology may be thought of as“products” or“articles of manufacture” typically in the form of executable code and/or associated data that is carried on or embodied in a type of machine readable medium. Tangible non-transitory“storage” type media include any or all of the memory or other storage for the computers, processors or the like, or associated modules thereof, such as various semiconductor memories, tape drives, disk drives and the like, which may provide storage at any time for the software programming.

[00149] All or portions of the software may at times be communicated through a network such as the Internet or various other telecommunication networks. Such communications, for example, may enable loading of the software from one computer or processor into another, for example, in connection with conversation management. Thus, another type of media that may bear the software elements includes optical, electrical and electromagnetic waves, such as used across physical interfaces between local devices, through wired and optical landline networks and over various air-links. The physical elements that carry such waves, such as wired or wireless links, optical links or the like, also may be considered as media bearing the software. As used herein, unless restricted to tangible“storage” media, terms such as computer or machine“readable medium” refer to any medium that participates in providing instructions to a processor for execution.

[00150] Hence, a machine-readable medium may take many forms, including but not limited to, a tangible storage medium, a carrier wave medium or physical transmission medium. Non-volatile storage media include, for example, optical or magnetic disks, such as any of the storage devices in any computer(s) or the like, which may be used to implement the system or any of its components as shown in the drawings. Volatile storage media include dynamic memory, such as a main memory of such a computer platform. Tangible transmission media include coaxial cables; copper wire and fiber optics, including the wires that form a bus within a computer system. Carrier-wave transmission media may take the form of electric or electromagnetic signals, or acoustic or light waves such as those generated during radio frequency (RF) and infrared (IR) data communications. Common forms of computer-readable media therefore include for example: a floppy disk, a flexible disk, hard disk, magnetic tape, any other magnetic medium, a CD-ROM, DVD or DVD-ROM, any other optical medium, punch cards paper tape, any other physical storage medium with patterns of holes, a RAM, a PROM and EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave transporting data or instructions, cables or links transporting such a carrier wave, or any other medium from which a computer may read programming code and/or data. Many of these forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to a physical processor for execution.

[00151] Those skilled in the art will recognize that the present teachings are amenable to a variety of modifications and/or enhancements. For example, although the implementation of various components described above may be embodied in a hardware device, it may also be implemented as a software only solution— e.g., an installation on an existing server. In addition, the fraudulent network detection techniques as disclosed herein may be implemented as a firmware, firmware/software combination, firmware/hardware combination, or a hardware/firmware/ software combination.

[00152] While the foregoing has described what are considered to constitute the present teachings and/or other examples, it is understood that various modifications may be made thereto and that the subject matter disclosed herein may be implemented in various forms and examples, and that the teachings may be applied in numerous applications, only some of which have been described herein. It is intended by the following claims to claim any and all applications, modifications and variations that fall within the true scope of the present teachings.

Claims

WE CLAIM:

1. A method implemented on at least one machine including at least one processor, memory, and communication platform capable of connecting to a network for managing a user machine dialogue, the method comprising:

receiving information related to a user machine dialogue in a dialogue scene involving a user and managed by a dialogue manager in accordance with an initial dialogue strategy;

adapting the initial dialogue strategy, based on the information, to generate an updated dialogue strategy;

determining, based on the updated dialogue strategy, whether the user machine dialogue is to continue in an augmented dialogue reality having a virtual scene rendered in the dialogue scene; and

in response to the determination that the user machine dialogue is to continue in an augmented dialogue reality, activating a virtual agent manager to create the augmented dialogue reality and manage the user machine dialogue therein.

2. The method of claim 1, wherein the information related to the user includes at least one of:

a spoken language understanding result obtained based on an utterance of the user; and data from an information state, including at least one of:

a user profile,

a representation of a state of the user;

a dialogue history;

dialogue context representation, and a representation of a dialogue policy associated with the user machine dialogue.

3. The method of claim 1, wherein the information related to the dialogue scene includes data about a surrounding of the dialogue scene in one or more modalities.

4. The method of claim 1, wherein the virtual scene includes at least one of:

a virtual agent created, based on the information, for carrying out a designated dialogue task in the virtual scene in accordance with a designated dialogue policy; and

at least one virtual object to be rendered in the virtual scene, wherein

the designated dialogue task is determined based on the updated dialogue strategy, the designated dialogue policy is generated in accordance with the designated dialogue task, and

the virtual agent is managed by the virtual agent manager in carrying out the designated dialogue task based on the designated dialogue policy.

5. The method of claim 4, wherein the at least one object is

selected and/or rendered based on the information;

utilized by the virtual agent in carrying out the designated dialogue task with respect to the user in the augmented dialogue reality.

6. The method of claim 1, further comprising suspending the dialogue manager from managing the user machine dialogue while the virtual agent manager is invoked.

7 The method of claim 1, further comprising

receiving a release signal from the virtual agent manager indicating completion of the user machine dialogue in the virtual scene;

suspending the virtual agent manager upon receiving the release signal; and

resuming the dialogue manager to manage the user machine dialogue upon release of the virtual agent manager.

8. A system for managing a user machine dialogue comprising:

a dialogue manager configured for receiving information related to a user machine dialogue in a dialogue scene involving a user, wherein the user machine dialogue is managed in accordance with an initial dialogue strategy;

an information state updater configured for updating an information state based on the information to facilitate adaptation of the initial dialogue strategy stored in the information state to generate an updated dialogue strategy; and

the dialogue manager is further configured for

determining, based on the updated dialogue strategy, whether the user machine dialogue is to continue in an augmented dialogue reality having a virtual scene rendered in the dialogue scene, and

in response to the determination that the user machine dialogue is to continue in the augmented dialogue reality, activating a virtual agent manager to create the augmented dialogue reality and manage the user machine dialogue therein.

9. The system of claim 8, wherein the information related to the user includes at least one of:

a user profile,

a representation of a state of the user;

a dialogue history;

dialogue context representation, and

a representation of a dialogue policy associated with the user machine dialogue.

10. The system of claim 8, wherein the information related to the dialogue scene includes data about a surrounding of the dialogue scene in one or more modalities.

11. The system of claim 8, wherein the virtual agent manager comprises at least one of:

a virtual scene determiner configured for determining the virtual scene based on the information;

a virtual agent determiner configured for determining, based on the information and the virtual scene, a virtual agent that is to carry out a designated dialogue task in the virtual scene in accordance with a designated dialogue policy;

a virtual object selector configured for selecting at least one virtual object to be rendered in the virtual scene; an augmented reality launcher configured for creating the augmented dialogue reality by rendering the virtual scene in the dialogue scene;

a dynamic policy generator configured for determining the designated dialogue policy based on the designated dialogue task; and

a policy enforcement controller configured for managing the virtual agent in carrying out the designated dialogue task based on the designated dialogue policy, wherein

the designated dialogue task is determined based on the updated dialogue strategy.

12. The system of claim 11, wherein the at least one object is

selected and/or rendered based on the information;

utilized by the virtual agent in carrying out the designated dialogue task with respect to the user in the augmented reality dialogue scene.

13. The system of claim 8, wherein the dialogue manager is suspended from managing the user machine dialogue while the virtual agent manager is invoked.

14. The system of claim 8, wherein the dialogue manager is further configured for: receiving a release signal from the virtual agent manager indicating completion of the user machine dialogue in the virtual scene;

suspending the virtual agent manager upon receiving the release signal; and

resuming to manage the user machine dialogue upon release of the virtual agent manager.

15. A method implemented on at least one machine including at least one processor, memory, and communication platform capable of connecting to a network for managing a user machine dialogue, the method comprising:

receiving, from a dialogue manager, a request to carry out a designated dialogue task with respect to a user in a virtual scene to be embedded in a dialogue scene in which the user is presently engaged in the user machine dialogue;

selecting a virtual agent to be rendered in the virtual scene based on an information state characterizing the user and the dialogue scene;

creating the virtual scene in the dialogue scene with the virtual agent rendered therein to create an augmented dialogue reality;

carrying out, by the virtual agent in the augmented dialogue reality, the designated dialogue task based on a designated dialogue policy;

exiting the virtual scene when at least one condition associated with the designated dialogue task is satisfied.

16. The method of claim 15, wherein the information state characterizing the user includes at least one of:

a user profile;

a representation of a state of the user;

a dialogue history;

dialogue context representation; and

17. The method of claim 15, wherein the information state characterizing the dialogue scene includes data representing a surrounding of the dialogue scene in one or more modalities.

18. The method of claim 15, wherein the virtual scene further includes at least one virtual object that

is determined based on the designated dialogue task;

is rendered based on the information state; and

is to be utilized by the virtual agent to carry out the designated dialogue task.

19. The method of claim 15, wherein the step of carrying out comprises:

conducting a designated dialogue with the user in the augmented dialogue reality consistent with the designated dialogue task and based on the designated dialogue policy;

determining whether a secondary virtual scene is to be created based on updated information state generated dynamically based on the designated dialogue;

in response to a determination that the secondary virtual scene is to be created, invoking a virtual agent manager to create the secondary virtual scene.

20. The method of claim 19, wherein

the secondary virtual scene includes at least one of a secondary virtual agent and a secondary virtual object; and

the secondary virtual agent is to carry out a new designated dialogue task.

21. The method of claim 20, wherein the secondary virtual scene replaces the virtual scene in the augmented dialogue reality to create an updated augmented dialogue reality.

22. The method of claim 20, wherein the secondary virtual scene co-exists with the virtual scene to create a further augmented dialogue reality.

23. The method of claim 15, further comprising

suspending the dialogue manager in managing the user machine dialogue when the virtual agent starts the designated dialogue task; and

resuming the dialogue manager in managing the user machine dialogue when the virtual agent completes the designated dialogue task.

24. A system for managing a user machine dialogue comprising:

a virtual scene determiner configured for receiving, from a dialogue manager, a request to carry out a designated dialogue task with respect to a user in a virtual scene to be embedded in a dialogue scene in which the user is presently engaged in the user machine dialogue;

a virtual agent determiner configured for selecting a virtual agent to be rendered in the virtual scene based on an information state characterizing the user and the dialogue scene;

an augmented reality launcher configured for creating an augmented dialogue reality having the virtual scene embedded in the dialogue scene with the virtual agent rendered therein; a policy enforcement controller configured for

carrying out, via the virtual agent in the augmented dialogue reality, the designated dialogue task based on a designated dialogue policy, and exiting the virtual scene of the augmented dialogue reality when at least one condition associated with the designated dialogue task is satisfied.

25. The system of claim 24, wherein the virtual scene further includes at least one virtual object that

is determined based on the designated dialogue task;

is rendered based on the information state; and

26. The system of claim 24, wherein the virtual agent determiner is further configured for:

27. The system of claim 26, wherein

the secondary virtual agent is designated to carry out a new designated dialogue task.

28. The system of claim 26, wherein the secondary virtual scene replaces the virtual scene in the augmented dialogue reality to create an updated augmented dialogue reality.

29. The system of claim 26, wherein the secondary virtual scene co-exists with the virtual scene to create a further augmented dialogue reality.

30. The system of claim 24, further comprising