WO2006003542A1 - Interactive dialogue system - Google Patents

Interactive dialogue system

Info

Publication number
WO2006003542A1
Authority
WO
WIPO (PCT)
Prior art keywords
communication
dialogue
act
user
communication act
Application number
PCT/IB2005/052025
Other languages
French (fr)
Inventor
Thomas Portele
Barbertje Streefkerk
Jürgen TE VRUGT
Original Assignee
Philips Intellectual Property & Standards GmbH
Koninklijke Philips Electronics N.V.
Application filed by Philips Intellectual Property & Standards GmbH and Koninklijke Philips Electronics N.V.
Publication of WO2006003542A1

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue


Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The present invention provides a method of controlling an interactive dialogue system, an interactive dialogue system and a corresponding computer program product. In particular, the invention makes use of a modeling of dialogues that is based on the conversational function of successive dialogue turns. Semantic content and conversational function of user input are strictly separated, and a communication act is identified for a user's input. The interactive dialogue system generates a response to the user's input on the basis of a predicted successive communication act in order to appropriately match the conversational function of a user's request and to mimic human like dialogue behavior.

Description

Interactive dialogue system
The present invention relates to the field of interactive dialogue systems and in particular without limitation to dialogue management and dialogue control.
Interactive dialogue systems are usually adapted to process a user input, e.g. recorded speech of a user, and to provide an audible output in response. Interactive dialogue systems are widely implemented in telephone-based travel or ticket booking and in timetable queries, such as train timetables. When, for example, a user queries an interactive dialogue system by entering a corresponding speech command, the interactive dialogue system has to extract the semantic content of the entered speech and to provide the requested information. In particular, when a query of a user is insufficient in such a way that additional information has to be provided in order to be able to successfully process the request, the interactive dialogue system has to autonomously generate appropriate questions in order to obtain the missing information from the user.
However, a purely semantic based operation of an interactive dialogue system might not be able to mimic human like communication or dialogue behavior. This is due to the fact that in human communication a distinct semantic content may be expressed in a plurality of different ways. In a travel booking scenario, for example, a client wants to travel to Hamburg. Here, a travel agent or an interactive travel booking system has to extract the information that the travel target is Hamburg. The client can now express his request in a plurality of different ways. In a first case the client might say: "I want to travel to Hamburg."
This utterance can be classified as a statement in which a client expresses his request. In a second case the client might say: "Can I travel to Hamburg?" In this second case the client expresses his request as a question. In both cases the travel agent or the booking system may sufficiently extract the necessary information about the requested travel target. Nevertheless, the information provided by the client is insufficient to perform a complete travel booking. Additional information, such as the travel date or travel time as well as the means of travel, has to be provided to the travel agent or to the travel booking system. Therefore, the client has to be motivated to provide the missing information in an ordered way, e.g. by posing directed questions such as "When do you want to travel?" or "From where do you want to travel?" to the client. In human like communication or human like dialogues, these additional questions are provided to the client immediately after an appropriate reaction to the client's initial request. In particular, this reaction has to refer to the concrete expression of the client's initial request and therefore depends on the conversational function of the client's initial request. Referring to the above mentioned examples, in the first case it would be natural that the statement of the client is somehow commented on before an additional question to the client is raised. For example, the travel booking system or the travel agent might say: "Alright, when do you want to travel?". In this example a comment on the client's request is skillfully combined with a successive question.
Referring to the above mentioned second case, the travel booking system or the travel agent first has to answer the question of the client before an additional question can be raised. A possible answer to the client's question might be: "Yes you can. When do you want to travel?".
Therefore, in order to mimic human like and polite dialogue behavior by means of an interactive dialogue system, not only the semantic content of a user input but also its conversational function has to be properly analyzed in order to generate appropriate dialogue system output.
The conversational function of expressions and utterances of users and clients can be classified by means of various communication acts. In the prior art these communication acts are also referred to as dialogue act, dialogue move, illocutionary act, response and initiative.
The US patent application 2003/0130841 A1 discloses a system and method for improving automatic speech recognition in a spoken dialogue system. This method comprises partitioning speech recognizer output into self-contained clauses, identifying a dialogue act in each of the self-contained clauses, qualifying dialogue acts by identifying a current domain object and/or a current domain action, and determining whether further qualification is possible for the current domain object and/or current domain action. If further qualification is possible, the method comprises identifying another domain action and/or another domain object associated with the current domain object and/or current domain action, reassigning that other domain action and/or domain object as the current domain action and/or current domain object, and then recursively qualifying the new current domain action and/or current domain object. This process continues until nothing is left to qualify.
In this disclosure dialogue acts are identified in order to improve speech understanding in a spoken dialogue system. However, the dialogue act identification is not used any further.
The present invention therefore aims to provide an improved interactive dialogue system and an improved method of controlling an interactive dialogue system for mimicking human like dialogue reactions and dialogue behavior.
The present invention provides a method of controlling an interactive dialogue system. The method comprises the steps of receiving a recognizer output that represents an utterance of a user, identifying a first communication act for the received recognizer output, determining a second communication act on the basis of the first communication act and finally generating a response to the user's utterance on the basis of the second communication act.
The inventive method focuses on identifying the conversational function of an utterance of a user. Having classified an utterance of a user allows the system to determine and predict the conversational function, hence the communication act, that matches the user's utterance. The response of the interactive dialogue system is then generated on the basis of this predicted communication act. Referring to the above mentioned example, the second case reflects a question of the yes/no type: "Can I travel to Hamburg?". The inventive method is able to classify this question as a yes/no question that requires an appropriate answer of the yes/no type. Hence, in this example, the second communication act determined by the inventive method has to refer to a communication act of either the accept or the reject type. A corresponding response to the user's utterance may then be as follows: "Yes you can..." or "No, you can't...".
In general, communication acts can be classified into numerous different types. A rather coarse classification of various communication acts can be realized by distinguishing communication acts as forward looking or backward looking communication acts. A forward looking communication act elicits an act, whereas a backward looking communication act represents a reaction. Examples of forward looking communication acts are questions and statements, whereas backward looking communication acts may be represented by answers.
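The patent does not prescribe how this classification is implemented. As a minimal sketch, a keyword-based classifier along the following lines could separate the utterance types named above; all labels, word lists and patterns here are assumptions for illustration, not part of the disclosure.

```python
import re

# Hypothetical top-level act inventory; the patent names the classes
# (questions, statements, ...) but prescribes no concrete scheme.
YES_NO_QUESTION = "yes_no_question"
WH_QUESTION = "wh_question"
STATEMENT = "statement"

WH_WORDS = ("who", "what", "where", "when", "why", "how")
AUX_VERBS = ("can", "could", "do", "does", "did", "is", "are", "will", "would", "may")

def identify_communication_act(utterance: str) -> str:
    """Classify an utterance by conversational function only,
    ignoring its semantic content (step 102 in Figure 1)."""
    words = re.findall(r"[a-z']+", utterance.lower())
    if not words:
        return STATEMENT
    if words[0] in WH_WORDS:
        return WH_QUESTION       # "When do you want to travel?"
    if words[0] in AUX_VERBS:
        return YES_NO_QUESTION   # "Can I travel to Hamburg?"
    return STATEMENT             # "I want to travel to Hamburg."

print(identify_communication_act("Can I travel to Hamburg?"))      # yes_no_question
print(identify_communication_act("I want to travel to Hamburg."))  # statement
```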
In human like communication and human like dialogues it is typically observed that a question is followed by an answer and that a statement is followed by an acceptance or rejection. This allows to describe dialogues as adjacent pairs of communication acts.
The present invention effectively exploits knowledge of adjacent pairs of communication acts as they appear in human like communication and dialogues. In this way, the second communication act can be predicted and determined on the basis of the identified first communication act. Consequently, the response of the interactive dialogue system can be generated on the basis of this predicted communication act in order to imitate human like dialogue behavior and to render the reactions of the interactive dialogue system more polite and coherent.
The classification and subdivision of communication acts is by no means restricted to forward looking and backward looking communication acts. In general, the inventive method can be adapted to a large spectrum of communication acts that may incorporate additional classes of communication acts, such as conversational communication acts, or that provide a finer division and classification of the above mentioned top level communication acts. For example, forward looking and initiative communication acts may further be split into requests and statements. Furthermore, requests can be split into various types, such as imperative requests that initiate a specific action, and WH-questions, which incorporate all types of questions making use of question words such as who, what, where, when, why, how, ...
Another type of communication act may refer to a narrative or conversational communication act. For example, when a communication partner narrates a story, in principle no interaction of a listener has to be involved. However, in human communication a listener typically expresses his attention during the narration by some kind of confirmatory statement such as "Oh yeah", "Mmh", "right" or "O.K.". Even though the narrator does not require a content-related reaction from the listener, he frequently expects some kind of utterance from the latter as an indication of his attention. Especially in telephone based dialogues, it might be very distracting for a narrator when the communication partner does not regularly confirm the receipt of the narrator's speech. By properly recognizing a conversational communication act, the inventive interactive dialogue system may regularly provide such confirmation statements to a user, thereby confirming recognition of the provided speech input.
According to a further preferred embodiment of the invention, the determination of the second communication act further comprises the application of a dialogue model. Typically, the dialogue model is indicative of adjacent pairs of communication acts or entire sequences of communication acts as they can be observed in human like communication. Typically, by identifying the first communication act for the recognizer output, an appropriate dialogue model is selected. The selected dialogue model may then provide a probability that the first communication act is followed by a distinct second communication act. Hence, the dialogue model represents an effective means for determining and predicting the second communication act that has to adjacently follow the first communication act identified for the recognizer output.
According to a further preferred embodiment of the invention, the inventive method further comprises extracting a semantic content of the recognizer output for determining the second communication act. In this way the second communication act is not only determined on the basis of the first communication act but also on the basis of the received recognizer output. Consequently, the conversational function of a user utterance and the semantic content of the user utterance can be effectively combined in order to predict and to determine the second communication act for generating an appropriate and natural response to the user's utterance.
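The disclosure leaves the form of the dialogue model open. One simple reading of "a probability that the first communication act is followed by a distinct second communication act" is a table of adjacency-pair probabilities, as in the hypothetical sketch below; the act labels and probability values are invented for illustration.

```python
# Minimal dialogue model as adjacency-pair probabilities:
# P(second act | first act). Values are illustrative only.
ADJACENCY_MODEL = {
    "yes_no_question": {"yes_no_answer": 0.9, "clarification": 0.1},
    "wh_question":     {"statement": 0.8, "clarification": 0.2},
    "statement":       {"accept": 0.7, "reject": 0.2, "wh_question": 0.1},
}

def predict_second_act(first_act: str) -> str:
    """Return the most probable successive communication act."""
    successors = ADJACENCY_MODEL.get(first_act, {"statement": 1.0})
    return max(successors, key=successors.get)

print(predict_second_act("yes_no_question"))  # yes_no_answer
```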
The impact of the first communication act and of the semantic content of the recognized user utterance on the determination of the second communication act might be arbitrarily specified by the user or by the interactive dialogue system itself. For example, both the recognized semantic content of the user utterance and the first communication act may be provided with separate confidence values indicating the reliability of the recognized semantic content and of the identified first communication act, respectively.
According to a further preferred embodiment of the invention, the dialogue model is a content specific dialogue model. Such a content specific dialogue model for determining the second communication act is selected in response to the extraction of the semantic content of the recognizer output. This feature is particularly applicable when the interactive dialogue system is universally applied and deployed within different dialogue domains, such as travel booking, ticket booking or timetable querying. In this way a plurality of different dialogue models, each of which is specific to a particular application domain, can be universally implemented.
This accounts for the fact that different types of dialogues referring to different domains are typically characterized by different domain specific sequences of communication acts. A content specific dialogue model therefore accounts for a domain specific sequence of communication acts. Consequently, semantic content information of the user utterance can be effectively exploited in order to select the appropriate dialogue model. Moreover, by making use of the selected content specific dialogue model in combination with the extracted semantic content of the user utterance, the subsequent second communication act can be predicted and determined with a higher precision.
Additionally, the entire interactive dialogue system becomes universally applicable to a plurality of domain specific dialogue scenarios. Generally, in this way the inventive interactive dialogue system can become domain-, task- and language independent, provided that a respective dialogue model is always applicable.
According to a further preferred embodiment of the invention, the dialogue model is representative of a sequence of communication acts. Additionally, the dialogue model is based on communication rules and/or a predefined training corpus and/or a pertinent training corpus. Hence, a dialogue model, including a content specific dialogue model, can be generated on the basis of a distinctive communication rule that might be implemented as a corresponding algorithm. A rather simple algorithm may specify that a particular communication act, such as a yes/no question, always has to be followed by a communication act of the yes/no answer type. The decision whether an answer is yes or no depends on the semantic recognition of the user utterance and on the specific question. The inventive method specifies that in response to identifying a yes/no question a corresponding yes/no answer is generated by the interactive dialogue system. As this is only an illustrative example, in general more sophisticated communication rules can be derived on the basis of a training corpus or a pertinent training corpus. Alternatively, a provided training corpus might be directly indicative of various communication act sequences of the dialogue model.
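How such a model is derived from a corpus is not specified in the disclosure. One plausible reading, sketched below under that assumption, is to estimate the adjacency-pair probabilities from relative frequencies in an act-annotated corpus; the toy corpus and labels are invented.

```python
from collections import Counter, defaultdict

def train_dialogue_model(corpus):
    """Estimate adjacency-pair probabilities from an act-annotated
    training corpus (each dialogue is a sequence of act labels)."""
    counts = defaultdict(Counter)
    for dialogue in corpus:
        for first, second in zip(dialogue, dialogue[1:]):
            counts[first][second] += 1
    # Normalize counts into conditional probabilities per first act.
    return {
        first: {second: n / sum(succ.values()) for second, n in succ.items()}
        for first, succ in counts.items()
    }

# Toy corpus; a real pertinent corpus would be domain specific.
corpus = [
    ["yes_no_question", "yes_no_answer", "wh_question", "statement"],
    ["statement", "accept", "wh_question", "statement"],
]
model = train_dialogue_model(corpus)
print(model["yes_no_question"])  # {'yes_no_answer': 1.0}
```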
According to a further preferred embodiment of the invention, the dialogue model can be further trained on the basis of the first communication act. This means that during execution of the inventive method by means of the interactive dialogue system, the training of a dialogue model or of a content specific dialogue model might continue. In this way, a training corpus or a pertinent training corpus can be further improved by analyzing the identified first communication act together with previously identified communication acts of previous recognizer outputs.
According to a further preferred embodiment of the invention, utterances of the user might be verbal and/or non verbal. In this way the inventive method is by no means restricted to purely speech based dialogue systems. Moreover, by making use of image acquisition and image processing means, visual utterances of the user, e.g. nodding, head shaking, pointing, signaling or other types of human like non verbal communication, might be detected and appropriately assigned to a distinct type of communication act. In this way, the inventive method and interactive dialogue system are by no means restricted to implementations in speech based dialogue systems. For example, they might be universally implemented in various human-machine communication applications, such as robots.
According to another aspect, the invention provides an interactive dialogue system that comprises means for receiving a recognizer output that represents an utterance of a user, means for identifying a first communication act for the recognizer output, means for determining a second communication act on the basis of the first communication act and means for generating a response to the user's utterance on the basis of the second communication act.
In still another aspect, the invention provides a computer program product for controlling an interactive dialogue system. The inventive computer program product comprises computer program means that are adapted to receive a recognizer output that represents an utterance of a user, to identify a first communication act for the recognizer output, to determine a second communication act on the basis of the first communication act and to generate a response to the user's utterance on the basis of the second communication act.
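Read as software, the claimed means map naturally onto a small pipeline. The skeleton below is one possible arrangement with the collaborating functions injected; none of the class, parameter or demo names come from the patent.

```python
class InteractiveDialogueSystem:
    """Skeleton mirroring the claimed means: receive recognizer output,
    identify a first act, determine a second act, generate a response."""

    def __init__(self, act_recognizer, act_predictor, response_generator):
        self.act_recognizer = act_recognizer
        self.act_predictor = act_predictor
        self.response_generator = response_generator

    def handle(self, recognizer_output: str) -> str:
        first_act = self.act_recognizer(recognizer_output)        # identify
        second_act = self.act_predictor(first_act)                # determine
        return self.response_generator(recognizer_output, second_act)  # generate

# Demo wiring with trivial stand-ins; the earlier sketches could be
# plugged in instead.
demo = InteractiveDialogueSystem(
    act_recognizer=lambda text: "yes_no_question" if text.lower().startswith("can") else "statement",
    act_predictor=lambda act: "yes_no_answer" if act == "yes_no_question" else "accept",
    response_generator=lambda text, act: "Yes you can." if act == "yes_no_answer" else "Okay.",
)
print(demo.handle("Can I travel to Hamburg?"))  # Yes you can.
```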
Further, it is to be noted that any reference signs in the claims are not to be construed as limiting the scope of the present invention. In the following, preferred embodiments of the invention will be described in greater detail by making reference to the drawings, in which: Figure 1 is illustrative of a flowchart of the inventive method,
Figure 2 shows a block diagram of an inventive interactive dialogue system,
Figure 3 illustrates a flowchart for predicting and determining a communication act.
Figure 1 is illustrative of a flowchart of processing output generated by an automatic speech recognizer in order to predict a successive communication act and to generate an appropriate response that has to be provided to the user of an interactive dialogue system. In principle, the inventive method of predicting and determining a successive communication act can be universally implemented into a variety of existing dialogue systems.
In a first step 100, output from an automatic speech recognition (ASR) system is received. Hence, speech input by a user into the interactive dialogue system is recognized and transformed into electrical signals that can be further processed. After recognition of the speech of a user in step 100, the method continues with steps 102 and 104. Typically, these successive steps 102, 104 are performed in parallel, i.e. simultaneously, or alternatively in a sequential order.
In step 102, based on the output of the ASR system, a communication act of the recognizer output is identified. Hence, the user input is classified according to its conversational function. Irrespective of the semantic content, the recognized speech input is for example classified as a forward looking or backward looking communication act. This means that the determination of the communication act type of a user input is entirely separated from the determination of the semantic content of the inputted speech.
In step 104 a semantic content of the recognizer output, i.e. the recognized speech, is extracted. Extraction of a semantic content refers to an identification of the user's intention. Irrespective of whether the user expresses his request as a statement or as a question, like "I want to travel to Hamburg" or "Can I travel to Hamburg?", the semantic content that has to be extracted is basically the same. In both cases the following information can be extracted: the user's general request refers to traveling, and the travel target is Hamburg.
After identification of a communication act in step 102 and after extraction of the semantic content of a recognized user input in step 104, the method continues with step 106, where an appropriate dialogue model is selected. Here, a dialogue model refers to a dialogue specific sequence of communication acts. For example, a dialogue model may provide the information that a sequence of communication acts of a dialogue consists of pairs of yes/no questions and appropriate yes/no answers. A dialogue model can be generated on the basis of dialogue specific rules or algorithms, on the basis of a training corpus or by means of a pertinent training corpus.
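As a toy illustration of step 104, a pattern-based extractor could map both phrasings of the Hamburg request to the same content. The slot names and patterns below are assumptions covering only the travel example.

```python
import re

def extract_semantic_content(recognizer_output: str) -> dict:
    """Tiny slot extractor for the travel example (step 104): both
    'I want to travel to Hamburg' and 'Can I travel to Hamburg?'
    yield the same semantic content."""
    content = {}
    if "travel" in recognizer_output.lower():
        content["request"] = "travel"
    # Capitalized word after "to" is taken as the travel target.
    m = re.search(r"\bto ([A-Z][a-z]+)", recognizer_output)
    if m:
        content["destination"] = m.group(1)
    return content

print(extract_semantic_content("I want to travel to Hamburg."))
print(extract_semantic_content("Can I travel to Hamburg?"))
# both: {'request': 'travel', 'destination': 'Hamburg'}
```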
Moreover, a dialogue model may account for the semantic content of a recognized user input. In this case the dialogue model is a content specific dialogue model that is applicable to an appropriate dialogue and to appropriate dialogue situations. For example, a dialogue model may be specific to a travel booking dialogue or to a timetable querying dialogue. Hence, a specific dialogue model might be selected from a plurality of available dialogue models based on an extracted semantic content of a recognized user input.
Also, a dialogue model might be selected in response to an identified communication act of the output of the ASR system. In this way, a dialogue model may not only be selected on the basis of a recognized semantic content of a dialogue, but also on the basis of previously identified communication acts, or of an entire sequence of previous communication acts.
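Selection of a content specific model from the extracted content could, for instance, be keyword driven, as in the hypothetical sketch below; the domain names and keyword sets are invented for illustration.

```python
# Hypothetical domain keyword table for choosing a content specific
# dialogue model from the extracted semantic content (step 106).
DOMAIN_MODELS = {
    "travel_booking":  {"travel", "flight", "book"},
    "timetable_query": {"timetable", "departure", "arrival"},
}

def select_dialogue_model(semantic_content: dict) -> str:
    """Pick the dialogue model whose keywords overlap the extracted
    content; fall back to a generic model otherwise."""
    tokens = {str(v).lower() for v in semantic_content.values()}
    for domain, keywords in DOMAIN_MODELS.items():
        if tokens & keywords:
            return domain
    return "generic"

print(select_dialogue_model({"request": "travel", "destination": "Hamburg"}))
# travel_booking
```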
After an appropriate dialogue model has been selected in step 106, the selected dialogue model is applied in step 108 in order to predict a successive communication act. Application of a dialogue model, which is performed in step 108, refers to a comparison of the identified communication act with the selected dialogue model. Additionally, application of the dialogue model may refer not only to a comparison with the last identified communication act but also to an entire sequence of previously identified communication acts.
Based on the applied dialogue model and the comparison with an identified communication act, a communication act that follows the identified communication act can be predicted. Prediction of a communication act may refer to a generation of probabilities that an identified communication act is followed by a distinct successive communication act. Such a probability may further be based on the selected dialogue model and corresponding confidence values. When, for example, an interactive dialogue system has identified a user input as a yes/no question, application of an appropriate dialogue model shall provide a communication act of the yes/no answer type. Hence, after the dialogue model has been applied in step 108, a corresponding successive communication act is determined in the successive step 110.
Thereafter, on the basis of the determined successive communication act and on the basis of the extracted semantic content of the user input, a response to the user input is generated in the last step 112. On the one hand the response has to satisfy the semantic content of the user's request, and on the other hand the generated response has to match the conversational function of the request of the user. Referring to the above mentioned example, where a user expresses his request of traveling to Hamburg, the system may provide a positive answer, and depending on the identified communication act of the user request, the interactive dialogue system either provides an answer of "Yes, you can travel to Hamburg.", or "Okay, ..." and then asks for further information.
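Step 112 can be pictured as a template choice keyed on the predicted act, as in this sketch; the templates and act labels are illustrative assumptions, not claimed wording.

```python
def generate_response(semantic_content: dict, second_act: str) -> str:
    """Combine semantic content with the predicted act (step 112)."""
    destination = semantic_content.get("destination", "there")
    if second_act == "yes_no_answer":
        # The user asked a yes/no question: answer it, then elicit more.
        return f"Yes, you can travel to {destination}. When do you want to travel?"
    # The user made a statement: acknowledge it, then elicit more.
    return "Okay, when do you want to travel?"

print(generate_response({"destination": "Hamburg"}, "yes_no_answer"))
print(generate_response({"destination": "Hamburg"}, "accept"))
```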
In this way the interactive dialogue system becomes much more polite and user friendly and approaches human like dialogue behavior.
Figure 2 illustrates a schematic block diagram of the inventive interactive dialogue system 200. A user 202 may interactively communicate with the interactive dialogue system 200. The interactive dialogue system 200 has an output generator 204, an automatic speech recognition (ASR) module 206, a semantic content recognizer 208, a communication act recognizer module 210, a communication act predictor module 212, a dialogue model module 214, a dialogue controller 216 as well as a communication act history module 218.
The user 202 typically provides speech to the interactive dialogue system 200, which is input into and processed by the ASR module 206. Hence, an audible speech signal is transformed into electrical signals that can be further processed in order to generate an appropriate response. After recognition of the user's input, the ASR module 206 provides input to both the semantic content recognizer 208 and the communication act recognizer module 210. The semantic content recognizer 208 serves to extract the semantic content of the output of the ASR module 206, and the communication act recognizer module serves to identify a communication act for the recognizer output. Hence, the semantic content recognizer module 208 performs step 104 of figure 1 and the communication act recognizer module 210 provides the functionality of step 102 of figure 1.
Once a communication act of a user's input has been identified, the identified communication act is provided to the communication act history module 218, which basically stores the identified communication act. Additionally, the identified communication act is also provided to the communication act predictor module 212.
The communication act history module 218 provides an effective means to store a sequence of identified communication acts of a dialogue. The communication act history module 218 is further adapted to communicate with the dialogue model module 214 in order to provide training data to the dialogue model module 214 and to compare a sequence of identified communication acts with sequences of communication acts provided by a distinct dialogue model that is stored by means of the dialogue model module 214.
The output of the semantic content recognizer module 208 is also provided to the communication act predictor module 212. The communication act predictor module 212 provides one of the key functionalities of the inventive interactive dialogue system 200. The communication act predictor module 212 selects a dialogue model by querying the dialogue model module 214 and applies the selected dialogue model in order to predict a successive communication act to the last user input. Prediction of a successive communication act may be based on the semantic content of the last communication act provided by the semantic content recognizer module 208, on the last communication act provided by the communication act recognizer module 210, on a sequence of the last communication acts provided by the communication act history module 218 and on a dialogue model provided by the dialogue model module 214.
Once a successive communication act has been predicted by the communication act predictor 212, the predicted or determined communication act is provided to the dialogue controller 216. The dialogue controller 216 also receives input from the semantic content recognizer 208. The dialogue controller module 216 generates a response to the user's input on the basis of the semantic content of the user input provided by the semantic content recognizer 208 and on the basis of the predicted successive communication act. In this way, the interactive dialogue system 200 generates a response to the user with respect to both the semantic content and the conversational function of the user input. Hence, the interactive dialogue system 200 is enabled to generate a polite and human like response to the user.
When the dialogue controller module 216 has generated an appropriate dialogue turn, the generated signals are provided to the output generator 204. The output generator 204 serves to transform the electrical signals received from the dialogue controller 216 into e.g. audible speech that is provided to the user 202. For example, the output generator 204 may comprise a speech synthesizing module and a speaker in order to provide speech to the user 202. Alternatively, the output of the interactive dialogue system may be provided in a visual form to the user 202. In this case the output generator 204 may comprise some kind of display means. Moreover, when the inventive dialogue system is implemented in an autonomous movable system, such as a robot like device, the output generator 204 may be further adapted to generate human-like non verbal communication gestures, such as nodding, head shaking, pointing, eyebrow movement, etc.
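One way to picture the communication act history module 218 is as a bounded buffer that hands recent acts to the predictor and adjacency pairs to the dialogue model module for further training. Everything in this sketch beyond that role, including the capacity, is an assumption.

```python
from collections import deque

class CommunicationActHistory:
    """Bounded store of identified acts (module 218): supplies the
    recent sequence to the predictor and training pairs to the
    dialogue model module. Capacity is an arbitrary choice."""

    def __init__(self, capacity: int = 50):
        self._acts = deque(maxlen=capacity)

    def add(self, act: str) -> None:
        self._acts.append(act)

    def recent(self, n: int = 3) -> list:
        return list(self._acts)[-n:]

    def training_pairs(self):
        acts = list(self._acts)
        return list(zip(acts, acts[1:]))  # adjacency pairs for training

history = CommunicationActHistory()
for act in ["yes_no_question", "yes_no_answer", "statement"]:
    history.add(act)
print(history.recent(2))         # ['yes_no_answer', 'statement']
print(history.training_pairs())  # [('yes_no_question', 'yes_no_answer'), ...]
```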
The embodiment illustrated in figure 2 is based on speech input of the user 202. Generally, the inventive method and interactive dialogue system are not limited to speech recognition. Moreover, the interactive dialogue system 200 may comprise a plurality of input and output modules that even provide non verbal communication. For example, the interactive dialogue system 200 may comprise image acquisition means and corresponding image analysis means for appropriately detecting communicative gestures of the user 202, like nodding, pointing, head shaking, ...
Figure 3 is illustrative of a flowchart for predicting and determining a successive communication act. In particular, the flowchart illustrated in figure 3 is representative of the functionality of the communication act predictor module 212 of figure 2. In two initial steps 300 and 302, a communication act of the current recognizer output and the semantic content of the current recognizer output are received, respectively. Step 300 thus refers to the communication act aspect, whereas step 302 refers to the semantic content aspect of the interactive dialogue system. After an identified communication act has been received from the communication act recognizer module 210 in step 300, it is checked in a successive step 304 whether a dialogue model has already been selected. Depending on whether a distinct dialogue model is applicable, either step 306 or step 308 is performed next. In case no dialogue model has been selected yet, step 306 is executed, in which a specific dialogue model is selected. Selection of a dialogue model is preferably based on the identified communication act and the extracted semantic content of the current recognizer output, i.e. the user input.
In the opposite case, when a dialogue model has already been selected by the communication act predictor 212, the method continues directly with step 308 after step 304. When a dialogue model first has to be selected in step 306, the method likewise continues with step 308. In step 308 the identified current communication act is compared with the selected dialogue model, with the previous sequence of communication acts and with the content of the current and the previously recognized user inputs. This comparison may be carried out as a plurality of separate comparison steps or as a single combined comparison of all respective parameters. Step 308 therefore receives input from step 304 or alternatively step 306, from step 302, and from step 310, in which a sequence of previous communication acts is received from the communication act history module 218.
Finally, in the last step 312, a successor of the current communication act is determined as a result of the comparison performed in step 308. This predicted and determined successive communication act then serves as a basis for generating the next dialogue turn or dialogue move of the interactive dialogue system 200.
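Reusing the hypothetical classes from the sketch above, the control flow of figure 3 might be rendered as follows. The step numbers in the comments refer to the flowchart; the "domain"-keyed model selection and the fallback comparison are assumptions, since the patent prescribes the order of the steps but not their internals.

    def predict_successive_act(current_act, semantic_content, history,
                               model, model_library):
        # Steps 300 and 302: the current communication act and the
        # current semantic content have been received by the caller.
        # Steps 304 and 306: select a dialogue model if none is active
        # yet, here keyed on a hypothetical "domain" slot of the content.
        if model is None:
            model = model_library.get(semantic_content.get("domain"),
                                      DialogueModel())
        # Step 310: 'history' holds the previous acts from module 218.
        # Step 308: compare the current act with the selected model and,
        # if that yields nothing, with the most recent previous act.
        predicted = model.most_likely_successor(current_act)
        if predicted is None and history:
            predicted = model.most_likely_successor(history[-1])
        # Step 312: determine the successive communication act; the
        # result drives the next dialogue turn of system 200.
        return predicted or CommunicationAct("clarification_request"), model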
The invention therefore provides an approach to control an interactive dialogue system with respect to both the semantic content and the conversational function of the user input. In this way, the interactive dialogue system accounts not only for the recognized information content provided by an automatic speech recognition module but also for the syntactic expression of a user's request. Consequently, the interactive dialogue system behaves in a more polite, user-friendly, natural and human-like way.
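Finally, a dialogue controller in the sense of module 216 could condition the wording of the next system turn on the predicted communication act, as in the following deliberately simple and hypothetical sketch (again reusing CommunicationAct from above; the slot names and phrasings are invented for illustration):

    def generate_turn(predicted_act, semantic_content):
        # Hypothetical module-216 behavior: the predicted successive
        # communication act selects the type of dialogue move, while the
        # semantic content supplies the information slot still missing.
        slot = semantic_content.get("missing_slot", "travel date")
        if predicted_act.label == "clarification_request":
            return "I am sorry, could you rephrase that?"
        if predicted_act.label == "question":
            return f"Of course. Could you also tell me your {slot}?"
        return f"Understood. Your {slot} has been noted."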
LIST OF REFERENCE NUMERALS
200 interactive dialogue system
202 user
204 output generator
206 automatic speech recognition system
208 content recognizer
210 communication act recognizer
212 communication act predictor
214 dialogue model module
216 dialogue controller
218 communication act history module

Claims

CLAIMS:
1. A method of controlling an interactive dialogue system, the method comprising the steps of:
- receiving of a recognizer output representing an utterance of a user,
- identifying a first communication act for the recognizer output,
- determining a second communication act on the basis of the first communication act,
- generating a response to the user's utterance on the basis of the second communication act.
2. The method according to claim 1, wherein determining of the second communication act further comprises applying of a dialogue model.
3. The method according to claim 2, further comprising extracting of a semantic content of the recognizer output for determining of the second communication act.
4. The method according to claim 3, wherein the dialogue model is a content specific dialogue model, the content specific dialogue model for determining the second communication act being selected in response to the extraction of the semantic content of the recognizer output.
5. The method according to claim 2, wherein the dialogue model is representative of a sequence of communication acts and wherein the dialogue model is based on communication rules and/or a predefined training corpus and/or a pertinent training corpus.
6. The method according to claim 2, further comprising training of the dialogue model on the basis of the first communication act.
7. The method according to claim 1, wherein the utterance of the user is verbal and/or non-verbal.
8. An interactive dialogue system, comprising:
- means for receiving of a recognizer output representing an utterance of a user,
- means for identifying a first communication act for the recognizer output,
- means for determining a second communication act on the basis of the first communication act,
- means for generating a response to the user's utterance on the basis of the second communication act.
9. The interactive dialogue system according to claim 8, further comprising:
- means for extracting of a semantic content of the recognizer output,
- means for applying a dialogue model for determining of the second communication act on the basis of the first communication act and the semantic content of the recognizer output.
10. A computer program product for controlling an interactive dialogue system, the computer program product comprising computer program means being adapted to:
- receive a recognizer output representing an utterance of a user,
- identify a first communication act for the recognizer output,
- determine a second communication act on the basis of the first communication act,
- generate a response to the user's utterance on the basis of the second communication act.
PCT/IB2005/052025 2004-06-29 2005-06-21 Interactive dialogue system WO2006003542A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP04103036.2 2004-06-29
EP04103036 2004-06-29

Publications (1)

Publication Number Publication Date
WO2006003542A1 true WO2006003542A1 (en) 2006-01-12

Family

ID=34970608

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2005/052025 WO2006003542A1 (en) 2004-06-29 2005-06-21 Interactive dialogue system

Country Status (1)

Country Link
WO (1) WO2006003542A1 (en)

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
S KEIZER, R OP DEN AKKER, A NIJHOLT: "Dialogue Act Recognition with Bayesian Networks for Dutch Dialogues", PROCEEDINGS OF THE THIRD SIGDIAL WORKSHOP ON DISCOURSE AND DIALOGUE, July 2002 (2002-07-01), Philadelphia, pages 88 - 94, XP002341968, Retrieved from the Internet <URL:http://acl.ldc.upenn.edu/W/W02/W02-0213.pdf> [retrieved on 20050823] *
STOLCKE A ET AL: "DIALOGUE ACT MODELING FOR AUTOMATIC TAGGING AND RECOGNITION OF CONVERSATIONAL SPEECH", COMPUTATIONAL LINGUISTICS, CAMBRIDGE, MA, US, vol. 26, no. 3, September 2000 (2000-09-01), pages 339 - 373, XP001028761 *
TOSHIAKI FUKADA, DETLEF KOLL, ALEX WAIBEL, KOUICHI TANIGAKI: "PROBABILISTIC DIALOGUE ACT EXTRACTION FOR CONCEPT BASED MULTILINGUAL TRANSLATION SYSTEMS", PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, vol. 6, October 1998 (1998-10-01), Sydney, Australia, pages 2771 - 2774, XP007000474 *
TWITCHELL D P ET AL: "Speech act profiling: a probabilistic method for analyzing persistent conversations and their participants", SYSTEM SCIENCES, 2004. PROCEEDINGS OF THE 37TH ANNUAL HAWAII INTERNATIONAL CONFERENCE ON 5-8 JAN. 2004, PISCATAWAY, NJ, USA,IEEE, 5 January 2004 (2004-01-05), pages 107 - 116, XP010683416, ISBN: 0-7695-2056-1 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9229974B1 (en) 2012-06-01 2016-01-05 Google Inc. Classifying queries
CN103811002A (en) * 2012-11-13 2014-05-21 通用汽车环球科技运作有限责任公司 Adaptation methods and systems for speech systems
CN106057205A (en) * 2016-05-06 2016-10-26 北京云迹科技有限公司 Intelligent robot automatic voice interaction method
CN106057205B (en) * 2016-05-06 2020-01-14 北京云迹科技有限公司 Automatic voice interaction method for intelligent robot

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KM KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NA NG NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SM SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GM KE LS MW MZ NA SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LT LU MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

NENP Non-entry into the national phase

Ref country code: DE

WWW Wipo information: withdrawn in national office

Country of ref document: DE

121 Ep: the epo has been informed by wipo that ep was designated in this application
122 Ep: pct application non-entry in european phase