WO2006016307A1 - Ontology-based dialogue system with application plug-and-play and information sharing - Google Patents

Ontology-based dialogue system with application plug-and-play and information sharing

Info

Publication number
WO2006016307A1
Authority
WO
WIPO (PCT)
Prior art keywords
application
applications
input
user
knowledge sources
Prior art date
2004-08-06
Application number
PCT/IB2005/052521
Other languages
French (fr)
Inventor
Thomas Portele
Jürgen TE VRUGT
Barbertje Streefkerk
Original Assignee
Philips Intellectual Property & Standards GmbH
Koninklijke Philips Electronics N. V.
Priority date
2004-08-06
Filing date
2005-07-27
Publication date
2006-02-16
Application filed by Philips Intellectual Property & Standards GmbH, Koninklijke Philips Electronics N. V. filed Critical Philips Intellectual Property & Standards GmbH
Publication of WO2006016307A1 publication Critical patent/WO2006016307A1/en

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/16 - Sound input; Sound output
    • G06F 3/167 - Audio in a user interface, e.g. using voice commands for navigating, audio feedback
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/30 - Semantic analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 5/00 - Computing arrangements using knowledge-based models
    • G06N 5/02 - Knowledge representation; Symbolic representation
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/226 - Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics
    • G10L 2015/228 - Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics of application context

Abstract

The present invention relates to a dialogue system enabling a user to control multiple applications based on mono-modal or multi-modal input from a user, comprising speech input, wherein said system is an intermediate layer between said applications and said user, said system comprising: - means for receiving and storing application-specific knowledge sources for each of said applications, wherein said knowledge sources for each of said applications are represented in a common application-independent way, - means for receiving said input from said user, - means for processing in an application-independent way said input comprising using said stored application-specific knowledge sources for determining one or more of said applications for which said received input is intended, - means for forwarding said processed input to said determined applications.

Description

ONTOLOGY-BASED DIALOGUE SYSTEM WITH APPLICATION PLUG-AND-PLAY AND INFORMATION SHARING
The present invention relates to a dialogue system enabling a user to control multiple applications based on mono-modal or multi-modal input from a user, comprising speech input. The invention further relates to a method of controlling multiple applications based on mono-modal or multi-modal input from a user, comprising speech input. The invention further relates to a computer readable medium having stored therein instructions for causing a processing unit to execute a method according to the invention. The invention further relates to an application which is adapted to be connected to a dialogue system according to the invention.
Earlier spoken dialogue systems usually performed one task; such dialogue systems are described in e.g. US 5754736 and WO 9741521. Nowadays, spoken dialogue systems can serve as user interfaces to an intelligent environment like a connected home with multiple services and devices; these dialogue systems are based on application-specific input interpretation.
Further, a problem with the prior art is that the applications are not able to (automatically) share data easily, making it cumbersome for the user, who has to re-enter data in certain cases. An example could be an electronic programming guide and a hard disk recorder. The electronic programming guide provides a list of broadcasts (movies, shows, news, ...) on the TV. If the user selects a movie in the program guide and says: "Record it", the selection of a certain broadcast is not available to and interpretable by the recorder, even though the content of the selection originates from the electronic programming guide and can therefore be considered to be reusable by the recorder device.
It is therefore an object of the invention to provide a dialogue system having the above-mentioned functionalities, thereby improving prior art dialogue systems.
This is obtained by a dialogue system enabling a user to control multiple applications based on mono-modal or multi-modal input from a user, comprising speech input, wherein said system is an intermediate layer between said applications and said user, said system comprising:
- means for receiving and storing application-specific knowledge sources for each of said applications, wherein said knowledge sources for each of said applications are represented in a common application-independent way,
- means for receiving said input from said user,
- means for processing in an application-independent way said input comprising using said stored application-specific knowledge sources for determining one or more of said applications for which said received input is intended,
- means for forwarding said processed input to said determined applications.
In an embodiment said means for processing said input further comprises using a history based on previous input received and processed by said dialogue system and said applications.
In an embodiment said knowledge sources are represented by data comprising ontology-based descriptions defining the degrees of freedom of said application.
In an embodiment said knowledge sources are further represented by data comprising the grammar of said application.
By a dialogue system according to the above, the following advantages are obtained in combination:
1. Multiple applications can be accessed in parallel by a user using the same dialogue system, i.e. a user can e.g. control several applications through the same dialogue system one after another or simultaneously.
2. A thin interface is obtained between the core dialogue system and the applications.
3. It is possible to dynamically add or subtract applications to the dialogue system.
4. It is possible to perform automatic and consistent integration of discourse knowledge into the information provided by the user.
5. Components of the dialogue system can be stateless except for the dialogue history.
6. It is possible to (automatically) exchange and/or reuse information between applications.
7. It is possible to make use of modularised knowledge sources, especially ontological descriptions, to represent the application data.
8. It is possible to provide meta-functionality like navigation in the dialogue history and access to lists provided by the output components in a generic way.
9. Flexible dialogue initialisation, since both user and system can initiate a conversation.
The invention also relates to a method of controlling multiple applications based on mono-modal or multi-modal input from a user, comprising speech input, wherein said method comprises:
- receiving and storing application-specific knowledge sources for each of said applications, wherein said knowledge sources for each of said applications are represented in a common application-independent way,
- receiving said input from said user,
- processing in an application-independent way said input comprising using said stored application-specific knowledge sources for determining one or more of said applications for which said received input is intended,
- forwarding said processed input to said determined applications.
In an embodiment said processing of said input further comprises using a history based on previous input received and processed by said dialogue system and said applications.
In an embodiment said knowledge sources are represented by data comprising ontological descriptions representing the degrees of freedom of said application.
In an embodiment said knowledge sources are further represented by data comprising the grammar of said application.
The invention further relates to a computer readable medium having stored therein instructions for causing a processing unit to execute a method as described above.
The invention further relates to an application which is adapted to be connected to a dialogue system according to the above, wherein said application comprises application-specific knowledge sources adapted to be shared with said dialogue system when connected with said dialogue system.
In the following, preferred embodiments of the invention will be described with reference to the figures, where
figure 1 illustrates a dialogue system according to the present invention which is implemented as an intermediate layer between the user and various applications,
figure 2 illustrates a specific embodiment of the dialogue system in Fig. 1, and
figure 3 illustrates in a more detailed way how a current user input may be integrated with the dialogue history shown in Fig. 2.
Figure 1 illustrates a dialogue system 101 according to the present invention which is implemented as an intermediate layer between a user 107 and various applications 109, 111, 113, 115, 117, 119 so that said applications may be controlled in parallel. The applications may e.g. be devices such as a TV 109, DVD/VCR 111, home computer or home entertainment system 113, stereos 115 or a security system 117. The applications could also be services such as an EPG or an MP3 music service provided by a server 119 or a service provider via a network. The procedure of accessing said applications via a single dialogue system is accomplished by receiving, processing and storing application-specific knowledge sources for each of the applications, which are represented in a common application-independent way. In one embodiment the dialogue system is also equipped with knowledge sources that may at least partly be independent of applications and enable the access to generic functionalities (e.g. navigation in the discourse) and/or provide generic knowledge on which the application-dependent knowledge sources can be built (e.g. courtesy phrases in the grammar that can be reused). When a mono-modal or multi-modal input from a user comprising speech input, such as a speech command 103 from the user 107, is received, it is processed. The processing comprises using the application-specific knowledge sources; based thereon, the application for which the received speech command 103 is intended is determined, and the data is forwarded to the determined application. This enables data to be processed, represented, or both, in the system 101 independently of the individual application 109, 111, 113, 115, 117, 119. Therefore, due to the application-dependent knowledge sources, which enable the applications to perform some operation based on the user input, the processing does not need to be changed when new applications are controlled via the dialogue system 101. Therefore, multiple applications are enabled to run in parallel, meaning that multiple applications can be accessed through one dialogue system in a consistent way. The setup of the applications that can be controlled by the dialogue system can be changed dynamically.
In one embodiment the knowledge sources comprise data defining ontology-based descriptions of the domain of an application, where the ontology-based representation is trees defining the degrees of freedom of an application and, through task nodes, giving a hint as to which task might be performed on a concrete tree; the application, however, has to decide which operation/action to carry out based on the provided information. Each application sends the knowledge sources to the application manager. As an example, the ontology domain for a TV 109 may comprise one or more trees, where e.g. one tree may comprise a TV device as the top of the tree, and the three nodes below comprise representations for e.g. the power switch, the tuner for different channels, and the volume. The tuner could contain something like storage for one or more stations, again separated into station names, e.g. NBC, CNN etc., and the associated channel number in the TV 109. More details relating to the tree structure will be discussed later. Each of the applications shown in Fig. 1 which are connected to the dialogue system 101 supplies an ontology-based description of its domain, which is centrally stored 105, along with information about the transformation from words to semantic atoms contained in the ontology, which could be in the form of a grammar for some analysis component, as well as information about the reverse transformation, i.e. from a semantic representation, in one embodiment being the ontological description, to output. These transformations may contain language-dependent parts, while the core ontology is language-independent. Therefore, the conversion of an application from one language to another language is greatly facilitated. In one embodiment the output transformation is an XSL stylesheet which transforms an ontology-based hierarchical object tree represented in XML into either a string of text to be transformed into speech by speech synthesis, or a description for a visual presentation, which can be an HTML document displayable by any web browser. The ontology-based description may even be based on or incorporate other ontological descriptions, thereby creating more special ontologies when based on other descriptions using inheritance, or reusing existing descriptions in a parent-child relation.
Thus there are two ways to 'reuse' ontological descriptions:
1. A more general ontological description is used to specify a more specialized description, like the TV-device being a consumer-electronics-device being a device. Through such a relation, the more specialized description inherits the properties from the more general description (inheritance, "is-a").
2. As the structures of the ontologies can be seen as trees, there are parent-child relations in the trees. A 'tuner' that is defined elsewhere can be used as part of a TV-device and a DVD-recorder-device, therefore being a child in the description for both of these devices ("has-a").
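By way of a non-limiting illustration, these two reuse relations could be mirrored in a Java class model; the description notes that each ontological item may be modeled by e.g. a Java class, but the concrete classes and fields below are assumptions rather than part of the disclosed system:

```java
// Hedged sketch: "is-a" reuse as inheritance, "has-a" reuse as composition.
class Device {}                                    // most general description
class ConsumerElectronicsDevice extends Device {}  // is-a device

class PowerSwitch {}
class Volume {}
class Tuner {}                                     // defined once, reused below

class TVDevice extends ConsumerElectronicsDevice { // is-a consumer-electronics-device
    PowerSwitch power = new PowerSwitch();         // has-a power switch
    Tuner tuner = new Tuner();                     // has-a tuner
    Volume volume = new Volume();                  // has-a volume
}

class DVDRecorderDevice extends ConsumerElectronicsDevice {
    PowerSwitch power = new PowerSwitch();
    Tuner tuner = new Tuner();                     // the same 'tuner' description, reused
}
```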
Further, information can be (automatically) interchanged between two applications. As an example, a "broadcast slot" from the electronic program guide (EPG) specifying a certain broadcast can be reused at the hard disc recorder to program a recording. In one embodiment, to support the information interchange, relations on the units of the ontological descriptions are derived from the ontological descriptions using information on the reuse of existing descriptions.
In the example presented in Fig. 1, the TV 109, the VCR 111, the stereos 115, the security system 117 and the service 119 have predefined ontology domains which supply descriptions of different actions to be performed through the specification of some degrees of freedom of an application. These domains are forwarded to the application management layer, which is a part of the dialogue system 101 and enables the user 107 to interact with the various applications through one dialogue system. Furthermore, a new application can easily be integrated into the system when a new device or service is purchased. This new device would comprise new additional knowledge sources like an ontology, so that when connecting the new device to the system 101, the application manager can integrate the new knowledge sources and provide the modified knowledge sources to the dialogue system components where necessary. Generally, the plug-and-play of applications into/from the dialogue system may be used to restrict the active knowledge sources in the system (e.g. the vocabulary of the speech recognizer) and can thereby improve the performance of components (like the automatic speech recognition). Other examples of applications could also be MP3 players or image browsers. Applications can be, but are not limited to, devices and services.
Figure 2 illustrates a specific embodiment of the dialogue system 101 in Fig. 1. The application manager (Appl. Man.) 201 serves as a thin interface to a set of applications (Appl.1) 217, (Appl.2) 219, (Appl.3) 221, (Appl.N) 223, which can be connected or disconnected during operation of the system. These applications may e.g. be the applications presented in Fig. 1. At connection time, an application forwards specific knowledge sources like its ontology, its grammar, its lexicon and its style-sheets to the application manager. Ontologies, grammars, lexica and style-sheets correspond to the application, where each ontological item is modeled by e.g. a Java class. Sub-applications have their own data, which is incorporated by the main ontology, grammar, lexicon and style-sheets in ways suitable for the pertinent format; e.g. an "import" declaration in OWL, a separate style-sheet, or a separate grammar is provided. By employing the same sub-application for e.g. "time", applications can share and may exchange commonly used data. An application can send an updated grammar, lexicon or style sheet any time it seems reasonable; thus the overall vocabulary size in the system can be kept minimal and optimally suited to the available applications for optimal performance.
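A minimal sketch of this connection-time exchange is given below; the type names (Application, KnowledgeSources, ApplicationManager) and method signatures are illustrative assumptions, not the disclosed interface:

```java
import java.util.ArrayList;
import java.util.List;

// Placeholder for the knowledge sources an application forwards at
// connection time (ontology, grammar, lexicon, style-sheets).
class KnowledgeSources {}

interface Application {
    KnowledgeSources getKnowledgeSources();
}

class ApplicationManager {
    private final List<Application> connected = new ArrayList<>();

    // Plug: store the application-specific knowledge sources and rebuild the
    // unified, application-independent knowledge of the dialogue system.
    void connect(Application app) {
        connected.add(app);
        rebuildUnifiedKnowledge();
    }

    // Unplug: removing an application shrinks the active vocabulary,
    // which can improve e.g. speech-recognition performance.
    void disconnect(Application app) {
        connected.remove(app);
        rebuildUnifiedKnowledge();
    }

    private void rebuildUnifiedKnowledge() {
        // Unify the individual grammars/lexica of all connected applications
        // into one coherent grammar, as described in the text.
    }
}
```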
The input from the user 107 to the dialogue system 101 is processed by the speech-recognition and language-understanding module (Recogn. & Anal.) 203, where the input is processed into a set of ontology atoms (i.e. single ontology classes) and preferably one 'task'. A task describes what a user could be intended to do, and the atoms denote the objects on which the task could operate. The set of tasks is limited and could e.g. comprise the following tasks: info (give information about the item), switch (change a discrete value), adjust (change a numerical value), select (select one or more items), create (create an item), delete (delete an item), help (provide help for an item or in general). The hypothesis generation (Hypoth. Gener.) 209 uses the ontology module to connect the atoms and the tasks to coherent object trees based on the relations from the ontology. As an example, if the user's input comprises "see pictures from John in Italy 2004", one possible result from the speech-recognition and language-understanding process might be the connection of "John" to the ontology class "firstname" (derived from "from John"), "Italy" to "country" (from "in Italy"), "2004" to "year" and "see pictures" to "task(select)". Using the ontology-based descriptions, (tree-)structures can be derived that formulate the relations between "firstname", "country", "year" and "task(select)". The resulting structures are equal to, or substructures of, the ontology-based descriptions and can in one embodiment be one or more trees. The ontology module is not restricted to generating exactly one solution to the problem of creating relations between the semantic atoms; there can be more than one (or even no) solution to this problem. This could as an example comprise 20 different ways of combining the semantic atoms in one or more trees. Each of the potential representations of the user input is a so-called hypothesis consisting of one or more trees. In the hypothesis selection (Hypoth. Sel.) 211 each hypothesis obtains a "score" based on predefined criteria. The criteria could e.g. comprise the number of atoms from the user input contained in the trees, the relative path length between these atoms or ratings from previous processing steps. As an example, a hypothesis combining "firstname=John" and "country=Italy" in one tree would have a higher score than a hypothesis having "firstname=John" and "country=Italy" in separate trees. The hypothesis having the highest score would in one embodiment typically be the one having all the atoms, "firstname=John", "country=Italy", "year=2004" and "task(select)", represented in one tree as close as possible to each other in terms of the number of edges necessary in the connecting tree, i.e. arranged in a compact way. Therefore, the hypothesis selection (Hypoth. Sel.) 211 selects the hypothesis having the best score, i.e. the best representation of the user input from the dialogue-system point of view. This hypothesis containing one or more (sub-)trees is passed to the dialogue manager (Dial. Man.) 213. In this case the tree with "firstname=John", "country=Italy", "year=2004" and "task(select)" is sent to the appropriate application, e.g. an image browser application, through the application manager (Appl. Man.) 201. The image browser application, or an appropriate extension of an existing image browser application, is preprogrammed in such a way that the image or images may easily be extracted based on said trees.
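To make the scoring step concrete, the following toy scorer implements the two criteria named above, atom coverage and compactness; the interface, method names and weighting factor are invented assumptions:

```java
// Hypothetical hypothesis scorer: more covered atoms raise the score, longer
// connecting paths lower it, so one compact tree containing "firstname=John"
// and "country=Italy" outscores two separate trees holding the same atoms.
interface HypothesisTree {
    int coveredAtoms();     // atoms from the user input contained in the tree(s)
    int connectingEdges();  // edges needed to connect those atoms
}

class HypothesisScorer {
    double score(HypothesisTree h, int atomsInUserInput) {
        double coverage = (double) h.coveredAtoms() / atomsInUserInput;
        double compactnessPenalty = 0.1 * h.connectingEdges(); // weight is an assumption
        return coverage - compactnessPenalty;
    }
}
```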
The dialogue manager 213 handles the selected hypothesis and, based on the results of the previous processing steps, decides whether or not to contact the application manager (usually it does). Also based on the previous processing steps and, as the case may be, on the response of the application, the dialogue manager computes how to interact with the user in the next generated output. When an application is contacted by the hypothesis being sent to it, the application can enhance this hypothesis. In addition to the enhanced hypothesis, it can also return processing results to the dialogue manager (through the application management). Such a processing result could in the above example be a list of trees, each representing one single image.
The processing result of the initial user request, i.e. "see pictures from John in Italy 2004", is now stored in the dialogue history (Dial. Hist.) 207, which collects information produced during the interaction with the user. Generally, the dialogue history (Dial. Hist.) 207 integrates collected information with the recent user input. As an example, if the returned list of pictures contains 200 pictures, the user could be more specific and add the month, e.g. "January", to the representation that resulted from the initial request, i.e. a tree derived from "firstname=John" and "country=Italy" and "year=2004". In that way the collected history tree is integrated with the present tree, resulting in a tree comprising "firstname=John" and "country=Italy" and "year=2004" and "month=January". If there is some kind of failure, or the user is to receive some information, e.g. "request is being processed", an output may be generated (Outp. Gen.) 215 and presented to the user. The dialogue manager can also trigger output based on the hypothesis or result from the application, e.g. to inform the user what the system thinks (firstname=John etc.), present the result to the user (e.g. if only 20 pictures have been selected), inform the user (too many pictures, if 200 pictures were selected) or request additional information from the user (if 200 pictures were selected: ask the user to specify a month).
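The month-refinement step could look roughly as follows; flattening the trees to attribute/value pairs here is a simplification for brevity, and all names are hypothetical:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy integration of recent input into the stored history representation.
class QueryAtoms {
    final Map<String, String> atoms = new LinkedHashMap<>();

    QueryAtoms with(String ontologyClass, String value) {
        atoms.put(ontologyClass, value);
        return this;
    }

    // History + new input: new values for the same ontology class win.
    QueryAtoms refineWith(QueryAtoms recentInput) {
        QueryAtoms merged = new QueryAtoms();
        merged.atoms.putAll(this.atoms);
        merged.atoms.putAll(recentInput.atoms);
        return merged;
    }
}

// Usage: the history {firstname=John, country=Italy, year=2004} refined with
// {month=January} yields all four constraints, as in the example above.
```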
In one embodiment the application manager 201 constructs a coherent grammar based on the unification of the individual grammars from the applications or, more generally, it constructs application-specific knowledge sources. The application manager also distributes the interpreted user input, i.e. the hypothesis, to the applications and collects the results.
An application analyzes the user speech command and performs the required action. This can be to turn on the TV, or to provide the dialogue system with said requested pictures from Italy, etc., in order to present said pictures to the user. Both steps are represented in the application's response, which consists of two parts. The first part is mandatory and contains information about the analysis of the user input. Each value can be augmented by one of three tags: "user" for user-supplied information taken "as is"; "computed" for a more specific representation derived from the user speech command after processing, e.g. if a user has specified "today" as date, the application replaces this with e.g. "15-02-2004"; and "default" if a default was used to augment the user speech command, e.g. by assuming the current date for a query about the TV program. Thus, the core system can inform the user about the actual query values in a transparent way. If essential information is missing, the application can add a new leaf to the has-a tree with the tag "request", indicating that the core system should ask the user to supply a value for a specific item. If a crucial action, e.g. a money transfer, requires extra validation, the application can supply a tag "verify", and the core system should then ask a pertinent verification question. If a set of possible alternatives exists for one value, the application can supply the set and add a "select" tag. Thus, the application can influence the behavior of the core system while maintaining a thin interface based on the ontological description.
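As an illustration of the mandatory first part of the response, a tagged leaf value might be represented as sketched below; the enum and class are assumptions, not the disclosed data format:

```java
// "user": value taken as-is; "computed": derived from the user's wording
// (e.g. "today" -> "15-02-2004"); "default": filled in by the application.
// "request"/"verify"/"select" steer the core system's next question.
enum ValueTag { USER, COMPUTED, DEFAULT, REQUEST, VERIFY, SELECT }

class ResponseLeaf {
    final String ontologyClass; // e.g. "date"
    final String value;         // e.g. "15-02-2004" (null for a "request" leaf)
    final ValueTag tag;

    ResponseLeaf(String ontologyClass, String value, ValueTag tag) {
        this.ontologyClass = ontologyClass;
        this.value = value;
        this.tag = tag;
    }
}

// e.g. new ResponseLeaf("date", "15-02-2004", ValueTag.COMPUTED)
//      new ResponseLeaf("month", null, ValueTag.REQUEST) // ask the user for a month
```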
If the application has sufficient information to perform an action, the result is expressed in a second part of the response as a sequence of has-a trees according to the ontology. Each tree can represent one result. An example of this is a TV program matching the restrictions imposed by the user, or said list of pictures. The application result representations proceed to a method that converts the abstract ontology-based representations to other representations suitable for output, depending on further processing by the dialogue manager. Since these outputs are still organized in trees compliant with the ontologies, the dialogue history uses these ontology-based representations to update its knowledge of the discourse, i.e. these trees are integrated into the internal representations of the discourse. As an example, existing knowledge in the discourse knowledge storage can be updated or new knowledge added. Besides providing knowledge for the update of the dialogue history, in one embodiment the output representations formulated in XML are converted by an XSL style sheet to a textual representation suitable for speech synthesis, and to HTML for graphical representations. To automatically update the HTML page in a web browser each time new information has to be presented, an active component (e.g. a Java applet) automatically triggers the reload of the pages.
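The XML-to-text and XML-to-HTML conversions could be carried out with the standard Java XSLT API (javax.xml.transform); the style-sheet and file names in the sketch below are assumptions:

```java
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class OutputRenderer {
    // Transform the ontology-based XML result trees with an XSL style sheet,
    // once into plain text for speech synthesis and once into HTML.
    void render(String resultXml, String stylesheet, String outFile) throws Exception {
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(stylesheet));
        t.transform(new StreamSource(resultXml), new StreamResult(outFile));
    }
}

// e.g. new OutputRenderer().render("results.xml", "speech-text.xsl", "prompt.txt");
//      new OutputRenderer().render("results.xml", "html-view.xsl", "view.html");
```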
The stateless design of the components - except for the dialogue history - enables the flexible initialization of a dialogue by both the user and the applications.
Figure 3 illustrates in a more detailed way how a current user speech command may be integrated with the dialogue history 207 shown in Fig. 2. Both the user input and the dialogue so far are represented as "knowledge trees" which are hierarchically organized (trees representing has-a relations; "properties" in the nomenclature of the ontology). Figure 3 (a) shows trees for integration 301, 307 along with a relation of their root elements 309. In order to integrate these at least two knowledge trees 301, 307 (which can be considered to represent current and previous information), the relation between these trees is acquired by computing the relations of their tree-root elements 309. This relation is calculated, resulting in a skeleton tree being provided for the integration structures, in one embodiment based on the ontological descriptions. Depending on properties of the elements of the knowledge trees (e.g. age of the root elements: youngest first), these trees are integrated one by one into the tree providing the skeleton for the integration of the remaining knowledge trees not integrated so far. Each node from the knowledge tree that is integrated into the skeleton tree may replace an existing node, enhance an existing node, replace an existing node while adopting information from the existing node, or be discarded. This is outlined in Fig. 3 (b) and (c). Here the following assumption has been made: T1 1-1 and T2 2-1, and also T1 2-2 and T2 3-2, are nodes having the same functional type, along with the root relation 309. As shown in Fig. 3 (b), the potential results are shown when T2 overlaps T1 303 and elements of T2 overlap T1, whereby children of overwritten nodes are not copied 311. In Fig. 3 (c) T1 1-1 and T2 2-1 are non-unique children 305 and T1 2-2 and T2 3-2 are non-unique children (lacking a special marker; e.g. lists demand non-unique siblings as list elements). This could mean that the new child is positioned as a brother of the existing child, and therefore their common parent node has to be adapted to this set-up. This integration algorithm can consider all possible integrations of the knowledge trees into the skeleton tree, thus creating a potentially larger number of different interpretations of the user input embedded in the history.
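A hedged sketch of the per-node integration decision described above follows; the Node interface and its method names are assumptions made for illustration:

```java
// One node of an incoming knowledge tree is merged against the node at the
// corresponding position of the skeleton tree.
interface Node {
    String functionalType();
    boolean isNonUniqueChild();            // e.g. list elements demand non-unique siblings
    void addSibling(Node other);           // attach as a brother under the common parent
    void adoptMissingChildrenFrom(Node n); // keep information from a replaced node
}

class TreeIntegrator {
    // Returns the node that ends up at this position of the skeleton tree.
    Node integrate(Node existing, Node incoming) {
        if (existing == null) {
            return incoming;                 // enhance: add a new node
        }
        if (!existing.functionalType().equals(incoming.functionalType())) {
            return existing;                 // discard the incoming node
        }
        if (existing.isNonUniqueChild()) {
            existing.addSibling(incoming);   // non-unique children become siblings
            return existing;
        }
        incoming.adoptMissingChildrenFrom(existing); // replace, adopting old information
        return incoming;                     // children of overwritten nodes may be dropped
    }
}
```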
During the construction of a coherent interpretation of the user input including knowledge of the discourse, various possible readings are obtained, resulting from different ways of integrating existing knowledge into the interpreted current user input. The dialogue history and the hypothesis selection component rate each possible reading by computing a number of scores out of the features that can be derived from the interpretation and by taking into account scores of preceding processing stages (speech recognition, speech understanding, new parts in an interpretation vs. old parts, mean ages), the existence of information (e.g. the occurrence of a special element in the resulting tree like 'task') and the relation of information newly provided by the user to the existing information (e.g. whether the information given by the user fits information requested by the system during previous interaction). The age of information to be integrated into the coherent interpretation might be restricted, e.g. considering at most the last three turns. Information could become outdated and/or worthless for the interpretation of the current user's speech command after a certain time or after a certain number of interaction turns. Taking into account all the information gathered from the interactions with the user so far could lead to a significant decrease in processing speed due to the large amount of possible integration results with the current user speech command. The handling of information originating from applications other than the one(s) being in the focus of the current user speech signal is based on relationships between said ontologies derived by the ontology model 205. In one embodiment a dialogue history 207 component provides a linear list of atomic elements storing the information from the hierarchical interpretations. Each atomic unit carries information on its children in the hierarchical form, and also stores the temporal modifications of this atomic unit during the discourse. This enables the user to navigate through the discourse history in a browser-like way to restore previous discourse states. The resulting interpretation of the user input forms the basis of the communication with the applications.
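The age restriction could be realized as a simple filter over the history, as sketched below with illustrative names; the three-turn limit follows the example in the text:

```java
import java.util.List;
import java.util.stream.Collectors;

// One stored interpretation together with the dialogue turn it came from.
class HistoryItem {
    final String interpretation;
    final int turn;

    HistoryItem(String interpretation, int turn) {
        this.interpretation = interpretation;
        this.turn = turn;
    }
}

class HistoryFilter {
    static final int MAX_TURN_AGE = 3; // assumption: "at most the last three turns"

    // Only recent items take part in the (combinatorially expensive)
    // integration with the current user input.
    List<HistoryItem> eligible(List<HistoryItem> history, int currentTurn) {
        return history.stream()
                .filter(h -> currentTurn - h.turn <= MAX_TURN_AGE)
                .collect(Collectors.toList());
    }
}
```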
Generally, the illustrated root 309 may e.g. be a power switch root represented in an appropriate ontology. This root may be used as a child in various applications which require such power switches, e.g. TV or VCR, by means of importing it into the various ontology domains of various devices. Therefore, the power switch, which is typically used in all the applications, does not have to be defined for each individual application. Thereby, a tree in the TV ontology having the TV device at the top and the node therebelow (the child) being the power switch characterizes the power switch for the TV device. In a similar way, a general representation of some other property, e.g. time, may be used as a child for different applications.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design many alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word 'comprising' does not exclude the presence of other elements or steps than those listed in a claim. The invention can be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In a device claim enumerating several means, several of these means can be embodied by one and the same item of hardware. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.

Claims

CLAIMS:
1. A dialogue system enabling a user to control multiple applications based on mono-modal or multi-modal input from a user, comprising speech input, wherein said system is an intermediate layer between said applications and said user, said system comprising: - means for receiving and storing application-specific knowledge sources for each of said applications, wherein said knowledge sources for each of said applications are represented in a common application-independent way,
- means for receiving said input from said user,
- means for processing in an application-independent way said input comprising using said stored application-specific knowledge sources for determining one or more of said applications for which said received input is intended,
- means for forwarding said processed input to said determined applications.
2. A dialogue system according to claim 1, wherein said means for processing said input further comprises using a history based on previous input received and processed by said dialogue system and said applications.
3. A dialogue system according to claims 1-2, wherein said knowledge sources are represented by data comprising ontology-based descriptions defining the degrees of freedom of said application.
4. A dialogue system according to claims 1-3, wherein said knowledge sources are further represented by data comprising the grammar of said application.
5. A method of controlling multiple applications based on mono-modal or multi-modal input from a user, comprising speech input, wherein said method comprises:
- receiving and storing application-specific knowledge sources for each of said applications, wherein said knowledge sources for each of said applications are represented in a common application-independent way,
- receiving said input from said user,
- processing in an application-independent way said input comprising using said stored application-specific knowledge sources for determining one or more of said applications for which said received input is intended,
- forwarding said processed input to said determined applications.
6. A method according to claim 5, wherein said processing of said input further comprises using a history based on previous input received and processed by said applications.
7. A method according to claims 5-6, wherein said knowledge sources are represented by data comprising ontological descriptions representing the degrees of freedom of said application.
8. A method according to claims 5-7, wherein said knowledge sources are further represented by data comprising the grammar of said application.
9. A computer readable medium having stored therein instructions for causing a processing unit to execute a method according to claims 5-8.
10. An application which is adapted to be connected to a dialogue system according to claims 1-4, wherein said application comprises application-specific knowledge sources adapted to be shared with said dialogue system when connected with said dialogue system.
PCT/IB2005/052521 2004-08-06 2005-07-27 Ontology-based dialogue system with application plug-and-play and information sharing WO2006016307A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP04103798 2004-08-06
EP04103798.7 2004-08-06

Publications (1)

Publication Number Publication Date
WO2006016307A1 true WO2006016307A1 (en) 2006-02-16

Family

ID=35159895

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2005/052521 WO2006016307A1 (en) 2004-08-06 2005-07-27 Ontology-based dialogue system with application plug-and-play and information sharing

Country Status (1)

Country Link
WO (1) WO2006016307A1 (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0911808A1 (en) * 1997-10-23 1999-04-28 Sony International (Europe) GmbH Speech interface in a home network environment
EP1045374A1 (en) * 1999-04-13 2000-10-18 Sony International (Europe) GmbH Merging of speech interfaces for concurrent use of devices and applications
US20010041980A1 (en) * 1999-08-26 2001-11-15 Howard John Howard K. Automatic control of household activity using speech recognition and natural language

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2007073977A1 (en) * 2005-12-21 2007-07-05 Siemens Enterprise Communications Gmbh & Co. Kg Method for triggering at least one first and second background application via a universal language dialogue system
US8494862B2 (en) 2005-12-21 2013-07-23 Siemens Enterprise Communications Gmbh & Co. Kg Method for triggering at least one first and second background application via a universal language dialog system
EP1884924A1 (en) 2006-08-03 2008-02-06 Siemens Aktiengesellschaft Method for creating a context-based voice dialogue output in a voice dialogue system
US10963293B2 (en) 2010-12-21 2021-03-30 Microsoft Technology Licensing, Llc Interactions with contextual and task-based computing environments
US20120166522A1 (en) * 2010-12-27 2012-06-28 Microsoft Corporation Supporting intelligent user interface interactions

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KM KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NA NG NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SM SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): BW GH GM KE LS MW MZ NA SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LT LU LV MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 EP: the EPO has been informed by WIPO that EP was designated in this application
NENP Non-entry into the national phase

Ref country code: DE

122 EP: PCT application non-entry in European phase