WO2001091110A1 - Stand-alone device comprising a voice recognition system

Info

Publication number
WO2001091110A1
Authority
WO
WIPO (PCT)
Prior art keywords
semantic network
stand-alone device
user
semnet
Application number
PCT/EP2001/005945
Other languages
French (fr)
Inventor
Nour-Eddine Tazine
Frédéric SOUFFLET
Goulven Querre
Serge Le Huitouze
Christophe Delaunay
Pierrick Jouet
Izabela Orlac
Original Assignee
Thomson Licensing S.A.
Application filed by Thomson Licensing S.A.
Priority to AU2001267478A1
Publication of WO2001091110A1

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/18 - Speech classification or search using natural language modelling
    • G10L15/1815 - Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 - Execution procedure of a spoken command
    • G10L2015/226 - Procedures used during a speech recognition process using non-speech characteristics
    • G10L2015/228 - Procedures used during a speech recognition process using non-speech characteristics of application context

Abstract

The invention concerns a stand-alone device comprising a voice recognition system and a mass storage device. The device moreover comprises a natural language processing, at least one semantic network which defines a domain, and a database containing information with attributes. The natural language processing receives a recognized command of a user and sends a request to the semantic network. The semantic network is constituted as a graph defining the elements of the domain and the links between these elements. The semantic network searches the domain for the answer to the request.

Description

Stand-alone device comprising a voice recognition system
The present invention concerns a voice-controlled device for the home, comprising a flexible voice-controlled user interface.
The object of the invention is a stand-alone device containing a voice recognition system and a mass storage device, characterized in that it comprises moreover a natural language processing, at least one semantic network which defines a domain, and a database containing information with attributes, the natural language processing receiving a recognized command of a user and sending a request to the semantic network, the semantic network being constituted as a graph defining the elements of the domain and the links between these elements, the semantic network searching the domain for the answer to the request.
Other characteristics and advantages of the invention will appear through the description of a preferred embodiment of the invention. This embodiment will be described in relation with the drawings, among which:
Figure 1 is an illustration representing two possible appearances of the device.
Figure 2 is an illustration detailing the external features of a device of figure 1.
Figure 3 is a UML diagram of a context manager interface.
Figure 4 is a diagram of a collectiviser structure of the context manager of figure 3.
Figure 5 is a diagram of a stocker structure of the context manager of figure 3.
Figure 6 is a diagram of a nominator structure of the context manager of figure 3.
Figure 7 is a block diagram of a semantic network representing the knowledge in one particular domain, in this case for an Electronic Program Guide (EPG) application.
Figure 8 is a block diagram of a semantic network for a device command and control application.
Figure 9 is a diagram of a first basic relation in a semantic network.
Figure 10 is a diagram of a second basic relation in a semantic network.
Figure 11 is a diagram of a third basic relation in a semantic network.
Figure 12 is a diagram of a variant of relation in a semantic network.
Figure 13 is a diagram of a 'role' relation.
Figure 14 is a diagram of the iterative steps required to create a new context according to the embodiment.
Figure 15 is a flowchart of a tool used to create a new context.
Figure 16 is a diagram of different steps used to generate language models for different languages.
Figure 17 is a diagram of a tool software architecture.
1.1. Introduction and link between Home Assistant and the voice-based EPG
In our former project "voice-based EPG", we demonstrated, by building a prototype, that a voice-operated application for a complex albeit restricted domain is feasible.
However, we also have noticed that the development of such an application from scratch is quite costly.
Building on top of this experiment, we propose to define and build a framework to rationalize and ease the development of voice-controlled applications. Central to this framework is the architectural split between generic components implementing the common tasks of a voice-based application (voice signal processing, speech recognition, text to speech, speaker authentication, ...) and pluggable modules for task-specific data. This will lead to a much shorter development time for a particular voice-operated application, because specific modules will be more high-level and generic components will be reused. This separation will also open the way to multiple applications being simultaneously active.
The Home Assistant (HA) is an implementation of this framework. The HA will be a physical device able to run many voice-operated applications and to dynamically load/unload them from distant servers.
1.2. Introduction
The Home Assistant is a stand-alone device you can talk to spontaneously, almost as you would with a human being. It responds in the same way, through a text-to-speech module, or it may display information on a screen. The Home Assistant is moveable and can work alone. It contains communication means to communicate with a network and download information. From such a simple description, it could be thought of as a robot, so to avoid confusion it is important to nail down the main differences between the Home Assistant and apparently similar devices from other companies.
1.3. Features of Home Assistant
Let's start by describing the physical embodiment of HA. Figure 1 shows a possible embodiment of a home assistant. Figure 2 shows some of its externally visible features.
HA is a voice-operated device, basically consisting of a display, a microphone and loudspeakers. Weighing a few kilograms, it can be easily moved by hand from place to place, but is not designed to be portable. Thus, it will presumably be somewhere in the living-room.
Two-way vocal interaction with the HA will be possible by speaking right in front of it at a distance of 1 or 2 meters, and also anywhere in the home, through small, remote-control-like devices with microphones and loudspeakers that will lie on docking stations placed in most rooms.
The most salient features of HA will be:
• a display to present results to the user or to animate HA's face;
• Text-To-Speech technology to allow distant interactions through wireless devices;
• a high-quality audio system, to allow audio CD and MP3 playing;
• speaker recognition, for greater customization of the interaction;
• memory of past interactions, for a more natural dialog;
• connection to various digital links, which allows: easy download of new applications for the HA via dedicated servers, very natural interaction to access data on the internet, control of every digital device present at home (IEEE 1394 link), and distant interaction with HA through classical telephony.
1.3.1. The notion of knowledge modules
The HA's knowledge is organized into independent modules. Many modules may coexist simultaneously in the HA and new ones may be downloaded at any moment, increasing its "intelligence" (i.e. its ability to understand discourse domains) accordingly.
There are technical and economic reasons for that:
• to keep individual modules to a manageable size, thus allowing parallel development by different teams;
• to allow early availability of the product, through careful choice of target application domains;
• to shorten time to market for new, trendy application domains;
• to allow new features to be integrated at any time, thus avoiding premature obsolescence of the product;
• to allow dedicated Thomson servers to provide HA's applications, thus keeping a close commercial link with end users.
1.3.2. Typical interactions: some use cases
To illustrate how a user could interact with HA, below is a possible sequence of interactions with it (it assumes that some modules are loaded: "EPG", "Device Control", "Weather Forecast", "Encyclopedia").
User: "Any french comedy playing tonight?"
[HA's EPG module scans the TV program and responds]
HA: "Jour de fete, from Jacques Tati, will play at 1 1 , tonight on Cinestar" User: "Great! I don't have any copy yet. Record it for me, please"
[HA's Device Control module sets up the TV receiver and video recorder to record the movie at 11:00pm]
User: "Will I need my raincoat tomorrow?"
[HA's Weather Forecast module retrieves the forecast from the Internet and informs the user]
User: "Have I any appointment with my dental surgeon in the coming week?"
[HA's Diary module consults the speaker's personal diary and answers accordingly]
User: "Retrieve the article on Schleswig-Holstein you showed me yesterday"
[HA's Encyclopedia module finds the article and displays it on the HA]
User: "Well, that's really a good one! Send it to Marylene!"
[HA's Encyclopedia module composes an e-mail with the article and sends it to Marylene]
1.3.3. Large Vocabulary, Spontaneous Speech Recognition
HA will recognize spontaneous speech (i.e. complete sentences, not only isolated words) with its associated large vocabulary. It will also treat some hesitations (mumbling and silence), but not change of mind. From user tests we have conducted, we have observed that mumbling and silence cover more than 50% of all human speech hesitations.
Interactions with the HA will be possible via wireless microphones anywhere in the house, close-talking when standing in front of the HA, and also telephony (mobile or not) when away from home.
1.3.4. Speaker Dependency and Identification
When first bought, HA will use a user-independent recognition profile to cope with any possible user.
Users will be able to initiate an explicit training session in order to increase the recognition rate in a significant manner. This higher recognition rate will become more and more useful as the number of different modules loaded in the HA grows. In addition to increasing the recognition rate, this initial training will allow automatic, fine-grained adjustment of the recognition for the identified user. It will also allow speaker identification, which is highly desirable to adapt the interactions to the particular user currently using the HA. This adaptation may even include particular modules that would be usable only by certain people in the home.
1.3.5. Text to Speech
The HA will interact visually through its display, but also vocally, through a Text-To-Speech interface. This interface will be available with multiple voice "styles", allowing a particular user to choose a preferred voice.
Of course, given the ability to identify the speaker (see above), each user at home will choose his/her own preferred voice for greater usage comfort.
1.3.6. Main Board
1.3.6.1. Platform
• A microprocessor,
• RAM,
• A mass storage device for the operating system, the recognition engine and the TTS feature, plus the Home Assistant application, all grammars (~1-7 Mb each), the associated semantic networks, etc.,
• Several Home Assistant microphone/display sets connected to the home network to which the Home Assistant main system is connected.
1.3.6.2. Operating System
This may be a commercially available system used for personal computers.
1.3.6.3. Speech recognition engine
This may be a commercially available system used with personal computers.
1.3.7. Home Assistant Microphones
As we saw previously, the Home Assistant system is not mobile, yet the user must be able to use it from anywhere in the space available to him. This is why wireless microphone technology offers several important features:
• It allows the user to control the Home Assistant without obstacle, wherever he is with respect to the system;
• It allows the system to localize the user in space, and thus to automatically offer information, or to supply the wanted information, where the user is;
• Thanks to the "close talking" technique integrated in every mono-directional microphone, arranged in different places and each offering the same features, the problems associated with "distant talking" can at first be left aside, while still allowing this future feature in a real, noisy environment. The microphone does not necessarily have to be of very high quality; nevertheless, its characteristics will mark the training phase and will thereby determine the recognition rate of the system. A loss of 20% of the recognition rate can be attributable to bad use of the microphone or to a mismatch between the microphone and the trained voice model (bad pairing).
1.3.8. Feedback systems
Combined with the localization of the user in space by the microphones, contextual information, or information requested by the user, is delivered where he is, through one or several of the following media:
• A display on one or several screens of identical or different characteristics (low resolution, high resolution, TV screen, etc.),
• Vocal synthesis through a text-to-speech (TTS) engine,
• Any actions controlled by the Home Assistant (TV zapping, MP3 playing, etc.). See the use cases of the Home Assistant.
1.4. Software description
1.4.1. Introduction: the problem to be solved
The Home Assistant must therefore be based on a software architecture that takes into account the notion of domains (TV, Internet, home automation, etc.) and offers the user the possibility to move intuitively between several domains, and to return to one of them without losing the navigation history in that domain.
This can be obtained by developing a generic kernel including one or several voice recognition engines and a domain manager. A domain can then be represented as the set constituted by a specific semantic network, a context, a grammar and a vocabulary typical of the domain and, additionally, the associated data.
The domain manager then administers the loading and unloading, within the application, of the complete sets (contexts, grammars, etc.) and the passage from one domain to another, both at the level of the recognition engine (grammars, analyser, etc.) and for the elements not associated with voice recognition (database, display, TTS, etc.).
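As an illustration of this kernel/domain split, here is a minimal sketch, assuming hypothetical class and method names (it is not the patented implementation): a domain bundles its semantic network, context, grammar, vocabulary and data, and a domain manager loads, unloads and switches domains while each domain keeps its own context.

    from dataclasses import dataclass, field

    @dataclass
    class Domain:
        """One domain bundle: SEMNET + context + grammar + vocabulary + data."""
        name: str                    # e.g. "TV", "Internet", "home automation"
        semnet: object               # the domain's semantic network
        context: object              # dialog context (navigation history)
        grammar: object              # recognition grammar for this domain
        vocabulary: set = field(default_factory=set)
        data: dict = field(default_factory=dict)   # associated database content

    class DomainManager:
        """Loads/unloads domains and switches between them, keeping each
        domain's context so the user can return without losing history."""
        def __init__(self):
            self.domains: dict[str, Domain] = {}
            self.active: Domain | None = None

        def load(self, domain: Domain) -> None:
            self.domains[domain.name] = domain    # e.g. downloaded from a server

        def unload(self, name: str) -> None:
            self.domains.pop(name, None)

        def switch_to(self, name: str) -> Domain:
            # The previous domain keeps its context object untouched, so a
            # later switch back restores the navigation history in it.
            self.active = self.domains[name]
            return self.active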
The other important part to develop is the tool for constructing these domains: by creating new capabilities (new semantic networks, etc.), it has to offer the features necessary to cover the varied domains of the Home Assistant system.
1.4.2. Software Architecture
1.4.2.1. Global Architecture
1.4.2.2. The Context Manager
1.4.2.2.1. Introduction
The context manager (CM) is responsible for handling a smooth dialog. The context manager analyses and stores at least the previous request and defines the context of the dialog. In order to further explain how the CM works, we first recall the three main steps of a dialog. Then we show a few sample man-machine exchanges which illustrate the different kinds of actions which must be performed by the CM.
We then classify those typical actions and give a first idea of the structures which have to be implemented in order to perform these actions. Then we show how the CM interfaces with the other Home Assistant modules through these structures.
Finally, we give a few further details about the internal structure of the CM.
1.4.2.2.2. The three main steps of a dialog
A dialog can be divided into dialog exchanges. Each dialog exchange is made of the following steps:
1. The user presents a demand to the system;
2. The system evaluates the demand, extracting its meaning, verifying its coherence, and retrieving all the items which answer the demand;
3. The system sorts the retrieved items in the appropriate order, chooses some of them and presents them to the user.
In this situation, the role of the context manager consists in the following:
1. It receives the user demand from the recognition module, after this demand has been parsed and its meaning has been extracted;
2. It analyses the demand to determine how it can be answered;
3. It performs the appropriate actions to give the appropriate answer;
4. It sends the answer to the Features Manager (FEM), for it to be displayed to the user.
The most important work of the context manager resides in step 3. Hence, determining the appropriate actions to perform in order to give the appropriate answer to a demand is crucial for exhibiting the internal structure of the CM.
1.4.2.2.3. A few sample dialog exchanges
So, let's analyze the four typical dialog exchanges below:
1. User: "I'd like a movie". The system answers.
   User: "And what's after this?"
2. User: "Is there a football match?" The system answers.
   User: "And is there another one?"
3. User: "What's on the first channel right now?" The system answers.
   User: "And on the second one?"
4. User: "I want a western please". The system answers.
   User: "I'd like to see cycling". The system answers.
   User: "Is there a tennis match?" The system answers.
   User: "Could you show me the previous western?"
1.4.2.2.4. The main CM actions classification
In the first example, the user is considering the answer given by the system as a reference item and so, in his second question, he is asking for another item which is related to this reference item. In NLP terminology, this reference item is called "the nominator" because it has the same "name" for both locutors. Therefore, in order to give an appropriate answer to the second user question, the context manager must perform a "switch on the nominator".
The second example illustrates the fact that the user supposes the system is able to give not only one answer to his first question but all of them. In order to do so, the context manager must store them into a list we call "the stocker". In his second question, he is asking for another item of this list. The appropriate action to be taken in this situation is called a "switch on the stocker".
In the third example, the user supposes that the system recalls his last demand. The place where the context manager stores this demand is called "the collectiviser" because it generally defines an ordered set of items by giving its corresponding collectivising relation and the sort order in which the items of the set should be presented. In his second question, the user does not reformulate his whole demand (which should be "What's on the second channel right now"), but only the part of the collectivising relation which is different from the first one. Therefore, the appropriate action which must be taken by the CM is called a "switch on the collectiviser".
In the last example, the user supposes that the system is able to recall previously displayed answers without having to reformulate the whole demand which generated them. In order to do so, the context manager must record all the items which became the nominator at different times in what we call "the nominators". The action which must be performed by the CM to answer the last user question is called a "context callback".
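Taken together, these four actions can be transcribed as a small enumeration; a sketch with hypothetical identifiers, matching each action to the sample dialog that triggers it:

    from enum import Enum, auto

    class CMAction(Enum):
        """The four main context-manager actions named in the text."""
        SWITCH_ON_NOMINATOR = auto()      # "And what's after this?"
        SWITCH_ON_STOCKER = auto()        # "And is there another one?"
        SWITCH_ON_COLLECTIVISER = auto()  # "And on the second one?"
        CONTEXT_CALLBACK = auto()         # "Could you show me the previous western?"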
Hence, the three main elements we need to handle a smooth dialog are:
1. The most recent user demands;
2. The last ordered set of items from which the system chose the most recent answer;
3. The most recent answers which were given to the user.
These three elements form what we call "the context" of the dialog. In order to properly handle the context, the CM has the following structures:
• The collectiviser, which contains a representation of the user demand;
• The stocker, which contains the possible answers to this demand;
• The memory (or nominators), which contains a list of items which were referred to by both the user and the system at different times.
1.4.2.2.5. The CM interface
As the collectiviser contains a representation of the user demand, it is given to the CM by the recognition module.
The CM then converts the collectiviser into one or more complete and consistent requests. These requests are then sent one by one to a specialised module called "the query manager".
For each request, the query manager fills the stocker with the items which satisfy it.
After all the requests have been treated by the query manager, the stocker is sorted according to the order given in the collectiviser. Finally, one or more items are chosen in the stocker and sent to the module specialised in presenting the answers to the user, the Features Manager (FEM). As soon as an item is sent to the FEM, it is stored as a "nominator" so that it can be recalled later by the CM.
Figure 3 is a graphical representation of the context manager interfacing.
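Read as a whole, this interface is a pipeline from the collectiviser to the Features Manager. The sketch below is one possible rendering of the flow of figure 3; the query manager and FEM interfaces used here are assumptions, not the patent's API:

    def handle_demand(cm, collectiviser, query_manager, fem):
        """One dialog exchange through the CM interface (a sketch)."""
        # The CM converts the collectiviser into complete, consistent requests.
        requests = cm.build_requests(collectiviser)

        # The query manager fills the stocker with items satisfying each request.
        cm.stocker.items = [item
                            for request in requests
                            for item in query_manager.retrieve(request)]

        # The stocker is sorted according to the order given in the collectiviser.
        cm.stocker.items.sort(key=collectiviser.sort_key)

        # Chosen items go to the FEM and are remembered as nominators.
        for item in cm.choose(cm.stocker.items):
            fem.present(item)           # Features Manager displays the answer
            cm.memory.remember(item)    # stored so the CM can recall it later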
1.4.2.2.6. The internal structure of the CM
The collectiviser is made of two kinds of data:
• The pertaining criteria, which define the collectivising relation itself;
• The ascriptive criteria, which define the sort order of the set which has to be produced in the stocker.
Figure 4 is a graphical representation of the collectiviser.
The stocker is represented by two lists:
• The items list, which contains the set of items which have been retrieved by the query manager;
• The ascriptive list, which contains all the ascriptive criteria which have been extracted from the collectiviser and are used to sort the items list.
Figure 5 is a graphical representation of the stocker.
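Transcribed literally into data classes (the field names are hypothetical), the two structures look like this:

    from dataclasses import dataclass, field

    @dataclass
    class Collectiviser:
        """The user demand: a collectivising relation plus a sort order."""
        pertaining_criteria: list = field(default_factory=list)  # the relation itself
        ascriptive_criteria: list = field(default_factory=list)  # the sort order

    @dataclass
    class Stocker:
        """The possible answers to the current demand."""
        items: list = field(default_factory=list)            # retrieved by the query manager
        ascriptive_list: list = field(default_factory=list)  # criteria used to sort the items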
The context manager memory stores each item independently as soon as it becomes the nominator. Hence, in this document, the term "nominator" further designates an individual item which is stored in the context manager memory. As the memory cannot grow indefinitely, we choose to represent it as a list of items which has a fixed size. When the list is full, each new item replaces the least recently used one. Hence, a nominator is represented with the following attributes:
• An individual item;
• A stamp which represents the last time and date at which the item was referred to in the dialog.
Figure 6 is a graphical representation of a nominator.
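Because the memory is a fixed-size list where each new item replaces the least recently used one, it behaves like a small LRU cache. A minimal sketch, assuming items are hashable:

    from collections import OrderedDict
    from datetime import datetime

    class NominatorMemory:
        """Fixed-size store of nominators with least-recently-used eviction."""
        def __init__(self, size: int = 16):
            self.size = size
            self.items: OrderedDict = OrderedDict()  # item -> last-reference stamp

        def remember(self, item) -> None:
            if item in self.items:
                self.items.move_to_end(item)     # referred to again: refresh
            self.items[item] = datetime.now()    # the "stamp" attribute
            if len(self.items) > self.size:
                self.items.popitem(last=False)   # evict the least recently used

        def recall(self, predicate):
            """Context callback: find a past nominator, e.g. 'the previous western'."""
            for item in reversed(self.items):    # most recent first
                if predicate(item):
                    self.remember(item)          # recalling refreshes the stamp
                    return item
            return None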
1.4.2.3. Interface to each Context: The Semantic Network
SEMNET is an abbreviation for Semantic Network.
A SEMNET is a synthetic representation of the knowledge for one domain. With one SEMNET, we claim we can cover a large part of one domain. A domain is similar to an application (for example an EPG). As shown in the global architecture, a domain is associated with a database, a SEMNET, a Context Manager and grammars. The database contains elements such as the name of a movie, the name of an actor, a day of the week, a town, an identifier of a document (the title), etc. These elements are associated with attributes which define the topic of the element, such as "actor", "sport", etc. A set of selected attributes can define a domain. The domain contains, among other things, all the elements of the database that have one of the selected attributes.
The SEMNET ensures the consistency between these entities. As domains are very disparate, it is very important to define a generic way to formulate requests and to implement events. The SEMNET is the generic solution.
Here are the basics of SEMNET:
• A criterion is a basic element. It can be an element of a database or an attribute.
• An event is an association of criteria. An event can define a set of elements of the database that respect a list of attributes contained in a list of criteria.
• Criteria are linked with relations.
1.4.2.3.1. The SEMNET architecture.
Three basic relations are defined in SEMNET:
• The Is_A relation (see figure 9). This relation links an element of the database to at least one attribute of elements. For example, "Woody Allen IS_A actor": Woody Allen is an element of the database and actor is an attribute. Other links can exist; for example, in certain movies, Woody Allen is also a producer.
• The Is_AKindOf relation (see figure 10). This relation links an attribute with another attribute. For example, "football" and "sport" are two attributes; football is a kind of sport.
• The role relation (ROLE) (see figure 11). This relation links two criteria that are equivalent, or in other words synonymous. For example, a user can say "serial" as well as "movie" to designate the same concept.
This list of relations is not exhaustive. These relations link different criteria. As there are different ways to reach a criterion, the SEMNET is a graph defining the elements of the domain and the links between these elements.
When the user asks the HA anything, the semantic network searches the graph of the domain for the answer to the request. If a criterion corresponding to the request is reached, the HA or the semantic network acts according to the status of the reached criterion.
A status is associated with each criterion. This status influences the behavior of the criterion during the execution of the request and the search for the answer.
Five basic statuses are defined in SEMNET:
• The Displayable status.
• The Implicit status.
• The Input Point status.
• The Main status.
• The Non-Displayable status.
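Putting relations and statuses together, a SEMNET fragment can be sketched as a graph of criteria with typed edges. The sketch below (class names are hypothetical) encodes the EPG examples above: "Woody Allen IS_A actor", football Is_AKindOf sport, and serial/movie as ROLE synonyms.

    from enum import Enum, auto

    class Status(Enum):
        DISPLAYABLE = auto()
        IMPLICIT = auto()
        INPUT_POINT = auto()
        MAIN = auto()
        NON_DISPLAYABLE = auto()

    class Criterion:
        """A node of the SEMNET: a database element or an attribute."""
        def __init__(self, name: str, status: Status):
            self.name, self.status = name, status
            self.relations: list[tuple[str, "Criterion"]] = []  # typed edges

        def link(self, relation: str, other: "Criterion") -> None:
            self.relations.append((relation, other))

    # A tiny EPG fragment using the three basic relations.
    program = Criterion("program", Status.MAIN)
    sport = Criterion("sport", Status.DISPLAYABLE)
    football = Criterion("football", Status.DISPLAYABLE)
    actor = Criterion("actor", Status.IMPLICIT)
    woody_allen = Criterion("Woody Allen", Status.DISPLAYABLE)
    serial = Criterion("serial", Status.DISPLAYABLE)
    movie = Criterion("movie", Status.DISPLAYABLE)

    sport.link("Is_AKindOf", program)   # attribute under the main criterion
    football.link("Is_AKindOf", sport)  # attribute -> attribute
    woody_allen.link("Is_A", actor)     # database element -> attribute
    serial.link("ROLE", movie)          # synonyms for the same concept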
Figure 7 is a representation of a SEMNET for EPG and Figure 8 is a view of a SEMNET for Cmd&Ctrl.
1.4.2.3.1.1. The Is_A relation.
The Is_A relation allows the implementation of a Non-Displayable criterion.
1.4.2.3.1.2. The Is_AKindOf relation.
The Is_AKindOf relation increases the granularity.
1.4.2.3.1.3. The Exclusive entity.
It is a variant of the Is_A relation, illustrated by figure 12. An exclusive entity allows only one selected criterion.
1.4.2.3.1.4. The role.
An implicit criterion is attached to another criterion by a role. A role is an aggregation.
1.4.2.3.1.5. The Status.
• The displayable status.
This is a non-abstract criterion (for example, film is non-abstract). Displayable criteria are directly searchable. Displayable criteria are linked with Is_AKindOf relations.
• The implicit status. A role is ended by an implicit criterion. Implicit criteria aren't searchable alone. Implicit criteria are linked with Is_A relations.
• The Input Point Status.
These kinds of criteria are directly linked to the main criterion with an Is_AKindOf relation. In fact, this status defines a subfolder in a domain.
• The main status.
There is only one criterion with the main status. This one holds the whole SEMNET. The criterion with the main status holds all the master roles.
• The non-displayable status.
This is an abstract criterion and also an implicit criterion. Non-displayable criteria aren't searchable. Such a criterion doesn't contain pertinent information.
1.4.2.3.2. How does it work?
The SEMNET implements methods for building events and, especially, good requests.
A grammar is decorated with several generation points. These points reference criteria to compose requests or events.
An event is built from well-known information. The request, on the other hand, is the result of grammar analysis. It is very difficult to ensure a good, non-ambiguous request. This is why the SEMNET implements methods to solve these problems. The major features are:
• SEMNET allows incomplete requests.
For example, suppose the request " I want something with Julia Roberts".
Julia Roberts is only an actor of cinema. The Database's engine searches in the cinema subfolder.
• SEMNET detects ambiguous requests.
For example, suppose the request " I want something with Woody Allen". W. Allen is an actor of cinema but also an actor of theatre. The returns of the request can be all the events with Allen as theatre's actor and as cinema's actor.
In fact, the SEMNET allows interactive dialogs: "Allen as a theatre actor, or Allen as a cinema actor?"
• SEMNET detects bad requests.
Some wrong requests could be generated; the SEMNET filters them out. The HA does not send any information in answer to a bad request. A variant consists in sending a clarifying question in answer to the wrong request.
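These three behaviours amount to resolving a request against the graph: complete it when every criterion has a single interpretation, ask back when one is ambiguous, and filter it when one matches nothing. A hedged sketch (not the patented method), assuming an index from names to their interpretations in the SEMNET:

    def resolve(request_criteria, semnet_index):
        """Classify a request against the SEMNET: ok, ambiguous, or bad."""
        matches = []
        for name in request_criteria:
            # All interpretations of the criterion, e.g. "Woody Allen" as
            # cinema actor and as theatre actor.
            found = semnet_index.get(name, [])
            if not found:
                return ("bad", None)     # filtered: no answer, or ask back
            matches.append(found)
        if any(len(found) > 1 for found in matches):
            # Interactive dialog: "Allen as a theatre actor, or as a cinema actor?"
            return ("ambiguous", matches)
        # Incomplete requests are completed from the graph: "Julia Roberts"
        # is only a cinema actress, so search the cinema subfolder.
        return ("ok", [found[0] for found in matches])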
1.4.3. Creating and Developing a new context: user test loops
In order to develop a new context, our methodology is based on iteration loops: user tests - improvement of the Language Model - second user tests - etc.
We need at first a small LM, built according to what we think represents a minimal set of queries that should be admitted in this context.
A first internal test allows us to improve this first LM and to make it more robust. It is then possible to build a mock-up of the future application, which will be tested with external users. These tests will allow us to constitute a linguistic corpus, with which we will define the corresponding semantic templates. A series of iterative loops (tests and improvement of templates and LM) will then be made until the users are satisfied. This step is dependent on a given language.
See figure 13.
In order to create a new context, we have the following elements:
• The corpus collected during the user tests: WAV files,
• The results of recognition: XLS files, containing the questions asked by users, the corresponding recognized sentences, the requests sent to the database and the responses given by the database,
• Our tool.
With these elements, we proceed as shown in figure 14.
For each sentence, there are two possibilities (a classification sketch follows this list):
• The sentence is recognized: the sentence is kept and we shall verify after each loop that it remains in the LM. This sentence will also allow us, after grouping similar sentences together, to build the corresponding semantic template.
• The sentence is not recognized: two possibilities again:
  * the sentence is not included in the LM: it must be added to the LM, or not if it is out of context;
  * the sentence is included in the LM: it is a recognition engine problem.
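This decision tree translates directly into a small classification routine; a sketch, where the lm.contains interface is an assumption:

    def classify(sentence: str, recognized: bool, lm) -> str:
        """Sort a test sentence into the buckets used during the LM loops."""
        if recognized:
            return "keep"             # verify after each loop it stays in the LM
        if not lm.contains(sentence):
            return "lm-problem"       # add to the LM, or drop if out of context
        return "engine-problem"       # in the LM but not recognized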
1.4.4. 'No regression' requirement
After each loop, we re-inject the whole corpus of the previous steps (WAV format) into the system: all the sentences that were OK must remain OK. The evolution must be the one illustrated by figure 15.
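This 'no regression' requirement is essentially a regression test over the accumulated corpus. A minimal sketch, assuming a recognize function over WAV files and a table of expected transcriptions:

    def check_no_regression(corpus_wavs, expected, recognize) -> list:
        """Re-inject the whole corpus; every previously OK sentence must stay OK."""
        regressions = []
        for wav in corpus_wavs:
            if recognize(wav) != expected[wav]:   # was OK in a previous loop
                regressions.append(wav)
        return regressions                        # must be empty after each loop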
1.4.5. Language independence
The second step in our methodology consists in internationalization. Once the semantic templates are established, it is possible to build the LM for languages other than the one used in the initial step. Here again, we make some iterative loops (user tests and improvement of the LM) until good results are obtained. See figure 16.
1.4.6. HA Tool Software Architecture
The present problem is that all the work described previously is done by hand. The tool should help us to know, for each rejected sentence, whether it is an LM problem or a recognition problem. It should also help us to verify that a recognized sentence is still recognized after each loop on the LM. See figure 17.

Claims

1. Stand-alone device (HA) containing a voice recognition system and a mass storage device, characterized in that it comprises moreover a natural language processing (NLP), at least one semantic network (SEMNET) which defines a domain, and a database containing information with attributes, the natural language processing (NLP) receiving a recognized command of a user and sending a request to the semantic network (SEMNET), the semantic network being constituted as a graph defining the elements of the domain and the links between these elements, the semantic network searching the domain for the answer to the request.
2. Stand-alone device according to claim 1, characterized in that the semantic network (SEMNET) comprises criteria, which are the basic elements of the graph, and relations, which link the criteria; a criterion is an element of the database or at least one attribute of elements.
3. Stand-alone device according to claim 2, characterized in that the semantic network (SEMNET) defines a relation between an element and at least one attribute (Is_A).
4. Stand-alone device according to claim 2, characterized in that the semantic network (SEMNET) defines a relation between an attribute and another attribute (Is_AKindOf).
5. Stand-alone device according to claim 2, characterized in that the semantic network (SEMNET) defines a relation between two criteria indicating that they are equivalent (ROLE).
6. Stand-alone device according to claim 2, characterized in that a status is associated with a criterion, the status defining the behavior of the device or the semantic network when the request of the user reaches this criterion.
7. Stand-alone device according to claim 1, characterized in that it comprises a context manager (CM), the context manager storing at least the previous request and informing the semantic network about the context of these previous requests.
8. Stand-alone device according to one of the previous claims, characterized in that the semantic network detects bad requests and filters them.
9. Stand-alone device according to one of the previous claims, characterized in that it comprises a means for identifying the user speaking to the device.
PCT/EP2001/005945 2000-05-23 2001-05-23 Stand-alone device comprising a voice recognition system WO2001091110A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2001267478A AU2001267478A1 (en) 2000-05-23 2001-05-23 Stand-alone device comprising a voice recognition system

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP00401434.6 2000-05-23
EP00401434 2000-05-23

Publications (1)

Publication Number Publication Date
WO2001091110A1 (en) 2001-11-29

Family

ID=8173700

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2001/005945 WO2001091110A1 (en) 2000-05-23 2001-05-23 Stand-alone device comprising a voice recognition system

Country Status (2)

Country Link
AU (1) AU2001267478A1 (en)
WO (1) WO2001091110A1 (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1996018260A1 (en) * 1994-12-09 1996-06-13 Oxford Brookes University Computer apparatus with dialogue-based input system
EP0862159A1 (en) * 1997-03-01 1998-09-02 Agfa-Gevaert N.V. Voice recognition system for a medical x-ray apparatus

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
GORIN A L ET AL: "How may I help you?", SPEECH COMMUNICATION, ELSEVIER SCIENCE PUBLISHERS, AMSTERDAM, NL, vol. 23, no. 1-2, 1 October 1997 (1997-10-01), pages 113 - 127, XP004117213, ISSN: 0167-6393 *
KELLNER A ET AL: "PADIS - An automatic telephone switchboard and directory information system", SPEECH COMMUNICATION, ELSEVIER SCIENCE PUBLISHERS, AMSTERDAM, NL, vol. 23, no. 1-2, 1 October 1997 (1997-10-01), pages 95 - 111, XP004117212, ISSN: 0167-6393 *
MAST M ET AL: "A SPEECH UNDERSTANDING AND DIALOG SYSTEM WITH A HOMOGENEOUS LINGUISTIC KNOWLEDGE BASE", IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, IEEE INC. NEW YORK, US, vol. 16, no. 2, 1 February 1994 (1994-02-01), pages 179 - 193, XP000439829, ISSN: 0162-8828 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004044888A1 (en) * 2002-11-13 2004-05-27 Schoenebeck Bernd Voice processing system, method for allocating acoustic and/or written character strings to words or lexical entries
US8498859B2 (en) 2002-11-13 2013-07-30 Bernd Schönebeck Voice processing system, method for allocating acoustic and/or written character strings to words or lexical entries
US8315874B2 (en) 2005-12-30 2012-11-20 Microsoft Corporation Voice user interface authoring tool

Also Published As

Publication number Publication date
AU2001267478A1 (en) 2001-12-03


Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP