GB2379787A - Spoken interface systems
- Publication number
- GB2379787A (application GB0210891A)
- Authority
- GB
- United Kingdom
- Prior art keywords
- network
- updated
- reconfiguration
- speech recognition
- information
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/28—Constructional details of speech recognition systems
Abstract
A system for the speech control of a network of devices, 400, is disclosed in which a device can be added to or removed from the network or updated, the system containing speech recognition components, 600, that are capable of automatic reconfiguration in response to a linguistic server, 180, being informed by the network that a device in the network has been added, removed or updated, the reconfiguration comprising the addition, removal or update of speech recognition information pertinent to the device whose addition, removal or updating has initiated the reconfiguration.
Description
Spoken Interface Systems
This invention relates to spoken interface systems, and is particularly concerned with systems that use speech as an input and/or output medium through the use of speech recognition and speech generation systems. The invention has particular relevance to the control of apparatus such as washing appliances, heating, lighting, television and so forth in a home environment. However, the invention is also applicable to control interactions in other environments. The invention is directed at improving the ability of a system to accommodate new apparatus and to load the necessary speech recognition or generation information automatically, so as to provide a so-called "plug and play" system.
A plug and play architecture for home control devices is desirable for the successful implementation of spoken interfaces for home devices. A home device control speech recognition application that is initially compiled will try to incorporate as much grammar as is practical given the knowledge of home device applications at that moment. As new devices are developed, the grammars and lexicon of this speech recognition application will have to be reconfigured to enable the application to recognise and use the new features that are unique to these devices, their manufacturers and their locations. Plug and play reconfiguration has the following two important properties: the network of devices is dynamically reconfigurable as devices are brought online or disappear offline; and zero reconfiguration is required of the user.
Without a plug and play arrangement the speech recogniser would have to be reconfigured either by a statistical method or by rewriting and recompiling the grammars and lexicon in the application each time a new device is released onto the market. The statistical method involves the collation of a large representative corpus of possible user utterances, which is then used to train a statistical model. Such a corpus must generally be gathered through expensive experiments. Rewriting a grammar and lexicon for the complete home device application is not practical every time a new device is released onto the market. The problem becomes intractable if each house has its own speech recogniser tailored to the set of devices currently plugged into it.
Domestic users will not have the skills to update the application themselves, and they
will not want to pay an expert to update their application for them. Rather, the devices themselves must take responsibility for reconfiguring the system when they join it.
The basic idea of so-called 'plug and play' platforms for networks of devices that do not include a speech interface is known. These include Microsoft's Universal Plug and Play architecture and the Jini platform (as described in "Jini in a Nutshell" by Oaks and Wong, published by O'Reilly, 2000). However, there are particular problems involved in implementing speech recognition and generation interfaces. The present invention aims to address the requirement for reconfiguring speech interfaces for plug and playable networks of devices.
There are also extant descriptions of spoken dialogue interfaces in which the various component pieces of the spoken dialogue interface (e.g. the speech recogniser, the dialogue manager, the speech synthesizer) can themselves be unplugged and alternative components plugged in to replace them. For example, one commercial speech recognizer might be replaced by another without affecting the functionality of the system. The most prominent example of this is the Darpa Communicator architecture (described, for example, in "The role of the Darpa Communicator architecture as a human computer interface for distributed simulations" by Goldschen and Loehr, in Proceedings of the 1999 SISO Spring Simulation Interoperability Workshop, Orlando, Florida). The Darpa Communicator architecture is not designed to address the special requirements of a plug and play network of devices: what is plug and playable in the Darpa Communicator architecture is the set of spoken language processing components themselves.
There is also an extant literature on other aspects of the reconfiguration of spoken dialogue systems. A recently published review article on spoken dialogue interfaces by Zue and Glass called "Conversational Interfaces: Advances and Challenges" (in Proceedings of the IEEE, Special Issue on Spoken Language Processing) discusses both cross-domain and cross-language porting. Once one has built a cinema ticket booking service, for example, one may examine the effort required for booking train tickets, for e-shopping in general, or even for the "database access" scenario. These are examples of cross-domain porting. Similarly, once one has built a cinema ticket booking service in English, one may examine the effort required for building the same
service in French. This is an example of cross-language porting. There are also various toolkits, architectures and methodologies for rapidly and/or semi-expertly generating new instances of dialogue systems, for example by abstracting away from domain- or application-dependent features of particular systems (e.g. as described in "Vocalist: A robust portable spoken language dialogue system for telephone applications" by Fraser and Thornton in Proceedings of Eurospeech, Madrid, 1995; or as described in "Universal Dialogue Specification for Conversational Systems" by Kolzer in Proceedings of the IJCAI 1999 Workshop on Knowledge and Reasoning in Practical Dialogue Systems, Stockholm, Sweden), or 'bottom-up' by aggregation of useful re-configurable components (as described, for example, in "Universal Speech Tools: the CSLU Toolkit" by Sutton et al. in Proceedings of ICSLP 1998; or in "Information State and Dialogue Management in the TRINDI Dialogue Move Engine Toolkit", in Natural Language Engineering, volume 6, 2000). However, none of these disclosures describes automated within-domain plug and play reconfiguration.
It will be important to have a fast and relatively inexpensive route for incorporating new functionality or new devices robustly into the speech interface. Reconfigurability also includes the notion of user-reconfiguration in support of personalised and adaptable interfaces. At the moment, when people own comparatively few devices, they can just about tolerate learning a different interface for each new device. With speech, people will want to talk to all of their devices and the user interface will become a unified object. Facilitating a unified structure and appearance either through standards or through the possibility of reconfiguration will therefore be very important for speech-based service providers.
Preferably, a plug and play architecture should solve these inherent problems with run time systems using a standard architecture, and will therefore be a key feature in allowing home device applications to become technically and commercially viable.
Viewed from one aspect, the present specification discloses an invention which comprises a system for the speech control of a network of devices in which a device can be added to or removed from the network or updated, the system containing speech recognition components that are capable of automatic reconfiguration in
response to a linguistic server being informed by the network that a device in the network has been added, removed or updated, the reconfiguration comprising the addition, removal or update of speech recognition information pertinent to the device whose addition, removal or updating has initiated the reconfiguration.
Thus it may be seen by those skilled in the art that, in accordance with the present invention, a network of speech-controlled devices can exhibit plug and play functionality by virtue of each device effectively having responsibility for making the appropriate adjustments to the speech recognition components of the system when it is added, removed or updated.
It will further be appreciated that in contrast, for example, with the Darpa Communicator architecture mentioned above, rather than simply allowing the plug and play of spoken language processing components, at least preferred embodiments of the present invention allow the appropriate updating of those components when the network of devices to be controlled or queried is itself dynamically changing.
Preferably, the system also includes dialogue management components that are also capable of automatic reconfiguration in response to the linguistic server being informed by the network that a device in the network has been added, removed or updated, the reconfiguration comprising the addition, removal or update of dialogue management information pertinent to the device whose addition, removal or updating has initiated the reconfiguration.
Preferably, a device in the network has an associated device grammar in the form of data describing the grammatical rules and declarations which define a set of valid sentences that may be used to query, command and control the device, the format of the data comprising one or more of the novel features of the device grammar exemplified in the embodiments below.
Preferably, a device in the network has an associated device dialogue management specification in the form of data describing the dialogue management behaviour of the device, the format of the data comprising one or more of the novel features of the device dialogue management specification exemplified in the embodiments below.
The reconfiguration of the speech recognition components could consist just of one or more new lexical entries, such as for the name of the device. Preferably however, the reconfiguration also applies to the grammatical rules. In preferred embodiments, for example, the grammatical rules used throughout the system are arranged into nested modules of differing levels of generality. Thus in such embodiments the speech recognition components are divided into a plurality of modules applicable to differing numbers of said devices such that each device is associated with a set of linked modules of decreasing levels of generality.
Also preferred is that the speech recognition components comprise a unification grammar implementing a plurality of core grammar rules which are not updated or removed by said reconfiguration.
Preferably the speech recognition components further comprise other unification grammars implementing grammatical information relevant to an environment in which the network resides.
To give a specific example, a light may be associated with a module capable of recognising the utterance "light". However, it would also be associated with a more general module containing the grammatical rules applying to devices which may be turned on and off, and with another containing rules relating to devices which can be dimmed. Finally, the light would be associated with the core grammar module, which might, for example, recognise the word "and" and therefore divide an utterance into two commands.
Thus in preferred embodiments, some grammatical information is supplied by devices; other grammatical information is unchanging "core" grammatical knowledge about the structure of the language (e.g. that "and" joins two sentences, clauses, nouns or verbs, or that a prepositional phrase such as "in the bedroom" can attach to a noun or a verb); and further grammatical information concerns the environment of the network, e.g. that the network is in a house which has four rooms called "bedroom", "bathroom" and so on.
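By way of an illustrative sketch only (in Python, and not in the Unification Grammar format of Figures 2 to 4), this layered organisation can be pictured as a set of grammar fragments that are merged whenever the recogniser is reconfigured; every module name and rule string below is hypothetical.

# Hypothetical sketch of layered grammar modules; the rule strings and
# module names are illustrative only, not the format defined in Figures 2-4.

CORE_GRAMMAR = [
    "UTTERANCE -> COMMAND and COMMAND",   # "and" joins two commands
    "COMMAND -> VERB_PHRASE",
]

ENVIRONMENT_GRAMMAR = [
    "LOCATION -> bedroom | bathroom | kitchen | hall",  # rooms of this particular house
]

# Modules of decreasing generality that a dimmable light would link to.
SWITCHABLE_MODULE = ["VERB_PHRASE -> switch on NP | switch off NP"]
DIMMABLE_MODULE   = ["VERB_PHRASE -> dim NP"]
LIGHT_MODULE      = ["NOUN -> light"]

def grammar_for(device_modules):
    """Combine core, environment and device-specific modules into the
    rule set handed to the language-model compiler."""
    rules = list(CORE_GRAMMAR) + list(ENVIRONMENT_GRAMMAR)
    for module in device_modules:
        rules.extend(module)
    return rules

# A light links the switchable, dimmable and light-specific modules.
light_rules = grammar_for([SWITCHABLE_MODULE, DIMMABLE_MODULE, LIGHT_MODULE])

The point of the sketch is simply that the rules handed to the compiler are the union of the core module, the environment module and whichever device modules the currently registered devices link to.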
Preferably the system comprises a device in the network having an associated device grammar in the form of data describing the grammatical rules and declarations which,
either alone or in combination with the other grammatical rules and declarations made available to the system, define a set of valid sentences that may be used to query, command and control the device. Thus it will be seen that the grammatical information pertinent to a device may define a set of sentences. Equally, however, such information could simply comprise a single declaration to the effect that a device, e.g. "television", is a noun. That information, taken together with the "core" information and "environment" information, should define a set of valid sentences.
Automatic reconfiguration amounts to "plug and play" functionality. In its weakest form, "plug and play" may refer only to the ability to add a device to a network without requiring any manual configuration. The distribution of knowledge within the network is not included. Plug and play for personal computer peripherals, for example, simply automates the matching up of physical devices with device-specific software drivers in the PC. Communication links between them are established by reserving resources such as shared memory and interrupt request numbers. In a strong sense, plug and play can refer also to modular, distributed knowledge. Devices not only set up communication channels but publish information about themselves over the network. Other devices can obtain and use this information. For example, a new printer device can register its printing service (and the Java code for invoking methods in the device) on a network. A word-processing application can then find the service and configure itself to use it.
The strong-weak contrast is not a sharp division. For example, a word-processing application may already know an industry-agreed standard print interface to which the installed printer device conforms. It is therefore able to display a print button which happens to be greyed out in the absence of any networked printer. However, a new printer may also supply additional print options which the word-processing application knows nothing about. The printer may supply a "print colour" button itself.
The strong and weak senses of plug and play can apply also to spoken language dialogue interfaces. In the weakest sense, a dialogue system might be entirely preconfigured to deal with all possible devices and device-combinations. The required
knowledge is already present in the network but simply needs activating. Plug and play then consists of identifying which particular devices are indeed present in the network and establishing appropriate communication channels with them, so that a command to switch on the light actually ends up switching on the light. In the stronger sense, the components of the spoken language dialogue interface acquire the knowledge pertinent to particular devices from those devices. For example, the speech recogniser may not have the word "TV" in its vocabulary until a TV is plugged into the network. The dialogue manager may not be capable of uttering "That device is not dimmable" until a dimmable device is plugged into the network.
A strongly plug and play system may therefore be distinguishable from a weaker one by its behaviour in the absence of certain device-specific knowledge. If the relevant knowledge is present, one cannot be certain whether it was pre-configured or uploaded "on demand". Equally important, however, is the simple modularity enforced by plug and play. Since devices must declare the information required to update the dialogue components, a clear interface is provided for re-configuring the system for new types of device, as well as a clearer picture of the internal structure of those dialogue components. Indeed, with this perspective, it becomes a design choice whether device knowledge is in fact installed only when the device is. One may, for example, choose to optimise recognition performance on the set of devices actually installed by not loading information about other devices. Alternatively, one might prefer to recognise the names of devices not installed so that helpful error messages can be delivered.
A standard set of components for a spoken dialogue system might include the following: speech recognition, parsing, context independent semantic interpreter, context dependent semantic interpreter, domain dependent semantic interpreter, action executor, generation and speech synthesiser. Potentially, each component can be updated by information from a device in a plug and play domain. Furthermore, different instantiations of these components may require very different sorts of update. To take a very simple example, if recognition is carried out by a statistically trained language model, then updating this with information pertinent to a particular device will evidently be a significantly different task from updating a recogniser which uses a grammar-based language model.
In the home devices domain, there will be portions of a grammar that are particular to devices (e.g. that "light" is a noun describing a light, or that "dim" is a verb that applies to lights and sub-categorises for a prepositional phrase which supplies a dimming amount), portions that are particular to the house (e.g. that "bathroom" is a noun describing a house location), and portions that are entirely general (e.g. that utterances consist of commands, yes-no questions and wh-questions, that a command is a non-finite verb phrase, and that a non-finite verb phrase can be realised by a verb and a noun phrase).
In order to generate a modular grammar of this sort, a suitably expressive grammar formalism is required, for example unification grammar. Current speech recognisers however do not permit language models to be specified in this format. Therefore, a further process of language model generation is required in order to automate plug and play re-configuration.
For plug and play at the level of speech recognition, each device should contribute the knowledge necessary for recognising the language relevant to that device. There are several possible scenarios. At the simple end of the spectrum, the command vocabulary offered by the speech interface may just consist of a list of fixed phrases.
In this case, plug-and-play speech recognition becomes trivial: each device contributes the phrases it needs, and they are combined into a single grammar.
However, it is desired to be able to combine language appropriate to several different devices in the same utterance, e.g. "turn on the radio and the living room light" or "switch on the cooker and switch off the microwave".
Three approaches to deliver the plug and play concept may be possible.
In the first, the linguistic description of the device could be embedded at the source of manufacture. Specific code would be stored on the chip. This linguistic description would need to be written by a highly skilled computational linguist at the unification grammar level (a high-level description). The advantages of using this solution would be that the device's linguistic description is hard-coded, which would minimise the possibility of the device being incorrectly controlled, and also that it does not depend
on an external network to connect to a server if the client resides in the home. The disadvantages would be that this is a relatively costly way of distributing upgraded grammar code.
At the other extreme from the first solution is the possibility of having the linguistic description of any new devices distributed from a central server to all the local clients.
It could be a pre-condition that, before a device is released onto the commercial market, its linguistic description is submitted to a home device control authority for dissemination to the clients. This has obvious cost advantages but relies on a network connection, and there would also be the risk that devices may not have had their linguistic description sent to all the relevant clients.

The third possibility would be to enable the device to store an embedded URL address that points to the location of its linguistic description, for retrieval by the client in order to update its grammar. The advantages of this approach would be that it is a cost-effective way of updating grammars for devices and would offer the possibility of revising the linguistic description of a particular device at a later date. This could be used to release dormant device functionality at a later stage, say perhaps when a complementary device is released. The disadvantage would be that updating of grammars at the client side would be dependent on there being a constant and reliable network connection.
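A minimal sketch of the third approach follows, assuming a control point written in Python that fetches the linguistic description from the URL embedded in the device record; the field names and the URL are invented for illustration.

# Hypothetical sketch of the third approach: the device advertises only a URL
# pointing at its linguistic description, which the client (control point)
# fetches in order to update the recogniser.
from urllib.request import urlopen

def fetch_device_grammar(device_record):
    """Retrieve a device's grammar from the URL it embeds, if any."""
    url = device_record.get("grammar_url")
    if url is None:
        return None                              # no embedded pointer; fall back to another approach
    with urlopen(url, timeout=10) as response:   # requires a live, reliable network connection
        return response.read().decode("utf-8")

device = {"id": "LIGHT001",
          "grammar_url": "http://example.com/grammars/dimmable_light.gram"}  # hypothetical address
grammar_text = fetch_device_grammar(device)      # would then be passed to the linguistic server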
It will be appreciated by those skilled in the art that the foregoing description of the invention has been in terms of 'devices'. This term may apply to physical devices such as televisions, lights etc. for domestic applications as exemplified above.
Equally, however, a"device"may be an entity existing purely in software. Indeed, the principles set out above may also be applied to pieces of information, not associated with any device. Thus when viewed from a further aspect the invention provides a system for the update and maintenance of a speech interface to a computer based information system in which information in the information system can be added to or removed from the network or updated, the system containing speech recognition components that are capable of automatic reconfiguration in response to a linguistic server being informed by the network that a piece of information in the network has been added, removed or updated, the reconfiguration
comprising the addition, removal or update of speech recognition information pertinent to the information or device whose addition, removal or updating has initiated the reconfiguration.
The invention also extends to the computer software for such a system.
An example application of the above is the ability to talk to an Electronic
Programme Guide (EPG) supplied with a satellite television set-top box, which is updated with new programme information on a weekly basis. The EPG is not a command-and-control application, at least in that one does not command and control the objects that one talks about (the TV programmes). It is just an "information" system: one searches for information and asks questions.
Nevertheless the grammar for talking about programmes still needs updating in the same modular way. One will want to associate bits of grammar with particular television programmes and load them up and unload them depending on the current contents of the TV listings schedule.
Suppose one can ask different things about different programmes (just as one can do different things to different devices in the home network, because one cannot "dim" a television, only a light). For example, "Are the Chemical Brothers featured on Top of the Pops today?" should be allowed, but not "Are the Chemical Brothers featured on Top Car today?". In this example, one would also want to be able to say "here is a new object, Chemical Brothers" and update the EPG grammar without re-writing the whole grammar.
One solution would be to have a general EPG grammar and add "Chemical Brothers" dynamically: construct a grammar of the form "Are X on Y today", where X is any valid "appearer" and Y is any valid programme. Then one simply declares that "Chemical Brothers" is a new appearer and that Top of the Pops is a programme.
However, this could lead to poor recognition performance and lots of dialogues like "The Chemical Brothers are not featured on Top Car
today"... and no-one in their right mind would ever issue that question.
In accordance with the invention however, the grammar can enforce the constraint that Chemical Brothers goes with Top of the Pops (and other music programmes).
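As an illustrative sketch of this constraint (the data and function are hypothetical and not part of the EPG embodiment), the grammar can be thought of as licensing an "appearer" question only for programme types that feature guests:

# Illustrative sketch only: a toy check of the EPG constraint that an
# "appearer" question is only valid for programme types that feature guests.
PROGRAMME_TYPES = {
    "Top of the Pops": "music",
    "Top Car": "motoring",       # hypothetical classification of the programme named in the text
}

APPEARER_TYPES = {"music"}       # programme types that admit "Are X featured on Y today?" questions

def question_is_licensed(appearer, programme):
    return PROGRAMME_TYPES.get(programme) in APPEARER_TYPES

assert question_is_licensed("Chemical Brothers", "Top of the Pops") is True
assert question_is_licensed("Chemical Brothers", "Top Car") is False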
The preferred features of the earlier aspects of the invention also apply, where appropriate, to the aspect set out above.
The invention also extends to a computer software product which, when run on a standard data processing means, provides a system in accordance with the earlier aspects of the invention.
Embodiments of inventions disclosed in this specification will now be described by way of example only and with reference to the accompanying drawings, in which:

Figure 1 is a diagram of a spoken dialogue system;

Figure 2 is an example of a grammatical rule format;

Figure 3 is an example of a grammatical declaration format; and

Figure 4 is an extract from a sample device grammar.
Figure 1 represents a novel architecture for a natural language spoken dialogue system with the following properties. The dialogue system is designed for the query, command and control of a network of devices (400), each of which is queried, commanded and controlled through a set of interface functions that it provides both to the network (400) and to the dialogue system. The devices can be domestic devices such as a TV, washing machine, burglar alarm system, lighting controls and so forth. The dialogue system contains speech recognition (120) and dialogue management (130) components that are capable of reconfiguration at runtime without the need for
user intervention, which reconfiguration is initiated upon the Linguistic Server (180) being informed by the network of devices (400) of the addition, removal or update of a device in the network.
The reconfiguration of the speech recognition (120) and dialogue management (130) components consists in the addition, removal or update of information pertinent (in the sense to be defined below) to the device whose addition, removal or update in the device network (400) initiated the reconfiguration.
Figure 1 illustrates a device network (400) containing only two devices (410 and 430), but the network may contain any number of devices.
Each device (for example, 410) in the device network (400) is associated with a Device Grammar (411). A Device Grammar (411) is data describing the grammatical rules and declarations which define the set of valid sentences that may be used to query, command and control the associated device (410) through its interface functions. The rules and declarations also define the set of possible Semantic Values for each valid sentence. The data format for the grammatical rules is a Unification Grammar format, this being a standard data format for the description of grammatical rules used in research environments. The precise definition of the rule format for one embodiment is shown in Figure 2. The precise definition of the grammatical declarations for this embodiment is shown in Figure 3. An extract from an example of a device grammar for a particular device, a dimmable light switch, is shown in Figure 4.
The present invention concerns, amongst other things, the format of the Device Grammar (411). The contents of particular Device Grammars (411) which satisfy that format are not themselves part of the invention; they will generally be provided by users of the data format, e.g. a device manufacturer.
A Device Grammar (411) associated with a device (410) in a device network (400) is communicated to the Linguistic Server (180) component of the dialogue system when that device is plugged into the device network and registered therein. There are existing platforms capable of achieving such automatic registration of new devices in
a network including JINI promoted by Sun Microsystems and Universal Plug and Play promoted by Microsoft. In the present embodiment, this communication is achieved by having the device (410) broadcast a message on the network (400) in which there is another specialist device (440), called a Control Point, whose function it is, amongst other things, to receive such broadcast messages and register devices.
The Control Point will retrieve an electronic address in the form of a Universal Resource Locator from the device (410), this address being the address of a Device Grammar (411) associated with the device (410). The Control Point will retrieve the Device Grammar (411) from the address and transfer it to the Linguistic Server (180) using an Internet Protocol transmission.
The Device (410) also transmits a record containing four fields to the Linguistic Server (180) when it is plugged into the Device Network (400). The first field is an identifier that uniquely identifies it amongst all other possible devices on the Device Network (400). The second field is a location identifier, which uniquely identifies the location of the device. In the present embodiment, a location identifier could be one of at least the following symbols: kitchen, hall, living room, bathroom, bedroom. The third field is a Device Type Classifier. In the present embodiment, this is an atomic symbol and one of: dimmable, switchable and sensor. The fourth field represents the current state of the device. In the current embodiment, it is an integer from 0 to 100, where 0 indicates "off", 100 indicates "on" and all other numbers indicate a degree.
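A sketch of this four-field record as it might be modelled in Python follows; the field values are taken from the embodiment described above, while the Python types and class name are assumptions.

# Sketch of the four-field record a device transmits to the Linguistic Server
# on joining the network, following the description above.
from dataclasses import dataclass

@dataclass
class DeviceStateRecord:
    identifier: str        # unique among all devices on the network, e.g. "TV001"
    location: str          # e.g. "kitchen", "hall", "living room", "bathroom", "bedroom"
    device_type: str       # "dimmable", "switchable" or "sensor"
    status: int            # 0 = off, 100 = on, intermediate values indicate a degree

tv = DeviceStateRecord(identifier="TV001", location="living room",
                       device_type="switchable", status=0)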
The Linguistic Server (180) contains a Data Store of all Device Grammars that have been communicated to it. The Linguistic Server (180) also contains in its Data Store other grammars also defined in the same Unification Grammar format. These other grammars may include grammatical information relevant to an environment in which the device network (400) resides. In one preferred embodiment, the environment is a house in which a network of consumer devices resides. The consumer devices may include at least televisions, video recorders, lights, refrigerators and ovens. In other embodiments, the environment may be a software environment such as a computer operating system in which a number of computer software programs reside. In other embodiments, the environment may be a software application such as a media player for playing audio outputs in which a number of audio objects reside. These other
grammars may also include other grammatical information not specific either to the devices (410) or to their environment.
The Linguistic Server (180) contains a WW (Who is Where) Data Store of all unique device identifiers, their locations, their Device Type Classifiers and the initial status that have been communicated to it. This Data Store is updated and maintained in precisely the same manner as the Device Grammar Data Store (described below).
When the Linguistic Server (180) receives a transmission of a Device Grammar (411), it updates the Data Store of Device Grammars by adding in newly received Device Grammars. The Linguistic Server (180) may also maintain a history log of Device Grammars for devices that have been removed from the device network. When the Linguistic Server (180) receives a Device Grammar (411), it generates a list of all Device grammars valid for the device network (400). A Device Grammar (411) is valid if the associated device (410) is currently registered in the device network (400).
A Device Grammar (411) may also be valid if the associated device (410) has been registered in the device network (400) at some time in the past and a system flag within the Linguistic Server (180) is set to indicate that previously registered devices are to be deemed valid. The system flag can be set by the User through a Linguistic Server Interface. The Linguistic Server (180) communicates all the valid Device Grammars and all other grammars in the data repository to the Linguistic Compiler (190) component.
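The registration and validity behaviour described in the preceding two paragraphs might be sketched as follows; the class and method names are hypothetical.

# Sketch of the Device Grammar Data Store and validity logic described above.
class LinguisticServer:
    def __init__(self, include_previously_registered=False):
        self.grammar_store = {}      # device id -> Device Grammar text
        self.registered = set()      # ids of devices currently on the network
        self.history = set()         # ids of devices seen at some time in the past
        self.include_previously_registered = include_previously_registered  # the system flag

    def register_device(self, device_id, device_grammar):
        self.grammar_store[device_id] = device_grammar
        self.registered.add(device_id)
        self.history.add(device_id)

    def unregister_device(self, device_id):
        self.registered.discard(device_id)   # grammar kept as a history log entry

    def valid_grammars(self):
        """Grammars for currently registered devices, plus previously
        registered ones if the system flag is set."""
        valid_ids = set(self.registered)
        if self.include_previously_registered:
            valid_ids |= self.history
        return [self.grammar_store[i] for i in sorted(valid_ids)]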
The Linguistic Compiler (190) translates the set of grammars it receives into a processed vendor specific grammar data format suitable for a vendor specific automatic speech recognition system. The translation includes a first stage of translation into a grammar specification format defined by the vendor. There may also be a second stage of translating the vendor grammar specification format into another data format defined by the vendor in order to configure the vendor's runtime speech recognizer. This second translation step may be undertaken by software supplied by the vendor specifically for this purpose. If the vendor's speech recognition system can be configured directly by input data in the vendor's grammar specification format then this second stage of translation is not necessary. In the preferred embodiment, the first stage of translation is into Grammar Specification Language, the specified data
format for user grammars which can serve as input to the Nuance (Trade Mark) nuance-compile software program. The nuance-compile program is a member of the Nuance speech recognition software suite. The second stage of processing in the preferred embodiment includes execution of the nuance-compile program upon the Grammar Specification Language formatted data which generates a recognition package, this being a software input to the runtime Nuance speech recognition software (120).
The first translation stage into the vendor-specific grammar specification format is undertaken by a software component designed for translating Unification Grammars into a semantically annotated Context Free Grammar formalism. The notion of a Context Free Grammar formalism is well established in mathematical linguistics. The central feature is a set of rewrite rules of the form A -> B, C, ..., D where all the symbols A, B, C, ..., D are atomic symbols, some of which are Terminal and the others of which are NonTerminal. The symbol A on the left-hand side of the rule may be called the mother of the rule and the symbols on the right-hand side are daughters.
Any symbol that appears as a mother in a rule is a NonTerminal. All other symbols are Terminals. A rule A -> B, C, ..., D may be read as "A rewrites to the sequence of symbols B, C, ..., D". One symbol is denoted the TOP symbol. By rewriting the TOP symbol and then iteratively rewriting any NonTerminal symbols that one writes, one ends up with a sequence of Terminal symbols representing a string of words. The set of possible strings that can result from such rewriting is called the language generated by the grammar. The notion of a semantically annotated Context Free Grammar is also standard in mathematical linguistics. In this case, the symbols are annotated with semantic values. There are a variety of possible ways to specify precisely the format of the semantic annotation. In the preferred embodiment, the semantic annotation format is that of Nuance's Grammar Specification Language. The algorithmic detail of the translation is not within the scope of the current invention. The algorithm has the property that the input and output grammars generate the same language and assign the same semantic value(s) to each string in the language.
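As a toy illustration of the Context Free Grammar notions used above (rewrite rules, a TOP symbol, and the language generated by iterative rewriting), the following sketch enumerates the strings generated by a small invented grammar; it is neither the Grammar Specification Language format nor the translation algorithm itself.

# Toy illustration of rewrite rules, a TOP symbol, and the language generated
# by iterative rewriting. The grammar is invented for illustration.
from itertools import product

RULES = {                          # NonTerminal -> list of alternative right-hand sides
    "TOP": [["COMMAND"]],
    "COMMAND": [["switch", "on", "NP"], ["switch", "off", "NP"]],
    "NP": [["the", "NOUN"]],
    "NOUN": [["light"], ["heater"]],
}

def expand(symbol):
    """Return every terminal string derivable from `symbol`."""
    if symbol not in RULES:        # terminals rewrite to themselves
        return [[symbol]]
    strings = []
    for rhs in RULES[symbol]:
        for parts in product(*(expand(s) for s in rhs)):
            strings.append([w for part in parts for w in part])
    return strings

language = [" ".join(s) for s in expand("TOP")]
# e.g. "switch on the light", "switch off the heater", ...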
The output of the Linguistic Compiler (190) is input to the runtime speech recognizer (120) and thereby reconfigures it. In the present embodiment, this is performed by stopping one of the Nuance runtime processes, called 'recserver', and then starting a
new instance with the processed grammar data format as input. In other embodiments, the vendor specific speech recognition software may supply alternative means for reconfiguration.
Turning now to the dialogue management interface, the output of the speech recognizer (120) upon a given spoken user utterance is a Semantic Value. A Semantic Value is a list of Key Features and their Values. A Key Feature can be any atomic symbol. In one embodiment, there are seven Key Features: device, location, on-off-level, on-off-direction, on-off-change, operation and spec. The types of value that these Key Features can take are listed in the table below.

Table 1: Key Features and Value Types for Semantic Values

Key Feature | Type of Value
device | Any atomic value
location | Any atomic value
on-off-level | An integer between 0 and 100
on-off-direction | Either "on" or "off"
on-off-change | An integer between 0 and 100
operation | Either "query" or "command"
spec | One of: "all", "any", "a", "the"

Each device (for example, 410) in the device network (400) is associated with a Device Dialogue Management Specification (412), hereinafter referred to as a DDMS.
A DDMS (412) is data describing the dialogue management behaviour of the device (410). The data format for the DDMS is a set of records containing four fields.
The first two fields contain symbols that specify a classification for the device. The first field specifies a Semantic Value Classifier. The second field specifies a Device Type Classifier. The third field specifies a Behaviour Key and the fourth field specifies a Behaviour Value.
A Semantic Value Classifier can be any atomic value. The value of the Semantic Value Classifier for a particular device (410) is intended to be the same as the value for the device Key Feature specified by the Device Grammar (411). In one
embodiment, the possible Semantic Value Classifiers include at least the following: "light", "heater", "temperature".
A Device Type Classifier can be any atomic value. In one embodiment, the possible Device Type Classifiers include at least the following: "switchable", "dimmable" and "sensor".
A Behaviour Key can be any atomic value. In the preferred embodiment, the possible Behaviour Keys are listed in the following table.

Table 2: Behaviour Keys in the Preferred Embodiment

GENDER
GENDEVICE
GENDEVICEPL
GENDEVICEDEF
GENADJ
GENADJALRDY
GENADJNOW
GENCOMPON
GENCOMPONPL
GENCOMPOFF
GENCOMPOFFPL
GENNEEDON
GENNEEDOFF
GENONPAST
GENOFFPAST
GENON
GENOFF
REFERRABLE
SCALAR
NODEVICE
NODEVICEINLOC
SEVERALDEVICES

A Behaviour Value is a variable-length list of data objects. The variable-length list of data objects has two sub-types. In one sub-type, the field contains a generation template consisting of a list of objects, each being either a string of words or a reserved symbol called a generation symbol. In the second sub-type, the field contains the value TRUE or FALSE. In one embodiment, a generation symbol is one of the following: STATUS, COMMON, NEUTER, GENONOFF.
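A sketch of what a set of DDMS records for a switchable television might look like, using the four-field record format just described, is given below; the particular behaviour values are illustrative and are chosen to match the generation example discussed later.

# Illustrative DDMS records for a switchable television; the values are assumptions.
DDMS_TV = [
    # (Semantic Value Classifier, Device Type Classifier, Behaviour Key, Behaviour Value)
    ("TV", "SWITCHABLE", "GENDEVICE",    ["a television"]),
    ("TV", "SWITCHABLE", "GENDEVICEDEF", ["the television"]),
    ("TV", "SWITCHABLE", "GENADJNOW",    ["is now", "GENONOFF"]),  # generation template sub-type
    ("TV", "SWITCHABLE", "GENON",        ["on"]),
    ("TV", "SWITCHABLE", "GENOFF",       ["off"]),
    ("TV", "SWITCHABLE", "REFERRABLE",   [True]),                  # TRUE/FALSE sub-type
    ("TV", "SWITCHABLE", "SCALAR",       [False]),
]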
A DDMS (412) associated with a device (410) in a device network (400) is communicated to the Linguistic Server (180) component of the dialogue system when that device is plugged into the device network and registered therein. A preferred embodiment for this communication is of precisely the same form as that for communicating a Device Grammar (411) to the Linguistic Server (180), which process has already been described above.
The Linguistic Server (180) contains a Data Store of all DDMS's that have been communicated to it. The Data Store and a list of valid DDMS's is maintained in a precisely similar manner as that for Device Grammars (411), which process has already been described above. The Data Store and list of valid DDMS's is updated whenever a Device DDMS is communicated to the Linguistic Server (180).
Immediately afterwards, the Linguistic Server (180) communicates all the valid DDMS's to the Dialogue Manager (130), which replaces the currently stored valid DDMS's in its internal memory with the newly valid DDMS's.
Whenever the Linguistic Server (180) communicates a set of valid DDMS's to the Dialogue Manager (130), it also communicates to the Dialogue Manager (130) the set of valid unique device identifiers, their locations, their Device Type Classifiers and their initial status from the WW Data Store. Upon receipt, the Dialogue Manager (130) updates its internal memory World State Table by the following algorithm:
First, for each unique identifier communicated by the Linguistic Server (180), if it is not present in the current World State Table, then add a new entry for that identifier, its location, its Device Type Classifier and its initial status into the World State Table; if the identifier is already present in the current World State Table, then do nothing.
Secondly, for each identifier in the current World State Table which is not in the list of identifiers communicated by the Linguistic Server, delete the entry in the table for that identifier.
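This two-step update might be sketched directly as follows, with the World State Table modelled as a Python dictionary keyed by the unique device identifier; the representation of the table entries is an assumption.

# Direct sketch of the two-step World State Table update algorithm above.
def update_world_state(world_state, communicated):
    """world_state:  {identifier: (location, device_type, status)}
       communicated: the same mapping, as sent by the Linguistic Server."""
    # Step 1: add entries for identifiers not yet in the table; leave existing ones alone.
    for identifier, entry in communicated.items():
        if identifier not in world_state:
            world_state[identifier] = entry
    # Step 2: delete entries whose identifier was not communicated.
    for identifier in list(world_state):
        if identifier not in communicated:
            del world_state[identifier]
    return world_state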
Thus, there has been described reconfiguration of the runtime dialogue manager (130) by information that is pertinent to a device (410) when the network (400) that includes the device (410) is updated.
Reference will now be made to the behaviour of the reconfigured utterance processing subsystem (600). System behaviour is initiated by the user uttering something (Speech Input) into a microphone, telephone or other similar audio input device which is connected to the speech recognizer (120). The speech recognizer (120) is, as has been explained above, configured according to the Device Grammars (411) for each device present in the Device Network (400) and according to any other general grammars stored in the Data Store (200) attached to the Linguistic Server (180). Consequently, the speech recognizer will process the Speech Input using only these sources of grammatical knowledge. Consequently, and by way of example only, if no television is plugged into the Device Network and no other general grammar or Device Grammar contains grammatical information about the word "T.V.", then the speech recognizer will not recognise a user utterance of "Switch on the T.V.". One advantage of this invention is that, if the user owns a Compact Disc player and its Device Grammar contains grammatical information about the word "C.D.", then a user utterance of "Switch on the C.D." will not be confusable with "Switch on the T.V.". In the absence of this invention, a speech recognizer would need to know the names of all the many possible devices and would be very likely to confuse them, especially similar-sounding names such as "T.V." and "C.D.".
Once a television is plugged into the Device Network (400) then, so long as the Device Grammar for the television includes grammatical information concerning the word "T.V.", such an utterance will be capable of being recognised correctly.
Therefore, the linguistic capability of the speech recognizer is capable of automatic and appropriate extension according to what is required to be recognised. It is not necessary to pre-configure the speech recognizer for every possible Device Network (400) configuration.
By way of example, when a television is plugged in, a Device StateLine is passed to the Linguistic Server (180) and thence to the Dialogue Manager (130), which results in an entry in the World State Table (150) recording the fact that there is a device whose unique identifier is TV001, whose location is LIVINGROOM, whose Device Type Classification is SWITCHABLE and whose initial status is 0. By way of example, the speech recognizer will generate a Semantic Value for the utterance "Switch on the T.V." of the following form:
device = TV
operation = COMMAND
spec = THE
on-off-level = 100
This Semantic Value is processed by the Dialogue Manager (130) in three stages: contextual resolution, response and generation. In Contextual Resolution, the Semantic Value is matched against information in the World State Table (150) with the objective of filling in a set of named SLOTS with values. The SLOT names are the same as the Key Features for Semantic Values, except that there are two additional names: resolution and identifier. The precise details of the Contextual Resolution algorithm are not significant to the current invention. For the purpose of the current example, we note that there is only one TV in the World State Table; therefore the Resolution algorithm fills the location slot with LIVINGROOM (the location of the only known television), the identifier slot with TV001 (the identifier of the only known television), and the resolution slot with "OK", thereby indicating that this processing stage completed successfully. All other slots are filled with their corresponding value in the input Semantic Value (unless the input Semantic Value assigned no value to the corresponding Key Feature, in which case the SLOT name will not be assigned any value). The resulting SLOT structure is therefore:
device = TV
location = LIVINGROOM
operation = COMMAND
spec = THE
on-off-level = 100
identifier = TV001
resolution = OK
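An illustrative sketch of this resolution step follows. For the sketch, each World State Table entry is assumed to record the device Key Feature alongside its location, type and status; as noted above, the precise resolution algorithm is not significant to the invention.

# Illustrative sketch of Contextual Resolution: fill the identifier and
# location slots from the World State Table when the utterance leaves them open.
def resolve(semantic_value, world_state):
    slots = dict(semantic_value)                 # start from the input Key Features
    matches = [(ident, entry) for ident, entry in world_state.items()
               if entry["device"] == semantic_value.get("device")]
    if len(matches) == 1:
        ident, entry = matches[0]
        slots.setdefault("location", entry["location"])
        slots["identifier"] = ident
        slots["resolution"] = "OK"
    else:
        slots["resolution"] = "FAILED"           # zero or several candidate devices
    return slots

world_state = {"TV001": {"device": "TV", "location": "LIVINGROOM",
                         "deviceclass": "SWITCHABLE", "status": 0}}
slots = resolve({"device": "TV", "operation": "COMMAND", "spec": "THE",
                 "on-off-level": 100}, world_state)
# -> identifier TV001, location LIVINGROOM, resolution OK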
The second stage of Dialogue Management (130) processing is response. In this stage, queries and commands are sent to particular devices in accordance with the contents of the current SLOTS. By way of example, in one embodiment a command to switch on is sent to device TV001. It is expected that the vendor-supplied interface to the television which implements the switching command will return a status marker indicating whether or not the command succeeded. Depending on the type of query or command sent to the device and its status, the Dialogue Manager fills in a Message Structure. A Message Structure contains six fields: quant, predicate, device, deviceclass, location and status. The quant field can contain one of the following symbols: NOTEXISTS, EXISTS, THE. The predicate field can contain one of the following symbols:

GENADJ
GENADJNOW
GENADJALRDY
NOTUNDERSTANDABLE
TRYAGAIN
SENSORNOTACCESSIBLE
LOCATED
NOTSCALABLE
ALTERABLE
NODEVICESOFSTATEDTYPE
NODEVICEOFSTATEDTYPEINSTATEDLOCATION
SEVERALDEVICESOFSTATEDTYPE
The device field contains a device Key Feature value, as explained above, which in one embodiment may be one of at least LIGHT, HEATER, TV. The deviceclass field will contain a Device Type Classifier, in the sense explained above. In this embodiment, this will be one of SWITCHABLE, DIMMABLE and SENSOR. The location field contains a location identifier. In the present embodiment this will be one of at least LIVINGROOM, KITCHEN, BEDROOM. The status field will contain a number between 0 and 100.
In the current example, the message generated by switching on TV001 will therefore be:
quant = THE
predicate = GENADJNOW
device = TV
deviceclass = SWITCHABLE
location = LIVINGROOM
status = 100
In the third stage of Dialogue Management (130), the message is used as input to a language generation process. There are at least two possible outputs, depending on a system configuration choice. In one embodiment, the output can be a simple string of words which can be passed to a text-to-speech synthesiser such as the Nuance Vocalizer (Trade Mark) system. In another embodiment, the output is a list of names of recorded audio files which, when played in sequence, results in a string of words being uttered through an audio output device. The language generation process is a template-filling process. A template is chosen depending on particular values in the Message Structure. This template is a sequence of items. Some of the items are words to be uttered to the user. Some of the items are reserved symbols indicating that the procedure should look up the value of a behaviour key in the DDMS List (140) in the Dialogue Manager (130). Some of the items may be further templates to fill in.
The main choice of template to be filled in depends on the value of the predicate in the Message Structure. For example, if the value of predicate is NOTUNDERSTANDABLE, then the template chosen is one which needs nothing further, and the string of words "Sorry, I didn't understand what you said" is generated.
If audio file names are used, the name of an audio file containing a recording of someone saying "Sorry, I didn't understand what you said" could be generated. In the current example, the predicate is GENADJNOW, which results in a template being chosen which itself contains two (sub)templates. The first sub-template is filled in according to the value of quant in the Message Structure. If the value of quant is NOTEXISTS then the string of words "There is no" is generated, followed by the value of the GENDEVICE behaviour key in the DDMS List (140). In the current example, the value of quant is THE, which results in a lookup of the GENDEVICEDEF behaviour key in the DDMS List (140) for the current device and deviceclass (TV and SWITCHABLE, respectively). The relevant behaviour value is "the television". The second sub-template results in a lookup of the GENADJNOW behaviour key in the DDMS List (140) for the current device and deviceclass, and this results in a behaviour value which is another template. This template contains the string of words "is now" and the generation symbol GENONOFF. GENONOFF is a generation symbol which is interpreted by the Dialogue Manager as "find whether the Message Structure status is On or Off and look up the behaviour value for GENON or GENOFF accordingly". The television is now on, therefore the Dialogue Manager looks up the value of GENON and finds the behaviour value "on". Therefore the string of words "The television is now on" is generated. In an alternative embodiment which uses audio file names, a list of files is generated: the first contains a recording of "the television", the second contains a recording of "is now", and the third contains a recording of "on".
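A minimal sketch of this template-filling generation step is given below; the lookup and template handling are simplifications of the behaviour just described, and the function names are hypothetical. Using the DDMS records sketched after Table 2 and the example Message Structure above, it yields "the television is now on".

# Minimal sketch of the template-filling generation step; a simplification,
# not the full embodiment.
def lookup(ddms, device, deviceclass, key):
    for (dev, cls, k, value) in ddms:
        if (dev, cls, k) == (device, deviceclass, key):
            return value
    return []

def generate(message, ddms):
    device, cls = message["device"], message["deviceclass"]
    words = []
    # First sub-template: the device description, chosen on the quant field.
    if message["quant"] == "NOTEXISTS":
        words += ["There is no"] + lookup(ddms, device, cls, "GENDEVICE")
    else:                                        # THE
        words += lookup(ddms, device, cls, "GENDEVICEDEF")
    # Second sub-template: the GENADJNOW behaviour value, itself a template.
    for item in lookup(ddms, device, cls, "GENADJNOW"):
        if item == "GENONOFF":                   # generation symbol
            key = "GENON" if message["status"] > 0 else "GENOFF"
            words += lookup(ddms, device, cls, key)
        else:
            words.append(item)
    return " ".join(words)

message = {"quant": "THE", "predicate": "GENADJNOW", "device": "TV",
           "deviceclass": "SWITCHABLE", "location": "LIVINGROOM", "status": 100}
# generate(message, DDMS_TV) -> "the television is now on"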
We now describe in more detail how we have addressed the issues that arise when we attempt to apply the strong Plug and Play scenario to the tasks of speech recognition and language processing. Each device provides the knowledge that the speech interface needs in order to recognise the new types of utterance relevant to the device in question, and convert these utterances into well-formed semantic representations.
Modern speech interfaces supporting complex commands are typically specified using a rule-based grammar formalism defined by a platform like Nuance or SpeechWorks.
The type of grammar supported is some subset of full Context Free Grammar (CFG), extended to include semantic annotations. Grammar rules define the language model that constrains the recognition process, tuning it to the domain in order to achieve high performance. (They also supply the semantic rules that define the output
representation; we will return to this point later). If we want to implement an ambitious Plug and Play speech recognition module within this kind of framework, we have two top-level goals. On the one hand, we want to achieve high-quality speech recognition. At the same time, standard software engineering considerations suggest that we want to minimize the overlap between the rule-sets contributed by each device: ideally, the device will only upload the specific lexical items relevant to it. It turns out that our software engineering objectives conflict to some extent with our initial goal of achieving high-quality speech recognition. Consider a straightforward solution, in which the grammatical information contributed by each device consists purely of lexical entries, i.e. entries of the form

<Nonterminal> -> <Terminal>

In a CFG-based framework, this implies that we have a central device-independent CFG grammar, which defines the other rules which link together the non-terminals that appear on the left-hand sides of the lexical rules. The crucial question is what these lexical non-terminal symbols will be. Suppose, for concreteness, that we want our set of devices to include lights with dimmer switches, which will among other things accept commands like "dim the light". We might achieve this by making the device upload lexical rules of the rough form
TRANSITIVEVERB -> dim
NOUN -> light

where the LHSs are conventional grammatical categories. (We will for the moment skip over the question of how to represent semantics.) The lexical rules might combine with general grammar rules of the form

COMMAND -> TRANSITIVEVERB NP
NP -> DET NOUN
DET -> the

This kind of solution is easy to understand, but experience shows that it leads to poor speech recognition. The problem is that the language model produced by the grammar
is underconstrained: it will in particular allow any transitive verb to combine with any noun phrase (NP). However, a verb like "dim" will only combine with a restricted range of possible NPs, and ideally we would like to capture this fact. What we really want to do is parameterise the language model. In the present case, we want to parameterise the TRANSITIVE_VERB "dim" with the information that it only combines with object NPs that can be used to refer to dimmable devices. We will parameterise the NP and NOUN non-terminals similarly. The obvious way to do this within the bounds of CFG is to specialise the rules approximately as follows:

COMMAND -> TRANSITIVE_DIM_VERB DIMMABLE_NP
DIMMABLE_NP -> DET DIMMABLE_NOUN
TRANSITIVE_DIM_VERB -> dim
DIMMABLE_NOUN -> light
DET -> the
Unfortunately, however, this defeats the original object of the exercise, since the "general" rules now make reference to the device-specific concept of dimming. What we want instead is a more generic treatment, like the following:

COMMAND -> TRANSITIVE_VERB:[sem_obj_type=T] NP:[sem_type=T]
NP:[sem_type=T] -> DET NOUN:[sem_type=T]
DET -> the
TRANSITIVE_VERB:[sem_obj_type=dimmable] -> dim
NOUN:[sem_type=dimmable] -> light

This kind of parameterisation of a CFG is not in any way new: it is simply unification grammar (one example of which is described in "Generalized Phrase Structure Grammar" by Gazdar, Klein, Pullum and Sag, Harvard University Press, 1985). Thus our first main idea is to raise the level of abstraction, formulating the device grammar at the level of unification grammars, and compiling these down into the underlying CFG representation. There are now a number of systems which can perform this type of compilation (see for example "Using natural language knowledge sources in speech recognition" by Moore, Proceedings of the NATO Advanced Studies Institute, 1998; also "A context-free approximation of head-driven phrase structure grammar" by Kiefer and Krieger, Proceedings of the 6th Int. Workshop on Parsing Technologies, Trento, Italy, 2000). The basic methods for the compilation that we use in our system are described in detail in "A baseline method for compiling typed unification grammars into context free language models" by Rayner, Dowding and Hockey, Proceedings of Eurospeech, Aalborg, 2001. Here, we will focus on the aspects directly relevant to the "distributed" unification grammars needed for Plug and Play.
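As a much simplified sketch of the general idea behind such compilation (not the method of the cited papers), a typed feature whose domain is finite can be compiled away by enumerating its values and specialising the non-terminal names; the rule representation below is an assumption for illustration:

```python
# Simplified sketch: expand a unification-style rule into plain CFG rules by
# enumerating the finite domain of each feature variable. The data structures
# are illustrative assumptions, not the system's actual grammar format.
from itertools import product

FEATURE_DOMAINS = {"T": ["dimmable", "switchable"]}  # example sortal types

# COMMAND -> TRANSITIVE_VERB:[sem_obj_type=T] NP:[sem_type=T]
RULE = ("COMMAND", [("TRANSITIVE_VERB", {"sem_obj_type": "T"}),
                    ("NP", {"sem_type": "T"})])

def instantiate(rule, domains):
    """Yield one specialised CFG rule per combination of feature values."""
    lhs, rhs = rule
    variables = sorted({v for _, feats in rhs for v in feats.values() if v in domains})
    for values in product(*(domains[v] for v in variables)):
        binding = dict(zip(variables, values))
        rhs_syms = ["_".join([cat] + [binding.get(v, v) for v in feats.values()])
                    for cat, feats in rhs]
        yield "{} -> {}".format(lhs, " ".join(rhs_syms))

for cfg_rule in instantiate(RULE, FEATURE_DOMAINS):
    print(cfg_rule)
# COMMAND -> TRANSITIVE_VERB_dimmable NP_dimmable
# COMMAND -> TRANSITIVE_VERB_switchable NP_switchable
```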
We start with a general device-independent unification grammar, which implements the core grammar rules. In our current English language prototype, there are 34 core rules. Typical examples are the NP conjunction and prepositional phrase (PP) modification rules, schematically

NP -> NP CONJ NP
NP -> NP PP

These rules are likely to occur in connection with any kind of device. They are parameterised by various features. For example, the set of features associated with the NP category includes grammatical number (singular or plural), WH (plus or minus) and sortal type (multiple options). Each individual type of device can extend the core grammar in one of three possible ways:
1. New lexical entries. A device may add lexical entries for device-specific words and phrases; e.g., a device will generally contribute at least one noun used to refer to it.
2. New grammar rules. A device may add device-specific rules; e.g., a dimmer switch may include rules for dimming and brightening, like "another X percent" or "a bit brighter".
3. New feature values. Least obviously, a device may extend the range of values that a grammatical feature can take (see further below and the sketch after this list).
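To make these three extension mechanisms concrete, the sketch below shows one hypothetical way a device's grammar contribution might be packaged; the field names and contents are illustrative assumptions, not the actual upload format used by the system:

```python
# Hypothetical packaging of a device's grammar contribution, covering the
# three extension mechanisms above. Field names and contents are illustrative.
DIMMER_LIGHT_MODULE = {
    # 1. New lexical entries (device-specific words and phrases).
    "lexicon": [
        ("NOUN:[sem_type=dimmable]", "light"),
        ("TRANSITIVE_VERB:[sem_obj_type=dimmable]", "dim"),
    ],
    # 2. New grammar rules (device-specific constructions).
    "rules": [
        "DIM_PHRASE -> another NUMBER percent",
        "DIM_PHRASE -> a bit brighter",
    ],
    # 3. New feature values (extending a feature's domain, with inheritance).
    "feature_domains": {"sem_np_type": ["dimmable_device"]},
    "specialises": {"dimmable_device": "device"},
}
```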
For the usual software engineering reasons, we find it convenient to divide the distributed grammar into modules; the grammatical knowledge associated with a device may reside in more than one module. The grammar in our current demonstrator contains 21 modules, including the "core" grammar described above. Each device typically requires between two and five modules. For example, an on/off light switch loads three modules: the core grammar, the general grammar for on/off switchable devices, and the grammar specifically for on/off switchable lights. The core grammar, as already explained, consists of linguistically oriented device-independent grammar rules. The module for on/off switchable devices contains grammar rules specific to on/off switchable behaviour, which in general make use of the framework established by the general grammar. For example, there are rules of the schematic form
QUESTION is NP:[sem_type=device] ONOFF_PHRASE
PARTICLE_VERB:[particle_type=onoff] -> switch

Finally, the module for on/off switchable lights is very small, and just consists of a handful of lexical entries for nouns like "light", defining these as nouns referring to on/off switchable devices. The way in which nouns of this kind can combine is however defined entirely by the on/off switchable device grammar and core grammar.
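Continuing the illustrative packaging above, a chain of modules for one device (here the on/off light) might be merged into a single grammar roughly as follows; the merge function and module contents are assumptions for illustration:

```python
# Hypothetical merge of a device's module chain (core -> switchable -> light)
# into one grammar. The module layout mirrors the earlier sketch; all names
# are illustrative assumptions.
def merge_modules(modules):
    grammar = {"lexicon": [], "rules": [], "feature_domains": {}, "specialises": {}}
    for m in modules:
        grammar["lexicon"] += m.get("lexicon", [])
        grammar["rules"] += m.get("rules", [])
        for feat, values in m.get("feature_domains", {}).items():
            grammar["feature_domains"].setdefault(feat, []).extend(values)
        grammar["specialises"].update(m.get("specialises", {}))
    return grammar

CORE = {"rules": ["NP -> NP CONJ NP", "NP -> NP PP"],
        "feature_domains": {"sem_np_type": ["location", "device"]}}
SWITCHABLE = {"rules": ["QUESTION is NP:[sem_type=device] ONOFF_PHRASE"],
              "feature_domains": {"sem_np_type": ["switchable_device"]},
              "specialises": {"switchable_device": "device"}}
LIGHT = {"lexicon": [("NOUN:[sem_type=switchable_device]", "light")]}

onoff_light_grammar = merge_modules([CORE, SWITCHABLE, LIGHT])
```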
The pattern here turns out to be the usual one: the grammar appropriate to a device is composed of a chain of modules, each one depending on the previous link in the chain and in some way specialising it. Structurally, this is similar to the organisation of a piece of normal object-oriented software, and we have been interested to discover that many of the standard concepts of object-oriented programming carry over naturally to distributed unification grammars. In the remainder of the section, we will expand on this analogy. If we think in terms of Java or a similar mainstream OO language, a major grammatical constituent like S, NP or PP has many of the properties of a method in an OO interface. Grammar rules in one module can make reference to these constituents, letting rules in other modules implement their definition. For example, the temperature sensor grammar module contains a small number of highly specialised rules
QUESTION what is the temperature PP:[pp_type=location]
QUESTION how many degrees is it PP:[pp_type=location]

The point to note here is that the temperature sensor grammar module does not define the locative PP construction; this is handled elsewhere, currently in the core grammar module. The upshot is that the temperature sensor module is able to define its constructions without worrying about the exact nature of the locative PP construction.
As a result, one can for instance upgrade the PP rules to include conjoined PPs (thus allowing e.g. "what is the temperature in the kitchen and the living room") without in any way altering the grammar rules in the temperature sensor module. In order for the scheme to work, the "interface methods" -- the major categories -- naturally need to be well defined. In practice, this implies restrictions on the way we handle three things: the set of syntactic features associated with a category, the range of possible values (the domain) associated with each feature, and the semantics of the category. We consider each of these in turn. Most obviously, we need to standardise the feature-set for the category. In the preferred embodiment, we define most major categories in the core grammar module, to the extent of specifying there the full range of features associated with each category. It turns out, however, that it is sometimes desirable not to fix the domain of a feature in the core grammar, but rather to allow this domain to be extended as new modules are added. The issues that arise here are interesting, and we will discuss them in some detail. The problems occur primarily in connection with features mediating sortal constraints. As we have already seen in examples above, most constituents will have at least one sortal feature, encoding the sortal type of the constituent; there may also be further features encoding the sortal types of possible complements and adjuncts. For example, the V category has a feature v_type encoding the sortal type of the V itself, a feature obj_sem_np_type encoding the sortal type of a possible direct object, and a feature vp_modifiers_type encoding the sortal type of a possible post-verbal modifier. Features like these pose two interrelated problems. First, the plug and play scenario implies that we cannot know ahead of time the whole domain of a sortal feature. It is always possible that we will connect a device whose associated grammar module requires definition of a new sortal type, in order to enforce appropriate constraints in the language model. The second problem is that it is still often necessary to define grammar rules referring to sortal features before the domains of these features are known. In particular, the core module will contain many such rules. Even before knowing the identity of any specific devices,
general grammar rules may well want to distinguish between "device" NPs and "location" NPs. For example, the general "where-question" rule has the form

QUESTION where is NP

Here, we prefer to constrain the NP so as to make it refer only to devices, since the system currently has no way to interpret a where-question referring to a room, e.g. "where is the bathroom". We have addressed these issues in a natural way by adapting the OO idea of inheritance: specifically, we define a hierarchy of possible feature values, allowing one feature value to inherit from another. In the context of the "where is NP" rule above, we define the rule in the core module; in this
module, the sortal NP feature sem_np_type may only take the two values device and location, which we specify with the declaration (we have slightly simplified the form of the declaration for expository purposes):

domain(sem_np_type, [location, device])

This allows us to write the constrained "where is" rule as

QUESTION where is NP:[sem_np_type=device]

Suppose now that we add modules for both on/off switchable and dimmable devices; we would like to make these into distinct sortal types, called switchable_device and dimmable_device. We do this by including the following declarations in the "switchable" module:

domain(sem_np_type, [location, device, switchable_device])
specialises(switchable_device, device)

and correspondingly in the "dimmable" module:

domain(sem_np_type, [location, device, dimmable_device])
specialises(dimmable_device, device)
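A minimal sketch of how declarations of this kind might be merged, and of how a feature value could then be expanded into the disjunction of the values specialising it, is given below; the data layout and function names are assumptions, and the compile-time behaviour it illustrates is described in the paragraph that follows:

```python
# Illustrative sketch: merge domain/specialises declarations from several
# modules and expand a feature value into the disjunction of its
# specialisations. Data layout and function names are assumptions only.
CORE = {"domain": {"sem_np_type": ["location", "device"]}, "specialises": {}}
SWITCHABLE = {"domain": {"sem_np_type": ["location", "device", "switchable_device"]},
              "specialises": {"switchable_device": "device"}}
DIMMABLE = {"domain": {"sem_np_type": ["location", "device", "dimmable_device"]},
            "specialises": {"dimmable_device": "device"}}

def merge(modules):
    domain, specialises = {}, {}
    for m in modules:
        for feat, values in m["domain"].items():
            domain.setdefault(feat, [])
            domain[feat] += [v for v in values if v not in domain[feat]]
        specialises.update(m["specialises"])
    return domain, specialises

def expand(value, specialises):
    """Replace a value by the disjunction of the values that specialise it."""
    children = [child for child, parent in specialises.items() if parent == value]
    expanded = []
    for child in children:
        expanded += expand(child, specialises)
    return expanded or [value]

domain, specialises = merge([CORE, SWITCHABLE, DIMMABLE])
print(domain["sem_np_type"])
# ['location', 'device', 'switchable_device', 'dimmable_device']
print(" V ".join(expand("device", specialises)))
# switchable_device V dimmable_device
```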
When all these declarations are combined at compile-time, the effect is as follows.
The domain of the sem_np_type feature is now the union of the domains specified by each component, and is thus the set (location, device, switchable_device, dimmable_device). Since switchable_device and dimmable_device are the precise values specialising device, the compiler systematically replaces the original feature value device with the disjunction

switchable_device V dimmable_device

Thus the "where is" rule now becomes

QUESTION where is NP:[sem_np_type=switchable_device V dimmable_device]

If new modules are added which further specialise switchable_device, then the rule will again be adjusted by the compiler so as to include appropriate new elements in the disjunction. The important point to notice here is that no change is made to the original rule definition; in line with normal OO thinking, the feature domain information is distributed across several independent modules, and the changes occur invisibly at compile-time. We have so far said nothing about how we deal with semantics, and we conclude the section by sketching our treatment. In fact, it is not clear to us that the demands of supporting Plug and Play greatly affect semantics. If they do, the most important practical consideration is probably that plug and play becomes easier to realise if the semantics are kept simple. We have at any rate adopted a minimal semantic representation scheme, and the lack of problems we have experienced with regard to semantics may partly be due to this. The annotated CFG grammars produced by our compiler are in normal Nuance Grammar Specification Language (GSL) notation, which includes semantics; unification grammar rules encode semantics using the distinguished feature sem, which translates into the GSL return construction. So for example the unification grammar rules
DEVICE_NOUN:[sem=light] -> light
DEVICE_NOUN:[sem=heater] -> heater

translate into the GSL rule
DEVICE_NOUN [light {return(light)} heater {return(heater)}]
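A small sketch of how lexical rules of this kind might be rendered as a GSL alternation is given below; the input format and helper function are illustrative assumptions rather than the actual compiler:

```python
# Illustrative sketch: render simple lexical unification rules as a Nuance GSL
# alternation with return() semantics. The input format is an assumption.
def lexical_rules_to_gsl(category, entries):
    """entries: list of (word, sem_value) pairs for one category."""
    alternatives = " ".join("{} {{return({})}}".format(word, sem)
                            for word, sem in entries)
    return "{} [{}]".format(category, alternatives)

print(lexical_rules_to_gsl("DEVICE_NOUN", [("light", "light"), ("heater", "heater")]))
# DEVICE_NOUN [light {return(light)} heater {return(heater)}]
```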
Unification grammar rules may contain variables, translating down into GSL variables; so for example,
NP:[sem=[D,N]] -> DET:[sem=D] NOUN:[sem=N]

translates into the GSL rule

NP (DET:d NOUN:n) {return(($d $n))}

Our basic semantic representation is a form of feature/value notation, extended to allow handling of conjunction. We allow four types of semantic construction (a sketch follows the list):
1. Simple values, e.g. light, heater. These are typically associated with lexical entries.
2. Feature/value pairs expressed in list notation, e.g. [device, light] or [location, kitchen]. These are associated with nouns, adjectives and similar constituents.
3. Lists of feature/value pairs, e.g. [[device, light], [location, kitchen]]. These are associated with major constituents such as NP, PP, VP and S.
4. Conjunctions of lists of feature/value pairs, e.g. [and, [[device, light]], [[device, heater]]]. These represent conjoined constituents, e.g. conjoined NPs, PPs and Ss.
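The sketch below illustrates the four construction types and the concatenation operation used in the rule semantics discussed next; the helper names are illustrative assumptions:

```python
# Illustrative sketch of the four semantic construction types and of the
# concat operation used in rule semantics. Names are assumptions only.
simple_value = "light"                                   # 1. simple value
pair = ["location", "kitchen"]                           # 2. feature/value pair
np_sem = [["device", "light"], ["location", "kitchen"]]  # 3. list of pairs
conj_sem = ["and", [["device", "light"]], [["device", "heater"]]]  # 4. conjunction

def concat(*sems):
    """Concatenate the semantic contributions of a rule's daughters."""
    result = []
    for sem in sems:
        result += sem
    return result

# Semantics of the nominal PP rule: NP:[sem=concat(Np, Pp)] -> NP:[sem=Np] PP:[sem=Pp]
print(concat([["device", "light"]], [["location", "kitchen"]]))
# [['device', 'light'], ['location', 'kitchen']]
```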
This scheme makes it straightforward to write the semantic parts of grammar rules.
Most often, the rule just concatenates the semantic contributions of its daughters: thus for example the semantic features of the nominal PP rule are simply
NP:[sem=concat(Np, Pp)] -> NP:[sem=Np] PP:[sem=Pp]

The semantic output of a conjunction rule is typically the conjunction of its daughters excluding the conjunction itself, e.g.
NP:[sem=[and, Np1, Np2]] -> NP:[sem=Np1] and NP:[sem=Np2]

It will be appreciated that many variations are possible to the systems described and that various inventions will be embodied in those systems.
Claims (27)
1. A system for the speech control of a network of devices in which a device can be added to or removed from the network or updated, the system containing speech recognition components that are capable of automatic reconfiguration in response to a linguistic server being informed by the network that a device in the network has been added, removed or updated, the reconfiguration comprising the addition, removal or update of speech recognition information pertinent to the device whose addition, removal or updating has initiated the reconfiguration.
2. A system as claimed in claim 1 wherein said speech recognition components comprise a unification grammar implementing a plurality of core grammar rules which are not updated or removed by said reconfiguration.
3. A system as claimed in claim 2 wherein said speech recognition components further comprise other unification grammars implementing grammatical information relevant to an environment in which the network resides.
4. A system as claimed in claim 1, 2 or 3 wherein said speech recognition components are divided into a plurality of modules applicable to differing numbers of said devices such that each device is associated with a set of linked modules of decreasing levels of generality.
5. A system as claimed in claim 4 arranged such that said reconfiguration affects more than one of said modules.
6. A system as claimed in any preceding claim further including dialogue management components that are also capable of automatic reconfiguration in response to the linguistic server being informed by the network that a device in the network has been added, removed or updated, the reconfiguration comprising the addition, removal or update of dialogue management information pertinent to the device whose addition, removal or updating has initiated the reconfiguration.
7. A system as claimed in any preceding claim comprising a device in the network having an associated device grammar in the form of data describing the grammatical rules and declarations which, either alone or in combination with the other grammatical rules and declarations made available to the system, define a set of valid sentences that may be used to query, command and control the device.
8. A system as claimed in any preceding claim wherein a device in the network has an associated device dialogue management specification in the form of data describing the dialogue management behaviour of the device.
9. A system as claimed in any preceding claim wherein said network comprises means storing data relating to the operation of a further device not connected to the network such that said further device may be added to said network without requiring any manual configuration thereof.
10. A system as claimed in claim 9 comprising means for generating error messages specific to said further device if said further device is not in fact present on the network.
11. A system as claimed in any preceding claim further comprising at least one device arranged to make available to the linguistic server data relating to the operation of said device.
12. A system as claimed in any preceding claim comprising means for establishing a data connection with a server to acquire information relating to the grammar or dialogue management of devices which are subsequently updated or added to said network.
13. A system as claimed in claim 12 wherein said means for establishing a data connection comprises a wide area network connection such as the Internet, the system being arranged to acquire from said updated or added devices details of the location of said information such as a uniform resource locator.
14. A method of introducing a new or updated device to a system as claimed in any of claims 1 to 11 comprising establishing a data connection between said new or updated device and said network and transferring from said device to said network data relating to a portion of grammar associated with said device.
15. A method of introducing a new or updated device to a system as claimed in any of claims 1 to 11 comprising establishing a data connection between said new or updated device and said network and transferring from said device to said network data relating to a required modification to at least one of: a speech recognition element; a parsing element; a context independent semantic interpreter; a context dependent semantic interpreter; a domain dependent semantic interpreter; an action executor; a speech generator; and a speech synthesiser.
16. A method of introducing a new or updated device to a system as claimed in claim 4 comprising adding or updating more than one of said modules.
17. Apparatus for the speech control of a plurality of devices comprising: a network of controllable devices; a linguistic server; speech recognition components configured to recognise speech uttered by a user and to control said devices in response thereto; wherein said network is arranged to inform said linguistic server if a device is added to said network or one of said devices is updated or removed from the network, the linguistic server being arranged thereafter to reconfigure said speech recognition components automatically by means of adding, updating or removing speech recognition information pertinent to the device whose addition, removal or updating has initiated the reconfiguration.
18. Apparatus as claimed in claim 17 arranged such that a device being added or updated supplies said speech recognition information.
19. Apparatus as claimed in claim 17 arranged such that a device being added or updated supplies identifying information to allow said linguistic server to acquire said
speech recognition information from a further server using said device-identifying information.
20. Computer software for the speech control of a network of devices in which a device can be added to or removed from the network or updated, the software containing speech recognition components that are capable of automatic reconfiguration in response to a linguistic server being informed by the network that a device in the network has been added, removed or updated, the reconfiguration comprising the addition, removal or update of speech recognition information pertinent to the device whose addition, removal or updating has initiated the reconfiguration.
21. Computer software as claimed in claim 20 wherein said speech recognition components comprise a unification grammar encoding a plurality of core grammar rules which are not updated or removed by said reconfiguration.
22. Computer software as claimed in claim 20 or 21 wherein said speech recognition components are divided into a plurality of modules applicable to differing numbers of said devices such that each device is associated with a set of linked modules of decreasing levels of generality.
23. Server software adapted, when run on a computer, to cause said computer to carry out the functions of the linguistic server of the apparatus claimed in claims 17 to 19.
24. A system for the update and maintenance of a speech interface to a computer based information system in which information in the information system can be added to or removed from the network or updated, the system containing speech recognition components that are capable of automatic reconfiguration in response to a linguistic server being informed by the network that a piece of information in the network has been added, removed or updated, the reconfiguration comprising the addition, removal or update of speech recognition information pertinent to the information or device whose addition, removal or updating has initiated the reconfiguration.
25. Computer software for the speech control of a computer-based information system in which information can be added to or removed from the system or updated, the software containing speech recognition components that are capable of automatic reconfiguration in response to a linguistic server being informed by the network that information in the network has been added, removed or updated, the reconfiguration comprising the addition, removal or update of speech recognition information pertinent to the information whose addition, removal or updating has initiated the reconfiguration.
26. A system for the speech control of a network of devices substantially as hereinbefore described with reference to the accompanying drawings.
27. Computer software for the speech control of a network of devices substantially as hereinbefore described with reference to the accompanying drawings.
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB0111607A GB0111607D0 (en) | 2001-05-11 | 2001-05-11 | Spoken interface systems |
Publications (2)
Publication Number | Publication Date |
---|---|
GB0210891D0 GB0210891D0 (en) | 2002-06-19 |
GB2379787A true GB2379787A (en) | 2003-03-19 |
Family
ID=9914499
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
GB0111607A Ceased GB0111607D0 (en) | 2001-05-11 | 2001-05-11 | Spoken interface systems |
GB0210891A Withdrawn GB2379787A (en) | 2001-05-11 | 2002-05-13 | Spoken interface systems |
Family Applications Before (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
GB0111607A Ceased GB0111607D0 (en) | 2001-05-11 | 2001-05-11 | Spoken interface systems |
Country Status (1)
Country | Link |
---|---|
GB (2) | GB0111607D0 (en) |
2001
- 2001-05-11 GB GB0111607A patent/GB0111607D0/en not_active Ceased
2002
- 2002-05-13 GB GB0210891A patent/GB2379787A/en not_active Withdrawn
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1045374A1 (en) * | 1999-04-13 | 2000-10-18 | Sony International (Europe) GmbH | Merging of speech interfaces for concurrent use of devices and applications |
GB2365145A (en) * | 2000-07-26 | 2002-02-13 | Canon Kk | Voice control of a machine |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8812171B2 (en) | 2007-04-26 | 2014-08-19 | Ford Global Technologies, Llc | Emotive engine and method for generating a simulated emotion for an information system |
US9189879B2 (en) | 2007-04-26 | 2015-11-17 | Ford Global Technologies, Llc | Emotive engine and method for generating a simulated emotion for an information system |
US9292952B2 (en) | 2007-04-26 | 2016-03-22 | Ford Global Technologies, Llc | Task manager and method for managing tasks of an information system |
US9495787B2 (en) | 2007-04-26 | 2016-11-15 | Ford Global Technologies, Llc | Emotive text-to-speech system and method |
US9811935B2 (en) | 2007-04-26 | 2017-11-07 | Ford Global Technologies, Llc | Emotive advisory system and method |
Also Published As
Publication number | Publication date |
---|---|
GB0111607D0 (en) | 2001-07-04 |
GB0210891D0 (en) | 2002-06-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11182556B1 (en) | Applied artificial intelligence technology for building a knowledge base using natural language processing | |
EP1891625B1 (en) | Dialogue management using scripts | |
US6513010B1 (en) | Method and apparatus for separating processing for language-understanding from an application and its functionality | |
US7716056B2 (en) | Method and system for interactive conversational dialogue for cognitively overloaded device users | |
EP1772854B1 (en) | Method and apparatus for organizing and optimizing content in dialog systems | |
JP5166661B2 (en) | Method and apparatus for executing a plan based dialog | |
JP5366810B2 (en) | A computer-used method for developing ontology from natural language text | |
KR100764174B1 (en) | Apparatus for providing voice dialogue service and method for operating the apparatus | |
EP0953896B1 (en) | Semantic recognition system | |
US6975983B1 (en) | Natural language input method and apparatus | |
US20050138556A1 (en) | Creation of normalized summaries using common domain models for input text analysis and output text generation | |
WO2008048090A2 (en) | Method, device, computer program and computer program product for processing linguistic data in accordance with a formalized natural language. | |
US20020087310A1 (en) | Computer-implemented intelligent dialogue control method and system | |
US8509396B2 (en) | Automatic creation of complex conversational natural language call routing system for call centers | |
KR20080005745A (en) | Spoken dialog system for human computer interface and response method therein | |
EP1638081A1 (en) | Creating a speech recognition grammar for alphanumeric concepts | |
CN101185116A (en) | Using strong data types to express speech recognition grammars in software programs | |
Rayner et al. | Plug and play speech understanding | |
GB2379787A (en) | Spoken interface systems | |
Hastings | Design and implementation of a speech recognition database query system | |
Rayner et al. | Plug and play spoken dialogue processing | |
Rapp et al. | Dynamic speech interfaces | |
Filipe et al. | Hybrid Knowledge Modeling for Ambient Intelligence | |
Brasoveanu et al. | The Basics of Syntactic Parsing in ACT-R | |
Sunnehall | Robust parsing using dependency with constraints and preference |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
WAP | Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1) |