GB2375210A - Grammar coverage tool for spoken language interface - Google Patents

Grammar coverage tool for spoken language interface

Info

Publication number
GB2375210A
Authority
GB
United Kingdom
Prior art keywords
sentences
grammar
sentence
user
algorithm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
GB0110532A
Other versions
GB0110532D0 (en)
GB2375210B (en)
Inventor
David Horowitz
Peter Phelan
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Vox Generation Ltd
Original Assignee
Vox Generation Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Vox Generation Ltd filed Critical Vox Generation Ltd
Priority to GB0110532A priority Critical patent/GB2375210B/en
Publication of GB0110532D0 publication Critical patent/GB0110532D0/en
Priority to PCT/GB2002/001962 priority patent/WO2002089113A1/en
Priority to GB0326763A priority patent/GB2391993B/en
Publication of GB2375210A publication Critical patent/GB2375210A/en
Application granted granted Critical
Publication of GB2375210B publication Critical patent/GB2375210B/en
Anticipated expiration legal-status Critical
Expired - Fee Related legal-status Critical Current


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L15/19 Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules
    • G10L15/193 Formal grammars, e.g. finite state automata, context free grammars or word networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models

Landscapes

  • Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

A grammar coverage tool operates on sets of sentences which can be expressed as ordered sequences of word class tags. Each sentence has previously been marked as either a positive or a negative example with respect to some previously determined preference criterion. The entire set is divided into two parts: a training set and a test set. The training set is re-written as a set of binary vectors, where each tag which occurs in the training set is an attribute. The binary vectors are then submitted to an algorithm which induces a decision tree classifying the training set into positive and negative examples. The tree's proposed classification of the test set is then compared with the original, authentic classification, and the percentage error rate can be calculated. If the error rate is too high, the process can be re-run with a larger volume of training data until an acceptably low error rate is achieved. The induction of decision trees may use either the ID3 algorithm or the C4.5 algorithm.

Description

GRAMMAR COVERAGE TOOL FOR SPOKEN LANGUAGE INTERFACE

This invention relates to spoken language interfaces which enable a user to interact using voice with a computer system. It is more specifically concerned with the writing of grammars for use in such interfaces.
A spoken language interface involves a two-way dialogue between the user and the computer system. An automatic speech recognition system is used to comprehend the speech delivered by the user, and an automatic speech generation system is used to generate speech to be played out to the user. This may be, for example, a mixture of speech synthesis and recorded voice.
Spoken language interfaces are well known. One example is described in our earlier application GB 01 05 005.3 filed on 28 February 2001.
Spoken language interfaces rely on grammars to interpret users' commands and formulate responses. The grammar defines the sequences of words which it is possible for the interface to recognise. A grammar is a set of rules which defines a set of sentences; the rules are typically expressed in some kind of algebraic notation. A set of grammar rules usually defines a much larger set of sentences: few rules cover many sentences. Size constraints apply to these grammars. Automatic speech recognisers can only achieve high recognition accuracy with grammars of a limited size. This means that there is a very strong motivation to remove any useless expressions from a grammar.
In any computer-related application, there will be a limited number of expressions that a user is likely to say. If one considers an on-line flight reservation service, it can easily be seen that it is possible to predict the majority of expressions that a user is likely to use. It can further be seen that a grammar is specific to a given application. In other words, for each application, a new grammar has to be written for the Spoken Language Interface (SLI).
When constructing a grammar, some sentences may be preferable to others, according to one or more criteria. For example, expressions which are grammatically correct (in the old-fashioned sense of the term) might be preferred over those which are not. Alternatively, expressions which users use frequently, based on significant volumes of user data, might be preferred over those which occur less often. A particular SLI provider might develop a consistent style across applications and favour expressions in that style.
In the ideal case, a grammar should cover all and only the expressions users are likely to say.
At present, grammar writing is a time consuming operation involving the expertise of grammar writers to encapsulate manually the preferred sentences within each grammar and requiring painstaking care. This makes it hard to bring new applications to market quickly and also very expensive.
We have appreciated that there is a need for a tool which can assist in the process of grammar writing.
The invention aims to provide such a tool.
The invention is defined by the independent claims to which reference should be made. Preferred features are set out in the dependent claims.
A grammar coverage tool embodying the invention has the advantage of enabling automatic elimination of unfavoured sentences, on the basis of one or more preference criteria. Unfavoured sentences include those generated by incorrectly written grammar rules, those which are syntactically undesirable, and those which fail to match the preference criteria.
During the process of writing a grammar, a grammar writer may write a rule that is incorrectly structured, that is to say, a rule which generates nonsense sentences. Embodiments of the invention (the Grammar Coverage Tool) automatically detect poorly written grammar rules and correct them. In this sense, the Grammar Coverage Tool embodying the invention is an automatic debugging tool that saves a significant amount of time in the grammar writing process. It has the further advantage of assessing the quality of a grammar by assuring that all forms of the likely occurring utterances are captured by the grammar. In this sense, the Grammar Coverage Tool embodying the invention is a design enhancement tool that further reduces the time it takes the grammar writer to design a well-written grammar.
An embodiment of the present invention will now be described, by way of example, with reference to the accompanying drawings, in which:
Figure 1 is a schematic overview of a Spoken Language Interface;
Figure 2 is a logical model of the interface architecture;
Figure 3 is a more detailed view of the interface architecture;
Figure 4 is a flow chart illustrating steps in a process embodying the present invention;
Figure 5 is a table showing a matrix of vectors representing tag sequences;
Figure 6 shows a decision tree generated from the matrix of Figure 5; and
Figure 7 shows a vector matrix for a second example.
The system schematically outlined in Figure 1 is a spoken language interface intended for communication with applications via mobile, satellite, or landline telephone.
In the example shown communication is via a mobile telephone 18, but any other voice telecommunications device such as a conventional telephone can be utilised. Calls to the system are handled by a telephony unit 20. Connected to the telephony unit are a voice controller 19, an Automatic Speech Recognition system (ASR) 22 and an Automatic Speech Generation system (ASG) 26. The ASR 22 and ASG 26 are each connected to the voice controller 19. A dialogue manager 24 is connected to the voice controller 19 and also to a spoken language interface (SLI) repository 30, a personalisation and adaptive learning unit 32 which is also attached to the SLI repository 30, and a session and notification manager 28. The dialogue manager is also connected to a plurality of application managers (AMs) 34, each of which is connected to an application which may be content provision external to the system. In the example shown, the content layer includes e-mail, news, travel, information, diary, banking etc. The nature of the content provided is not important to the principles of the invention.
The SLI repository is also connected to a development suite 35, which is discussed further below.
Figure 2 provides a more detailed overview of the architecture of the system. The automatic speech generation unit 26 of Figure 1 includes a basic TTS unit and a batch TTS unit 120, connected to a prompt cache 124 and an audio player 122. It will be appreciated that instead of using generated speech, pre-recorded speech may be played to the user under the control of the voice controller 19. In the embodiment illustrated a mixture of pre-recorded voice and TTS is used.
The system then comprises three levels: session level 120, application level 122 and non-application level 124.
The session level comprises a location manager 126 and a dialogue manager 128. The session level also includes an interactive device control 130 and a session manager 132 which includes the functions of user identification and Help Desk.
The application layer comprises the application framework 134 under which an application manager controls an application. Many application managers and applications will be provided, such as UMS (Unified Messaging Service), Call connect & conferencing, e-Commerce, Dictation etc. The non-application level 124 comprises a back office subsystem 140 which includes functions such as reporting, billing, account management, system administration, "push" advertising and current user profile. A transaction subsystem 142 includes a transaction log together with a transaction monitor and message broker.
In the final subsystem, an activity log 144 and a user profile repository 146 communicate with an adaptive learning unit 148. The adaptive learning unit also communicates with the dialogue manager 128. A personalisation module 150 also communicates with the user profiles repository 146 and the dialogue manager 128.
Referring back to Figure 1, the various functional components are briefly described as follows:
Voice Control 19
This allows the system to be independent of the ASR 22 and TTS 26 by providing an interface to either proprietary or non-proprietary speech recognition, text to speech and telephony components. The TTS may be replaced by, or supplemented by, recorded voice. The voice control also provides for logging and assessing call quality. The voice control will optimise the performance of the ASR.
Spoken Language Interface Repository 30
In contrast to the prior art, grammars (that is, constructs and user utterances for which the system listens), prompts and workflow descriptors are stored as data in a database rather than written in time-consuming ASR/TTS-specific scripts. As a result, multiple languages can be readily supported with greatly reduced development time, a multi-user development environment is facilitated, and the database can be updated at any time to reflect new or updated applications without taking the system down. The data is stored in a notation-independent form. The data is converted or compiled between the repository and the voice control to the optimal notation for the ASR being used. This enables the system to be ASR independent.
ASR & ASG (Voice Engine) 22, 26
The voice engine is effectively dumb as all control comes from the dialogue manager via the voice control.
Dialogue Manager 24
The dialogue manager controls the dialogue across multiple voice servers and other interactive servers (e.g. WAP, Web etc). As well as controlling dialogue flow it controls the steps required for a user to complete a task through mixed initiative, permitting the user to change initiative with respect to specifying a data element (e.g. destination city for travel). The dialogue manager may support comprehensive mixed initiative, allowing the user to change the topic of conversation across multiple applications while maintaining state representations of where the user left off in the many domain-specific conversations. Currently, as initiative is changed across two applications, the state of the conversation is maintained. Within the system, the dialogue manager controls the workflow. It is also able to dynamically weight the user's language model by adaptively controlling the probabilities associated with the likely speaking style and dialogue structures that the individual user employs, in real time, as a function of the current state of the conversation with the user; this is the chief responsibility of the Adaptive Learning Engine. The adaptive learning agent collects user speaking data from call data records. This data, collected from a large domain of calls (thousands), provides the general profile of language usage across the population of speakers. This profile, or mean language model, forms a basis for the first step in adjusting the language model probabilities to improve ASR accuracy. Within a conversation, the individual user's profile is generated and adaptively tuned across the user's subsequent calls. Early in the process, key linguistic cues are monitored and, based on individual user modelling, the elicitation of a particular language utterance dynamically invokes the modified language model profile tailored to the user, thereby adaptively tuning the user's language model profile and increasing the ASR accuracy for that user.
Finally, the dialogue manager includes a personalisation engine. Given the user demographics (age, sex, dialect), a specific personality tuned to the characteristics of that user's demographic group is invoked.
The dialogue manager also allows dialogue structures and applications to be updated or added without shutting the system down. It enables users to move easily between contexts, for example from flight booking to calendar etc; hang up and resume conversation at any point; specify information either step-by-step or in one complex sentence; cut in and direct the conversation; or pause the conversation temporarily.
Telephony
The telephony component includes the physical telephony interface and the software API that controls it. The physical interface controls inbound and outbound calls, handles conferencing, and provides other telephony-related functionality.
Session and Notification Management 28
The session manager initiates and maintains user and application sessions. These are persistent in the event of a voluntary or involuntary disconnection. They can reinstate the call at the position it had reached in the system at any time within a given period, for example 24 hours. A major problem in achieving this level of session storage and retrieval relates to retrieving a session in which a conversation is stored and either a dialogue structure, workflow structure or an application manager has been upgraded. In the preferred embodiment this problem is overcome through versioning of dialogue structures, workflow structures and application managers. The system maintains a count of active sessions for each version and only retires old versions once the version's count reaches zero. An alternative, which may be implemented, requires new versions of dialogue structures, workflow structures and application managers to supply upgrade agents. These agents are invoked by the session manager whenever it encounters old versions in the stored session. A log is kept by the system of the most recent version number. It may be beneficial to implement a combination of these solutions: the former for dialogue structures and workflow structures, and the latter for application managers.
The notification manager brings events to a user's attention, such as the movement of a share price by a predefined margin. This can be accomplished while the user is online, through interaction with the dialogue manager, or offline. Offline notification is achieved either by the system calling the user and initiating an online session or through other media channels, for example SMS, pager, fax, email or other device.
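By way of illustration only, the version counting described above amounts to reference counting per version, with retirement on reaching zero. The following minimal Python sketch uses our own names, not the patent's:

    class VersionRegistry:
        # Tracks how many active sessions reference each version of a
        # dialogue structure, workflow structure or application manager.
        def __init__(self, latest_version):
            self.latest = latest_version
            self.active = {}        # version -> count of live sessions
            self.retired = set()

        def open_session(self, version):
            self.active[version] = self.active.get(version, 0) + 1

        def close_session(self, version):
            self.active[version] -= 1
            # Retire a superseded version only once its count reaches zero.
            if self.active[version] == 0 and version != self.latest:
                del self.active[version]
                self.retired.add(version)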
Application Managers 34
Application Managers (AMs) are components that provide the interface between the SLI and one or more of its content suppliers (i.e. other systems, services or applications). Each application manager (there is one for every content supplier) exposes a set of functions to the dialogue manager to allow business transactions to be realised (e.g. GetEmail(), SendEmail(), BookFlight(), GetNewsItem(), etc). Functions require the DM to pass the complete set of parameters required to complete the transaction. The AM returns the successful result or an error code to be handled in a predetermined fashion by the DM.
An AM is also responsible for handling some stateful information, for example that user A has been passed the first five unread emails. Additionally, it stores information relevant to a current user task, for example flight booking details. It is able to facilitate user access to secure systems, such as banking, email or other.
It can also deal with offline events, such as email arriving while a user is offline or notification from a flight reservation system that a booking has been confirmed. In these instances the AM's role is to pass the information to the Notification Manager.
An AM also exposes functions to other devices or channels, such as web, WAP, etc. This facilitates the multi-channel conversation discussed earlier.
AMs are able to communicate with each other to facilitate aggregation of tasks. For example, booking a flight would primarily involve a flight booking AM, but this would directly utilise a calendar AM in order to enter flight times into a user's calendar.
AMs are discrete components built, for example, as Enterprise Java Beans (EJBs); they can be added or updated while the system is live.
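As a sketch of the interface an AM exposes, the following hypothetical Python outline uses the function naming given above; the signatures, the result/error-code convention and the mail_service dependency are assumptions, not details from the patent:

    class EmailApplicationManager:
        # One AM per content supplier; exposes business transactions to the DM.
        def __init__(self, mail_service):
            self.mail = mail_service   # the external content supplier
            self.cursor = {}           # stateful info, e.g. how many unread
                                       # emails each user has already been passed

        def get_email(self, user_id, count):
            # The DM passes the complete parameter set; the AM returns the
            # result, or an error code the DM handles in a predetermined way.
            try:
                start = self.cursor.get(user_id, 0)
                messages = self.mail.fetch_unread(user_id, start, count)
                self.cursor[user_id] = start + len(messages)
                return {"ok": True, "messages": messages}
            except ConnectionError:
                return {"ok": False, "error": "E_SUPPLIER_UNAVAILABLE"}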
Transaction & Message Broker 142 (Fig. 2)
The Transaction and Message Broker records every logical transaction, identifies revenue-generating transactions, routes messages and facilitates system recovery.
Adaptive Learning & Personalisation 32; 148, 150 (Fig. 2)
Spoken conversational language reflects quite a bit of a user's psychology, socio-economic background, dialect and speech style. These confounding factors are the reason an SLI is a challenge, a challenge met by embodiments of the invention. Embodiments of the invention provide a method of modelling these features and then tuning the system to effectively listen out for the most likely occurring features. Before discussing in detail the complexity of encoding this knowledge, it is noted that a very large vocabulary of phrases encompassing all dialects and speech styles (verbose, terse or declarative) results in a complex listening test for any recogniser. User profiling, in part, solves the problem of recognition accuracy by tuning the recogniser to listen out for only the likely occurring subset of utterances in a large domain of options.
The adaptive learning technique is a stochastic (statistical) process which first models which language types, dialects and styles the entire user base employs. By monitoring the spoken language of many hundreds of calls, a profile is created by counting the language most utilised across the population and profiling less likely occurrences. Indeed, the less likely occurring utterances, or those that do not get used at all, could be deleted to improve accuracy. But then a new user who might employ the deleted, not yet observed, phrase could come along, and he would have a dissatisfying experience; a system tuned for the average user would not work well for him. A more powerful technique is to profile individual user preferences early on in the transaction, and simply amplify those sets of utterances over those utterances less likely to be employed. The general data of the masses is used initially to set a set of tuning parameters, and during a new phone call individual stylistic cues, such as phrase usage, are monitored and the model is immediately adapted to suit that caller. It is true that those who use the least likely utterances across the mass may initially be asked to repeat what they have said, after which the cue re-assigns the probabilities for the entire vocabulary.
The approach, then, embodies statistical modelling across an entire population of users. The stochastic nature of the approach arises when new observations are made across the average mass and language modelling weights are adaptively assigned to tune the recogniser.
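A minimal sketch of this re-weighting idea follows, assuming a simple phrase-frequency language model; the boost factor and the normalisation step are our own illustrative choices, not details from the patent:

    from collections import Counter

    def population_model(calls):
        # Mean language model: relative frequency of each phrase
        # across many hundreds of monitored calls.
        counts = Counter(phrase for call in calls for phrase in call)
        total = sum(counts.values())
        return {phrase: n / total for phrase, n in counts.items()}

    def adapt_to_caller(model, observed_phrases, boost=2.0):
        # Amplify the utterances this individual caller actually uses,
        # then renormalise so the weights remain probabilities.
        adapted = dict(model)
        for phrase in observed_phrases:
            if phrase in adapted:
                adapted[phrase] *= boost
        z = sum(adapted.values())
        return {phrase: w / z for phrase, w in adapted.items()}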
Help Assistant & Interactive Training
The Help Assistant & Interactive Training component allows users to receive real-time interactive assistance and training. The component provides for simultaneous, multi-channel conversation (i.e. the user can talk through a voice interface and at the same time see a visual representation of their interaction through another device, such as the web).
Databases
The system uses a commercially available database such as Oracle 8i from Oracle Corp.
Central Directory
The Central Directory stores information on users, available applications, available devices, locations of servers and other directory-type information.
System Administration - Infrastructure
The System Administration - Applications provides centralised, web-based functionality to administer the custom-built components of the system (e.g. Application Managers, Content Negotiators, etc.).
Development Suite (35)
This provides an environment for building spoken language systems incorporating dialogue and prompt design, workflow and business process design, version control and system testing. It is also used to manage deployment of system updates and versioning.
Rather than having laboriously to code likely occurring user responses in a cumbersome grammar (e.g. a BNF grammar, Backus-Naur Form), resulting in time-consuming detailed syntactic specification, the development suite provides an intuitive, hierarchical, graphical display of language, leaving the modelling act to creatively uncover the precise utterances but reducing the coding act to the simple entry of a data string. The development suite provides a Rapid Application Development (RAD) tool that combines language modelling with business process design (workflow).
The grammar coverage tool is embodied in the development suite. After a rule is entered into the RAD system, the GCT is invoked. It tests the efficiency of the rule (i.e. that it does not generate any garbage utterances which a user would not say, due to incorrect syntax) and also evaluates whether the rule generates the necessary linguistic coverage adopted by the house style, or conventional wisdom about the way in which users phrase responses.
Dialogue Subsystem
The Dialogue Subsystem manages, controls and provides the interface for human dialogue via speech and sound.
Referring to Figure 1, it includes the dialogue manager, spoken language interface repository, session and notification managers, the voice controller 19, the Automatic Speech Recognition unit 22, the Automatic Speech Generation unit 26 and telephony components 20. The subsystem is illustrated in the more detailed architecture of the interface shown in Figure 3.
Before describing the dialogue subsystem in more detail, it is appropriate first to discuss what a Spoken Language Interface (SLI) is.
A SLI refers to the hardware, software and data components that allow users to interact with a computer through spoken language. The term "interface" is particularly apt in the context of voice interaction, since the SLI acts as a conversational mediator, allowing information to be exchanged between user and system via speech. In its idealised form, this interface would be "invisible" and the interaction would, from the user's standpoint, appear as seamless and natural as a conversation with another person. In fact, one principal aim of most SLI projects is to create a system that is as near as possible to a human-human conversation.
If the exchange between user and machine is construed as a dialogue, the objective for the SLI development team is to create the ears, mind and voice of the machine. In computational terms, the ears of the system are created by the Automatic Speech Recognition (ASR) system 22. The voice is created via the Automatic Speech Generation (ASG) software 26, and the mind is made up of the computational power of the hardware and the databases of information contained in the system. The present system uses software developed by other companies for its ASR and ASG. Suitable systems are available from Nuance and Lernout & Hauspie respectively. These systems will not be described further. However, it should be noted that the system allows great flexibility in the selection of these components from different vendors. Additionally, the basic Text To Speech unit supplied, for example, by Lernout & Hauspie may be supplemented by an audio subsystem which facilitates batch recording of TTS (to reduce system latency and CPU requirements), streaming of audio data from other sources (e.g. music, audio news, etc) and playing of audio output from standard digital audio file formats.
One implementation of the system is given in Figure 3.
It should be noted that this is a simplified description. A voice controller 19 and the dialogue manager 24 control and manage the dialogue between the system and the end user.
The dialogue is dynamically generated at run time from a SLI repository which is managed by a separate component, the development suite.
The ASR unit 22 comprises a plurality of ASR servers.
The ASG unit 26 comprises a plurality of speech servers.
Both are managed and controlled by the voice controller.
The telephony unit 20 comprises a number of telephony board servers and communicates with the voice controller, the ASR servers and the ASG servers.
Calls from users, shown as mobile phone 18, are handled initially by the telephony server 20, which makes contact with a free voice controller. The voice controller 19 locates an available ASR resource and identifies the relevant ASR and ASG ports to the telephony server. The telephony server can now stream voice data from the user to the ASR server, and the ASG can stream audio to the telephony server.
The voice controller, having established contact with the ASR and ASG servers, now informs the Dialogue Manager, which requests a session on behalf of the user from the session manager. As a security precaution, the user is required to provide authentication information before this step can take place. This request is made to the session manager 28, which is represented logically at 132 in the session layer in Figure 2. The session manager server 28 checks with a dropped session store (not shown) whether the user has a recently dropped session. A dropped session could be caused by, for example, a user on a mobile entering a tunnel. This facility enables the user to be reconnected to a session without having to start over again.
The dialogue manager 24 communicates with the application managers 34 which in turn communicate with the internal/external services or applications to which the user has access. The application managers each communicate with a business transaction log 50, which records transactions, and with the notification manager 28b. Communications from the application manager to the notification manager are asynchronous and communications from the notification manager to the application managers are synchronous. The notification manager also sends communications asynchronously to the dialogue manager 24. The dialogue manager 24 has a synchronous link with the session manager 28a, which has a synchronous link with the notification manager.
The dialogue manager 24 communicates with the adaptive learning unit 33 via an event log 52 which records user activity so that the system can learn from the user's interaction. This log also provides a series of debugging and reporting information. The adaptive learning unit is connected to the personalisation module 54, which is in turn connected to the dialogue manager. Workflow 56, dialogue 58 and personalisation 60 repositories are also connected to the dialogue manager 24 through the personalisation module 54, so that a personalised view is always handled by the dialogue manager 24. These three repositories make up the SLI repository referred to earlier.
As well as receiving data from the workflow, dialogue and personalisation repositories, the personalisation module can also write to the personalisation repository 60. The development suite 35 is connected to the workflow and dialogue repositories 56, 58 and implements functional specifications of applications, storing the relevant grammars, dialogues, workflow and application manager function references for each application in the repositories. It also facilitates the design and implementation of system, help, navigation and misrecognition grammars, dialogues, workflow and action references in the same repositories.
The dialogue manager 24 provides the following key areas of functionality: the dynamic management of task-oriented conversation and dialogue; the management of synchronous conversations across multiple formats; and the management of resources within the dialogue subsystem. Each of these will now be considered in turn.
Dynamic Management of Task-Oriented Conversation and Dialogue
The conversation a user has with the system is determined by a set of dialogue and workflow structures, typically one set for each application. The structures store the speech to which the user listens, the keywords for which the ASR listens and the steps required to complete a task (workflow). By analysing what the users say, which is returned by the ASR, and combining this with what the DM knows about the current context of the conversation, based on the current state of the dialogue structure, workflow structure, and application & system notifications, the DM determines its next contribution to the conversation or the action to be carried out by the AMs. The system allows the user to move between applications or contexts using either hotword or natural language navigation. The complex issues relating to managing state as the user moves from one application to the next, or even between multiple instances of the same application, are handled by the DM. This state management allows users to leave an application and return to it at the same point as when they left. This functionality is extended by another component, the session manager, to allow users to leave the system entirely and return to the same point in an application when they log back in; this is discussed more fully later under Session Manager.
The dialogue manager communicates via the voice controller with both the speech engine (ASG) 26 and the voice recognition engine (ASR) 22. The output from the speech generator 26 is voice data from the dialogue structures, which is played back to the user either as dynamic text to speech, as a pre-recorded voice or in another stored audio format. The ASR listens for keywords or phrases that the user might say.
Typically, the dialogue structures are predetermined (although stochastic language models, or hybrids of the two, could be employed in an implementation of the system). Predetermined dialogue structures or grammars are statically generated when the system is inactive. This is acceptable in prior art systems as scripts tended to be simple and did not change often once a system was activated. However, in the present system, the dialogue structures can be complex and may be modified frequently while the system is active. To cope with this, the dialogue structure is stored as data in a run-time repository, together with the mappings between recognised conversation points and application functionality. The repository is dynamically accessed and modified by multiple sources even when active users are on-line.
The dialogue subsystem comprises a plurality of voice controllers 19 and dialogue managers 24 (shown as a single server in Figure 3).
The ability to update the dialogue and workflow structures dynamically greatly increases the flexibility of the system. In particular, it allows updates of the voice interface and applications without taking the system down; and provides for adaptive learning functionality which enriches the voice experience for the user as the system becomes more responsive and friendly to a user's particular syntax and phraseology with time. Considering each of these two aspects in more detail:
Updates
Today we are accustomed to having access to services 24 hours a day, and for mobile professionals this is even more the case given the difference in time zones. This means the system must run non-stop, 24 hours a day, 7 days a week. Therefore an architecture and system that allows new applications and services, or merely improvements in interface design, to be added with no effect on the serviceability of the system has a competitive advantage in the market place.
Adaptive Learning Functionality
Spoken conversational language reflects quite a bit of a user's psychology, socio-economic background, dialect and speech style. One reason an SLI is a challenge is due to these confounding factors. The solution this system provides to this challenge is a method of modelling these features and then tuning the system to effectively listen out for the most likely occurring features: Adaptive Learning. Without discussing in detail the complexity of encoding this knowledge, suffice it to say that a very large vocabulary of phrases encompassing all dialects and speech styles (verbose, terse or declarative) results in a complex listening test for any ASR. User profiling, in part, solves the problem of recognition accuracy by tuning the recogniser to listen out for only the likely occurring subset of utterances in a large domain of options.
The adaptive learning technique is a stochastic process which first models which language types, dialects and styles the entire user base employs. By monitoring the spoken language of many hundreds of calls, a profile is created by counting the language most utilised across the population and profiling less likely occurrences. Indeed, the less likely occurring utterances, or those that do not get used at all, can be deleted to improve accuracy. But then a new user who might employ the deleted, not yet observed, phrase could come along, and he would have a dissatisfying experience; a system tuned for the average user would not work well for him. A more powerful technique is to profile individual user preferences early on in the transaction, and simply amplify those sets of utterances over those utterances less likely to be employed. The general data of the masses is used initially to set a set of tuning parameters, and during a new phone call individual stylistic cues, such as phrase usage, are monitored and the model is immediately adapted to suit that caller. It is true that those who use the least likely utterances across the mass may initially be asked to repeat what they have said, after which the cue re-assigns the probabilities for the entire vocabulary.
The grammar coverage tool will now be described.
It is first useful to define terms used in the generation of grammars:
Definitions
Tag: a label applied to a class of words which play a similar role in terms of syntax.
Word class: a word class is a set of words which can play a common syntactic role.
Grammar: a set of algebraic rules which define a set of sentences. The rules can be expressed in terms of the tags which are used to label word classes.
Lexicon: a list of words where each is followed by its tag. In certain cases, a word may have more than one tag in the lexicon, which can make the job of reliably applying tags ambiguous.
Coverage: the set of sentences defined by a set of grammar rules. The set thus 'covers' the expressions from a user which may be recognised by the system.
Decision tree: a notation for representing logical rules as a hierarchical structure of nodes and branches.
Binary vector: a bracketed sequence of 1s and 0s.
Training set: a set of classified items to be presented to a learning process.
Test set: a set of classified items used to evaluate the performance of a trained process.
Sentence Structure
Every sentence in every language has a structure. This structure can be described in more or less detail. The coverage tool to be described applies to a class of sentences which can be described as ordered sequences of tags. A tag is simply a label which refers to a class of components. A component is generally an individual word. For example, consider the following simple example which relates to an application for purchasing airline tickets:
Example (1):
Let x = one of [Please, Kindly]
Let y = one of [change, modify]
Let t = the
Let z = one of [<destination city>, <airline>]
Let ? denote that a component is optional, so that a sentence may or may not contain it. The question mark applies to bracketed items as if they were a single indivisible unit. Then the rule
?x ?y ?t z
defines the following set of sentences:
1) Please <destination city>
2) Please the <destination city>
3) Please change the <destination city>
4) Please change <destination city>
5) Change <destination city>
6) Change the <destination city>
7) The <destination city>
8) <destination city>
There is something awkward about sentences 1) and 2). In an appropriate context, sentences 6) to 8) may be quite natural; for example, if the grammar rule is supposed to cover what someone might say at an airline reservation desk, after having been asked, "Which of your details do you want to change?" Sentences 3) to 5) are acceptable as natural in most, if not all, circumstances. Sentences 1) and 2) are 'nonsense' or 'garbage' utterances. We explain below how the tool can propose new grammar rules which will exclude such expressions.
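The expansion of such a rule is mechanical, as the following minimal Python sketch shows (the expand helper is our own construction); fixing x = Please, y = change and z = <destination city> yields exactly the eight sentences listed above:

    from itertools import product

    x = ["Please"]                  # restricted to one alternative each,
    y = ["change"]                  # to reproduce the eight sentences above
    t = ["the"]
    z = ["<destination city>"]

    def expand(rule):
        # rule: list of (alternatives, optional) pairs; an optional slot
        # may also be empty, exactly as the ? notation allows.
        slots = [alts + [""] if optional else alts for alts, optional in rule]
        for combo in product(*slots):
            yield " ".join(word for word in combo if word)

    # ?x ?y ?t z
    for sentence in expand([(x, True), (y, True), (t, True), (z, False)]):
        print(sentence)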
One useful aspect of algebraic notation for grammar rules is that it enables large sets of sentences to be specified by a very compact set of rules. The downside is that these compact rules may not be as precise as we would like, and thus they cover awkward sentences too. Even sentences that are grammatically incorrect might be included even though they would never be spoken. This awkwardness provides us with one possible preference criterion.
Preference Criteria
The collection of sentences listed above comes from expanding the grammar rule. As discussed above, some sentences may be preferred to others, based on one or more preference criteria. It does not matter what the content of such a criterion is, provided that it can be applied consistently. Clearly there is no argument about a frequency-of-usage criterion: no sentence can be both frequently occurring and infrequently occurring. More subjective measures, such as 'intuitive acceptability', may be less consistent. Significant inconsistency in the training data will be disastrous.
In the following descriptions, it does not really matter what preference criterion is used, as long as it can be applied consistently. The point is that, for any set of sentences obtained by expanding a grammar rule, for example sentences 1) to 8) above, some can be categorised as positive examples and some as negative examples.
Possible preference criteria may include:
frequency of usage;
house style;
disfavouring 'garbage', i.e. sentences with poor syntax;
intuitive acceptability.
Other criteria, as yet unspecified, may also be selected.
Defining the preference criterion or criteria to classify each member of a set of sentences is the first step towards improving the coverage defined by the grammar.
The Tagging Procedure
The discussion above introduced the notion of sentence structure, and defined a simple scheme for describing that structure.
It is axiomatic for the invention here described that it is possible to assign tags accurately to each word in a set of sentences. We need reasonably high levels of accuracy, but since the GCT can cope with a small amount of ambiguity, tagging need not be 100% accurate in every case. There are standard tools and techniques available which can be easily adapted to work well with corpora from particular contexts and domains. For example, the Brill tagger is a freely available public domain program which is readily trainable.
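For this simple example a lexicon lookup suffices. The sketch below (the lexicon entries are illustrative, and the placeholder token is written without internal spaces for simplicity) is not the Brill tagger, merely the tag-assignment step such a tagger performs:

    LEXICON = {
        "please": "x", "kindly": "x",
        "change": "y", "modify": "y",
        "the": "t",
        "<destination_city>": "z", "<airline>": "z",
    }

    def tag_sentence(sentence):
        # A real tagger such as Brill's must also resolve words that
        # carry more than one possible tag in the lexicon.
        return [LEXICON[word.lower()] for word in sentence.split()]

    print(tag_sentence("Please change the <airline>"))   # ['x', 'y', 't', 'z']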
The Grammar Coverage Tool
Given a set of sentences, divided into positive and negative examples and tagged as per the method above, the procedure below can be implemented. This is illustrated in the flow chart of Figure 4.
1. Take a set of sentences relating to a particular grammar. According to a defined preference criterion, categorise each sentence as 'good' or 'bad' (120). Ideally the set should reflect as many distinct kinds of preferable sentence as possible. This part of the process may be performed manually.
2. Divide the set into a training set and a test set (140). As a guide, use around 20% of the cases for training and 80% for testing. Other proportions may be used.
3. Use a computer to run an algorithm which can induce a decision tree to classify the sentences of the training set into positive and negative examples, on the basis of the sequence of tags which describes each sentence (160). Standard algorithms such as the ID3 algorithm, described below, or its variant C4.5, which is robust with respect to incomplete data, can be used. The ID3 algorithm is given in J. R. Quinlan's 1985 paper 'The Induction of Decision Trees' (described fully in Machine Learning and Data Mining by Michalski et al., published by John Wiley).
4. Use the decision tree to propose a classification, either positive or negative with respect to the preference criterion, for the sentences in the test set (180).
5. Calculate the percentage error E (200) of the classification predicted for the test set by the new rules, against the actual classification originally given to each member of the test set.
6. If E is high (for example over 5%) (220), consider repeating the procedure from (100) with a larger set. Otherwise the algorithm terminates.
7. A set of rules constituting a new grammar can be easily obtained from the decision tree. These new rules can be offered to a (human) grammar writer, or they may even be deployed directly.
The new rules define a grammar of higher quality, because they include fewer garbage utterances.
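Steps 2 to 6 can be sketched end to end as follows. This illustration substitutes scikit-learn's entropy-based DecisionTreeClassifier for ID3/C4.5 (a stand-in, not the patent's own algorithm) and uses the toy example from this description rather than realistic volumes of data:

    from sklearn.tree import DecisionTreeClassifier

    TAGS = ["x", "y", "t", "z"]

    def to_vector(tag_seq):
        # Binary vector: 1 if the tag occurs in the sentence, else 0.
        return [1 if tag in tag_seq else 0 for tag in TAGS]

    # (tag sequence, authentic label): 1 = good, 0 = bad          (step 1)
    training = [("xytz", 1), ("xyz", 1), ("ytz", 1), ("yz", 1),
                ("tz", 1), ("z", 1), ("xz", 0), ("xtz", 0)]
    test = [("xz", 0), ("xytz", 1), ("yz", 1), ("xtz", 0)]        # (step 2)

    X = [to_vector(s) for s, _ in training]
    y = [label for _, label in training]
    tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)  # (step 3)

    predicted = tree.predict([to_vector(s) for s, _ in test])     # (step 4)
    actual = [label for _, label in test]
    E = 100 * sum(p != a for p, a in zip(predicted, actual)) / len(actual)
    print(f"% error E = {E}")      # (step 5); if E is high, repeat (step 6)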
Inducing a Decision Tree
In step 3 above, the algorithm begins with a training set of good and bad examples of sentences which have previously been given appropriate tag-sequence descriptions.
These examples are used to produce a classification scheme, called a decision tree.
The Algorithms
1. ID3
The ID3 algorithm is used to build a decision tree, given a set of non-categorical attributes C1, C2, .., Cn, the categorical attribute C, and a training set T of records.

    function ID3 (R: a set of non-categorical attributes,
                  C: the categorical attribute,
                  S: a training set) returns a decision tree;
    begin
      If S is empty, return a single node with value Failure;
      If S consists of records all with the same value for the categorical
        attribute, return a single node with that value;
      If R is empty, then return a single node with as value the most frequent
        of the values of the categorical attribute that are found in records
        of S [note that then there will be errors, that is, records that will
        be improperly classified];
      Let D be the attribute with largest Gain(D, S) among attributes in R;
      Let {dj | j = 1, 2, .., m} be the values of attribute D;
      Let {Sj | j = 1, 2, .., m} be the subsets of S consisting respectively
        of records with value dj for attribute D;
      Return a tree with root labelled D and arcs labelled d1, d2, .., dm
        going respectively to the trees
        ID3(R - {D}, C, S1), ID3(R - {D}, C, S2), .., ID3(R - {D}, C, Sm);
    end ID3;

2. C4.5 Extensions
C4.5 introduces a number of extensions of the original ID3 algorithm.
In building a decision tree we can deal with training sets that have records with unknown attribute values by evaluating the gain, or the gain ratio, for an attribute by considering only the records where that attribute is defined.
In using a decision tree, we can classify records that have unknown attribute values by estimating the probability of the various possible results.
Cases of attributes with continuous ranges are dealt with as follows. Say that attribute Ci has a continuous range. Examine the values for this attribute in the training set. Say they are, in increasing order, A1, A2, .., Am. Then for each value Aj, j = 1, 2, .., m, partition the records into those that have Ci values up to and including Aj, and those that have values greater than Aj. For each of these partitions compute the gain, or gain ratio, and choose the partition that maximizes the gain.
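A compact Python transcription of the ID3 pseudocode above follows. It is a sketch under our own representation (records as tuples of attribute values, the tree as nested dicts) rather than the patent's notation, and it omits the C4.5 extensions:

    import math
    from collections import Counter

    def entropy(labels):
        total = len(labels)
        return -sum((n / total) * math.log2(n / total)
                    for n in Counter(labels).values())

    def gain(records, labels, attr):
        # Information gain of partitioning the records on attribute `attr`.
        total = len(labels)
        remainder = 0.0
        for value in set(r[attr] for r in records):
            subset = [l for r, l in zip(records, labels) if r[attr] == value]
            remainder += len(subset) / total * entropy(subset)
        return entropy(labels) - remainder

    def id3(records, labels, attrs):
        if not records:
            return "Failure"
        if len(set(labels)) == 1:              # all records in one class
            return labels[0]
        if not attrs:                          # no attributes left: majority
            return Counter(labels).most_common(1)[0][0]
        best = max(attrs, key=lambda a: gain(records, labels, a))
        branches = {}
        for value in set(r[best] for r in records):
            sub = [(r, l) for r, l in zip(records, labels) if r[best] == value]
            sub_records, sub_labels = zip(*sub)
            branches[value] = id3(list(sub_records), list(sub_labels),
                                  [a for a in attrs if a != best])
        return {"attr": best, "branches": branches}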
Data Representation
In order to be used by the algorithm(s) outlined herein, the sequences of tags which represent the sentence structures need to be encoded as binary vectors. A matrix is used as follows. Consider again the following simple example (1):
Let x = one of [Please, Kindly]
Let y = one of [change, modify]
Let t = the
Let z = one of [<destination city>, <airline>]
Let ? denote that a component is optional, so that a sentence may or may not contain it. Then the rule
?x ?y ?t z
defines a set of sentences which includes these examples (the training set):
1) Please <destination city>
2) Please the <destination city>
3) Please change the <destination city>
4) Please change <destination city>
5) Change <destination city>
6) Change the <destination city>
7) The <destination city>
8) <destination city>
In the binary representation scheme of Figure 5, a 1 indicates that the tag is present in that example, a 0 that it is not.
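The encoding itself is a straightforward presence test per tag, sketched below; the rows printed correspond to the matrix of Figure 5 (the exact column order in the figure may differ):

    TAGS = ["x", "y", "t", "z"]

    # Each training sentence with the tags assigned to its words.
    TRAINING = [
        ("Please <destination city>",            {"x", "z"}),
        ("Please the <destination city>",        {"x", "t", "z"}),
        ("Please change the <destination city>", {"x", "y", "t", "z"}),
        ("Please change <destination city>",     {"x", "y", "z"}),
        ("Change <destination city>",            {"y", "z"}),
        ("Change the <destination city>",        {"y", "t", "z"}),
        ("The <destination city>",               {"t", "z"}),
        ("<destination city>",                   {"z"}),
    ]

    # One binary vector per sentence: 1 = tag present, 0 = absent.
    for sentence, tags in TRAINING:
        vector = [1 if tag in tags else 0 for tag in TAGS]
        print(vector, sentence)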
The classification tree is then made as follows:
1) If, at a given node, all the examples are in the same class (all positive or all negative), then label the node as a leaf and terminate.
2) If there are both positive and negative examples, then select the most informative feature (in our case, one of the tags) and partition into branches according to the value of that tag (i.e. present, or null).
3) Use Shannon's information heuristic to identify the probable most informative feature: this is given by -Σ p log p for each feature, where p is the probability of a particular value occurring in the feature slot across all the examples at this node.
4) Keep iterating the tree from step 1) until all bottom-level nodes are leaf nodes.
The decision tree is shown in Figure 6.
Each path through the tree corresponds to a rule. For the example matrix above, the ID3 algorithm produces the following set of rules:
1) ?x ?y ?t d => Good sentence
2) x ?t d => Bad sentence (garbage)
3) y ?t d => Good sentence
In simple terms, these new rules capture our intuition that if an x (Please) is present, it must be followed by a y (change) if it is to be a good sentence on our selected preference criterion.
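Reading rules off the tree is a walk over its root-to-leaf paths. A sketch follows, with the tree hand-coded in the nested-dict shape used in the ID3 sketch above, so as to match the logic of Figure 6 (a sentence is bad exactly when x is present without y):

    def rules_from_tree(tree, path=()):
        # Yield one (conditions, classification) rule per root-to-leaf path.
        if not isinstance(tree, dict):         # a leaf holds the class label
            yield list(path), tree
            return
        for value, subtree in tree["branches"].items():
            yield from rules_from_tree(subtree, path + ((tree["attr"], value),))

    tree = {"attr": "x", "branches": {
        0: "Good",                                         # no x: good
        1: {"attr": "y", "branches": {1: "Good",           # x with y: good
                                      0: "Bad"}},          # x without y: garbage
    }}

    for conditions, label in rules_from_tree(tree):
        print(conditions, "=>", label)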
Consider further the following test set of sentences, drawn from the same rule which produced the training set.
Example (2):
1) Kindly airline (Bad)
2) Kindly modify the airline (Good)
3) Modify airline (Good)
4) Please airline (Bad)
5) Please modify the airline (Good)
6) Kindly destination (Bad)
7) Change airline (Good)
8) Change the airline (Good)
We can apply the rules derived from the decision tree induced for the training set to these test sentences.
1) ?x ?y ?t d => Good sentence
2) x ?t d => Bad sentence
3) y ?t d => Good sentence
The matrix for this example is shown in Figure 7.
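Applying the induced rules to the test set can be sketched as a single predicate over the tags present in each sentence; the tag sets below follow the lexicon of Example (1), with Kindly as an x and modify as a y:

    # Test sentences from Example (2): tags present, and authentic label.
    TEST = [
        ("Kindly airline",            {"x", "z"},           "Bad"),
        ("Kindly modify the airline", {"x", "y", "t", "z"}, "Good"),
        ("Modify airline",            {"y", "z"},           "Good"),
        ("Please airline",            {"x", "z"},           "Bad"),
        ("Please modify the airline", {"x", "y", "t", "z"}, "Good"),
        ("Kindly destination",        {"x", "z"},           "Bad"),
        ("Change airline",            {"y", "z"},           "Good"),
        ("Change the airline",        {"y", "t", "z"},      "Good"),
    ]

    def classify(tags):
        # Induced rule: an x (Please/Kindly) without a following y
        # (change/modify) marks a garbage sentence.
        return "Bad" if "x" in tags and "y" not in tags else "Good"

    errors = sum(classify(tags) != label for _, tags, label in TEST)
    print(f"% error: {100 * errors / len(TEST)}")    # 0.0, as per Figure 7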
So, in the example of Figure 7, the induced rules give 100% correct classification on the test set (i.e. a 0% error rate). The induced rules identify which sentences are good, and which bad, in terms of the preference criterion. The rules express structural patterns which determine whether a sentence is a good example or not.
In the description above, for the class of sentences which can be represented as ordered sequences of tags, a procedure has been described for identifying which sequences of tags define good examples, and which define bad ones. Good and bad are determined on the basis of a consistently applied preference criterion. Most of this procedure is implemented in software, thus providing a tool which can improve the coverage quality of a grammar with respect to the preference criterion. The improvement is manifested in a modified set of grammar rules which is applied by a spoken language interface. For example, rules can be adopted which identify the good examples as the new grammar for an ASR. The ASR thus no longer risks reduced accuracy by having to listen out for 'bad' sentences.
The embodiment described enables unfavourable elements to be eliminated automatically in contrast to the prior art in which all processing is manual.
In the embodiment described, the steps of selecting the set of sentences, choosing the preference criterion, separating the sentences into a test and a training set, and applying the preference criterion are performed manually. However, these steps may also be automated given a sufficiently large database of sentences to select from.

Claims (9)

  1. A method of writing a grammar for a class of sentences expressed as an ordered sequence of tags, comprising the steps of:
    a) acquiring a set of sentences or candidate sentences in a grammar;
    b) selecting a preference criterion and defining definitive acceptable and unacceptable sentences according to that criterion within the acquired set of sentences;
    c) dividing the set of sentences into a training set of sentences and a test set of sentences;
    d) assigning the words of each sentence a descriptor reflecting its syntactic class to form a tag sequence for each sentence;
    e) inducing a set of grammar rules by running an algorithm on a computer system to form a decision tree on the basis of the tag sequence;
    f) applying the set of induced grammar rules to the test set of sentences to yield a proposed classification of sentences as acceptable or unacceptable examples with respect to the preference criterion;
    g) comparing the proposed classification by application of the set of grammar rules to the test set of sentences with the definitive classification of the test set; and, depending on the result of the comparison, either accepting the rules from the decision tree, or repeating steps a) to g) with an expanded set of sentences or a different division between test and training sets.
  2. A method according to claim 1, comprising encoding the sequences of tags representing sentence structures as binary vectors prior to running the algorithm.
  3. A method according to claim 1 or 2, wherein the algorithm is an ID3 or C4.5 algorithm.
  4. A method according to claim 1, 2 or 3, wherein the preference criterion is selected from the group comprising frequency of usage, house style and/or intuitive acceptability.
  5. A method according to any of claims 1 to 4, wherein the formation of the decision tree comprises the steps of:
    a) labelling a node as a leaf and terminating if all examples have the same classification;
    b) where both acceptable and unacceptable examples are present, selecting the most informative tag and partitioning into branches according to the value of that tag; and
    c) repeating steps a) and b) until all bottom-level nodes are leaf nodes.
  6. A method according to claim 5, wherein the step of selecting the most informative tag comprises identifying the probable most informative tag.
  7. A method according to claim 6, comprising applying Shannon's information heuristic to identify the probable most informative tag.
  8. Apparatus for writing a grammar for a class of sentences expressed as an ordered sequence of tags, in which a set of sentences or candidate sentences in a grammar is acquired, a preference criterion is selected to define definitive acceptable and unacceptable sentences according to that criterion within the acquired set of sentences, the sentences are divided into a training set and a test set, and the words of each sentence are assigned a descriptor reflecting its syntactic class to form a tag sequence for each sentence; the apparatus comprising: means for inducing a set of grammar rules, comprising an algorithm for running on a computer system to form a decision tree on the basis of the tag sequence; means for applying the set of induced grammar rules to the test set of sentences to yield a proposed classification of sentences as acceptable or unacceptable examples with respect to the preference criterion; and means for comparing the proposed classification by application of the set of grammar rules to the test set of sentences with the definitive classification of the test set, and for accepting or rejecting the rules depending on the result of the comparison.
  9. Apparatus according to claim 8, comprising means for automatically acquiring the set of sentences or candidate sentences in the grammar; means for automatically selecting the preference criterion and for defining definitive acceptable and unacceptable sentences according to that criterion; means for automatically dividing the sentence set into training and test sets; and means for automatically forming a tag sequence for each sentence.
GB0110532A 2001-04-30 2001-04-30 Grammar coverage tool for spoken language interface Expired - Fee Related GB2375210B (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
GB0110532A GB2375210B (en) 2001-04-30 2001-04-30 Grammar coverage tool for spoken language interface
PCT/GB2002/001962 WO2002089113A1 (en) 2001-04-30 2002-04-30 System for generating the grammar of a spoken dialogue system
GB0326763A GB2391993B (en) 2001-04-30 2002-04-30 System for generating the grammar of a spoken dialogue system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB0110532A GB2375210B (en) 2001-04-30 2001-04-30 Grammar coverage tool for spoken language interface

Publications (3)

Publication Number Publication Date
GB0110532D0 GB0110532D0 (en) 2001-06-20
GB2375210A true GB2375210A (en) 2002-11-06
GB2375210B GB2375210B (en) 2005-03-23

Family

ID=9913726

Family Applications (2)

Application Number Title Priority Date Filing Date
GB0110532A Expired - Fee Related GB2375210B (en) 2001-04-30 2001-04-30 Grammar coverage tool for spoken language interface
GB0326763A Expired - Lifetime GB2391993B (en) 2001-04-30 2002-04-30 System for generating the grammar of a spoken dialogue system

Family Applications After (1)

Application Number Title Priority Date Filing Date
GB0326763A Expired - Lifetime GB2391993B (en) 2001-04-30 2002-04-30 System for generating the grammar of a spoken dialogue system

Country Status (2)

Country Link
GB (2) GB2375210B (en)
WO (1) WO2002089113A1 (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080208589A1 (en) * 2007-02-27 2008-08-28 Cross Charles W Presenting Supplemental Content For Digital Media Using A Multimodal Application
CN103154936B (en) * 2010-09-24 2016-01-06 新加坡国立大学 For the method and system of robotization text correction
US9733901B2 (en) 2011-07-26 2017-08-15 International Business Machines Corporation Domain specific language design
CN105845134B (en) * 2016-06-14 2020-02-07 科大讯飞股份有限公司 Spoken language evaluation method and system for freely reading question types
US11900928B2 (en) 2017-12-23 2024-02-13 Soundhound Ai Ip, Llc System and method for adapted interactive experiences
EP3502923A1 (en) * 2017-12-22 2019-06-26 SoundHound, Inc. Natural language grammars adapted for interactive experiences
CN111435408B (en) * 2018-12-26 2023-04-18 阿里巴巴集团控股有限公司 Dialog error correction method and device and electronic equipment
CN112069797B (en) * 2020-09-03 2023-09-01 阳光保险集团股份有限公司 Voice quality inspection method and device based on semantics
CN112992128B (en) * 2021-02-04 2023-06-06 北京淇瑀信息科技有限公司 Training method, device and system of intelligent voice robot

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5937385A (en) * 1997-10-20 1999-08-10 International Business Machines Corporation Method and apparatus for creating speech recognition grammars constrained by counter examples

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5806032A (en) * 1996-06-14 1998-09-08 Lucent Technologies Inc. Compilation of weighted finite-state transducers from decision trees
CA2376277C (en) * 1999-06-11 2011-03-15 Telstra New Wave Pty Ltd A method of developing an interactive system

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5937385A (en) * 1997-10-20 1999-08-10 International Business Machines Corporation Method and apparatus for creating speech recognition grammars constrained by counter examples

Also Published As

Publication number Publication date
GB0110532D0 (en) 2001-06-20
GB2391993B (en) 2005-04-06
GB2391993A (en) 2004-02-18
GB0326763D0 (en) 2003-12-24
WO2002089113A1 (en) 2002-11-07
GB2375210B (en) 2005-03-23

Similar Documents

Publication Publication Date Title
EP1602102B1 (en) Management of conversations
US8000973B2 (en) Management of conversations
AU2019376649B2 (en) Semantic artificial intelligence agent
US10623572B1 (en) Semantic CRM transcripts from mobile communications sessions
US7869998B1 (en) Voice-enabled dialog system
US8645122B1 (en) Method of handling frequently asked questions in a natural language dialog service
US20050033582A1 (en) Spoken language interface
US20210350384A1 (en) Assistance for customer service agents
CN116235245A (en) Improving speech recognition transcription
WO2002089112A1 (en) Adaptive learning of language models for speech recognition
GB2375210A (en) Grammar coverage tool for spoken language interface
Di Fabbrizio et al. AT&t help desk.
AT&T Microsoft PowerPoint - MIQA-workshop-20100325-ver02.pptx
Wilpon et al. The business of speech technologies

Legal Events

Date Code Title Description
PCNP Patent ceased through non-payment of renewal fee

Effective date: 20060430