WO2002089113A1 - System for generating the grammar of a spoken dialogue system - Google Patents

System for generating the grammar of a spoken dialogue system

Info

Publication number
WO2002089113A1
Authority
WO
WIPO (PCT)
Prior art keywords
grammar
sentences
rules
sentence
language components
Application number
PCT/GB2002/001962
Other languages
French (fr)
Inventor
David Horowitz
Peter Phelan
Original Assignee
Vox Generation Limited
Application filed by Vox Generation Limited
Priority to GB0326763A (GB2391993B)
Publication of WO2002089113A1

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/18 - Speech classification or search using natural language modelling
    • G10L15/183 - Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L15/19 - Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules
    • G10L15/193 - Formal grammars, e.g. finite state automata, context-free grammars or word networks

Definitions

  • This invention relates to spoken language interfaces which enable a user to interact using voice with a computer system. It is more specifically concerned with the writing of grammars for use in such interfaces.
  • A spoken language interface involves a two-way dialogue between the user and the computer system.
  • An automatic speech recognition system is used to comprehend the speech delivered by the user and an automatic speech generation system is used to generate speech to be played out to the user. This may be, for example, a mixture of speech synthesis and recorded voice.
  • Spoken language interfaces are well known. One example is described in our earlier application GB 01 05 005.3 filed on 28 February 2001.
  • Spoken language interfaces rely on grammars to interpret users' commands and formulate responses.
  • The grammar defines the sequences of words which it is possible for the interface to recognise.
  • A grammar is a set of rules which defines a set of sentences. The rules are typically expressed in some kind of algebraic notation. The set of grammar rules usually defines a much larger set of sentences; few rules cover many sentences. Size constraints apply to these grammars. Automatic speech recognisers can only recognise with high accuracy for grammars of a limited size. This means that there is a very strong motivation to remove any useless expressions from a grammar.
  • In any computer-related application, there will be a limited number of expressions that a user is likely to say. If one considers an on-line flight reservation service, it can easily be seen that it is possible to predict the majority of expressions that a user is likely to use. It can further be seen that a grammar is specific to a given application. In other words, for each application, a new grammar has to be written for the Spoken Language Interface (SLI).
  • When constructing a grammar, some sentences may be preferable over others, according to one or more criteria. For example, expressions which are grammatically correct (in the old-fashioned sense of the term) might be preferred over those which are not. Alternatively, expressions which users use frequently, based on significant volumes of user data, might be preferred over those which occur less often. A particular SLI provider might develop a consistent style across applications and favour expressions in that style.
  • In the ideal case, a grammar should cover all and only the expressions users are likely to say. At present, grammar writing is a time-consuming operation involving the expertise of grammar writers to encapsulate manually the preferred sentences within each grammar, and requiring painstaking care. This makes it hard to bring new applications to market quickly and also very expensive.
  • We have appreciated that there is a need for a tool which can assist in the process of grammar writing. The invention aims to provide such a tool. The invention is defined by the independent claims, to which reference should be made. Preferred features are set out in the dependent claims.
  • According to a first aspect of the invention, there is provided a grammar formulation mechanism for assisting a grammar writer when formulating a new grammar.
  • The grammar formulation mechanism is operable to apply an adaptive learning algorithm to a predetermined set of language components to determine a set of grammar rules and to apply the set of grammar rules to the new grammar.
  • The grammar formulation mechanism derives the grammar rules from the set of language components.
  • Language components may represent any number of audible sounds, words, sentences, phrases, noises, utterances etc., and may be tagged so as to identify one or more of them during the adaptive learning process.
  • The language components may be, for example, a scored set of sentences (e.g. composed of a sequence of words represented electronically) that give examples of sentences that are good and/or bad and/or intermediate (i.e. somewhere in between good and bad), which the grammar formulation mechanism learns and can subsequently apply to other language components and/or grammars, either as they are generated or after their generation.
  • The preference criteria, i.e. good/bad/intermediate, are applied consistently to the set of language components during the adaptive learning process. In one example, the preference criteria are based on frequency of use of language components.
  • The first aspect of the invention enables the grammar formulation mechanism to monitor new grammars as they are written, and propose alternative, usually better, grammars to a grammar writer.
  • The grammar formulation mechanism may be configured automatically to re-write new grammars where better alternatives are found. In this way, the writing of new grammars is greatly facilitated, even if the grammar writer is not experienced in the art of writing new grammars.
  • The grammar formulation mechanism may use an adaptive learning algorithm that is trained on a training set of language components obtained from the predetermined set of language components. By using a training set that may be varied, the grammar formulation mechanism can be trained to formulate grammar rules in dependence upon the application and/or system for which a new set of grammars is being formulated. This allows the grammar formulation mechanism, for example, to be trained, or re-trained, to adapt where it is to be used in different geographic regions having different language/dialogue requirements.
  • Grammar rules may be tested on a test set of language components obtained from the predetermined set of language components so that their efficacy can be evaluated.
  • The adaptive learning algorithm may use an inductive classification scheme, such as, for example, that used by the ID3 algorithm, to classify the language components during a training phase.
  • This has the advantage of providing a compact set of grammar rules.
  • A compact set of grammar rules can lead to quicker processing in a spoken language interface.
  • An information-based heuristic may also be used by the adaptive learning algorithm to select the language components for classification during the training phase.
  • Use of an information-based heuristic such as, for example, one based on Shannon's theory, allows for improved selection of the language components for classification during the training phase, and this in turn can lead to a compact and efficient set of grammar rules.
  • The predetermined set of language components may form part of at least one sentence pre-identified as being good and/or bad and/or intermediate.
  • Such a predetermined set of language components may form a training set for an expert system written by a grammar expert.
  • The language components may have associated identifiers that indicate the grammar expert's opinion of their applicability. Where the identifiers identify language components as being intermediate, they may be selected according to a hierarchical ranking, e.g. those closest to being "good" being selected in preference to any others.
  • Grammar rules derived from intermediate language components may themselves be scored and retained or rejected based upon the scores. This feature allows a fixed-size set of grammar rules to be produced. Such a fixed-size set of grammar rules may possibly contain one or more less-preferred grammar rules, but this feature makes optimal use of any storage available for such grammar rules, as it enables all the available storage space to be filled with grammar rules.
  • The grammar formulation mechanism may be implemented using one or more of software, hardware and firmware. It may be provided as a software module that can act as a plug-in to other software to impart the functionality of the grammar formulation mechanism to that other software.
  • The grammar formulation mechanism may be provided as a software module on a carrier medium to a user, or may be supplied as a component of a larger software package, such as, for example, as part of a grammar coverage tool as herein described.
  • According to a second aspect of the invention, there is provided a grammar coverage tool comprising a grammar formulation mechanism as herein described.
  • The grammar coverage tool may be a software tool that provides an approval mechanism operable to propose one or more grammar rules to a grammar writer during the formulation of the new grammar. This allows the grammar coverage tool to suggest grammars to the grammar writer as new grammars are being written, providing the grammar writer with an interactive tool that aids him/her in the task of writing a new grammar. Further, the grammar coverage tool may automatically modify any new grammar(s) as they are being written if, for example, a poor grammar rule written by the grammar writer is detected.
  • Examples of carrier media on which a grammar coverage tool and/or a grammar formulation mechanism may be supplied include at least one of the following set of media: a radio-frequency signal, an optical signal, an electronic signal, a magnetic disc or tape, solid-state memory, an optical disc, a magneto-optical disc, a compact disc and a digital versatile disc.
  • According to a third aspect of the invention, there is provided a computer system configured to provide a grammar formulation mechanism and/or grammar coverage tool according to any of the aspects of the invention herein described.
  • According to a fourth aspect of the invention, there is provided a method for assisting a grammar writer when formulating a new grammar, comprising applying an adaptive learning algorithm to a predetermined set of language components to determine a set of grammar rules, said predetermined set of language components comprising at least one sentence composed of at least one further language component, at least one said sentence being pre-identified as being good, bad and/or intermediate, and applying the set of grammar rules to the new grammar to determine whether the new grammar is good or bad.
  • The sentence(s) that form the predetermined set of language components may comprise a logical group of one or more of the following: audible sounds, words, conventional sentences in any language, phrases, noises, utterances etc.
  • The method may additionally comprise method steps corresponding to any component of any of the other aspects of the invention.
  • The method may comprise one or more of: a) training the adaptive learning algorithm on a training set of language components obtained from the predetermined set of language components; b) testing the grammar rules on a test set of language components obtained from the predetermined set of language components; c) classifying the language components during a training phase by applying an inductive classification scheme; and d) selecting the language components for classification during a training phase according to an information-based heuristic measurement. A sketch of how these steps fit together is given below.
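  • As an illustration only (not taken from the patent; the function names and parameters are hypothetical placeholders), the following Python sketch shows how steps a) to d) might fit together: the scored sentences are split into training and test sets, rules are induced from the training set, and their efficacy is measured on the test set.

    import random

    def split_corpus(scored_sentences, train_fraction=0.8, seed=0):
        # a) obtain a training set and a test set from the predetermined set
        rng = random.Random(seed)
        shuffled = list(scored_sentences)
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_fraction)
        return shuffled[:cut], shuffled[cut:]

    def formulate_and_test(scored_sentences, induce_rules, classify_sentence):
        # induce_rules is assumed to perform c) and d); the loop performs b)
        train, test = split_corpus(scored_sentences)
        rules = induce_rules(train)
        errors = sum(1 for sentence, label in test
                     if classify_sentence(rules, sentence) != label)
        return rules, errors / max(len(test), 1)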
  • A grammar coverage tool embodying the invention has the advantage of enabling automatic elimination of unfavoured sentences, on the basis of one or more preference criteria.
  • Unfavoured sentences include those generated by incorrectly written grammar rules, those which are syntactically undesirable, and those which fail to match the preference criteria.
  • For example, a grammar writer may write a rule that is incorrectly structured.
  • A Grammar Coverage Tool embodying the invention is an automatic debugging tool that saves a significant amount of time in the grammar writing process.
  • The Grammar Coverage Tool embodying the invention is also a design enhancement tool that has the advantage of further reducing the time it takes the grammar writer to design a well-written grammar.
  • Figure 1 is a schematic overview of a Spoken Language Interface;
  • Figure 2 is a logical model of the interface architecture;
  • Figure 3 is a more detailed view of the interface architecture;
  • Figure 4 is a flow chart illustrating steps in a process embodying the present invention;
  • Figure 5 is a table showing a matrix of vectors representing tag sequences;
  • Figure 6 shows a decision tree generated from the matrix of Figure 5;
  • Figure 7 shows a vector matrix for a second example;
  • Figure 8 shows a computer system that can be used to implement aspects of the invention.
  • The system schematically outlined in Figure 1 is a spoken language interface intended for communication with applications via mobile, satellite, or landline telephone.
  • Communication is via a mobile telephone 18, but any other voice telecommunications device such as a conventional telephone can be utilised.
  • Calls to the system are handled by a telephony unit 20.
  • Connected to the telephony unit are a Voice Controller 19, an Automatic Speech Recognition (ASR) system 22 and an Automatic Speech Generation (ASG) system 26.
  • The ASR 22 and ASG 26 systems are each connected to the voice controller 19.
  • A dialogue manager 24 is connected to the voice controller 19 and also to a spoken language interface (SLI) repository 30, a personalisation and adaptive learning unit 32, which is also attached to the SLI repository 30, and a session and notification manager 28.
  • The Dialogue Manager is also connected to a plurality of Application Managers (AM) 34, each of which is connected to an application, which may be content provision external to the system.
  • The content layer includes e-mail, news, travel, information, diary, banking etc.
  • The nature of the content provided is not important to the principles of the invention.
  • The SLI repository is also connected to a development suite 35.
  • Figure 2 provides a more detailed overview of the architecture of the system.
  • The automatic speech generation unit 26 of Figure 1 includes a basic text-to-speech (TTS) unit and a batch TTS unit 120, connected to a prompt cache 124 and an audio player 122.
  • Pre-recorded speech may be played to the user under the control of the voice control 19. In the embodiment illustrated, a mixture of pre-recorded voice and TTS is used.
  • The system then comprises three levels: session level 120, application level 122 and non-application level 124.
  • The session level comprises a location manager 126 and a dialogue manager 128.
  • The session level also includes an interactive device control 130 and a session manager 132, which includes the functions of user identification and Help Desk.
  • The application layer comprises the application framework 134, under which an application manager controls an application. Many application managers and applications will be provided, such as UMS (Unified Messaging Service), call connect & conferencing, e-Commerce, dictation etc.
  • The non-application level 124 comprises a back office subsystem 140 which includes functions such as reporting, billing, account management, system administration, "push" advertising and current user profile.
  • A transaction subsystem 142 includes a transaction log together with a transaction monitor and message broker.
  • An activity log 144 and a user profile repository 146 communicate with an adaptive learning unit 148.
  • The adaptive learning unit also communicates with the dialogue manager 128.
  • A personalisation module 150 also communicates with the user profile repository 146 and the dialogue manager 128.
  • The voice controller allows the system to be independent of the ASR 22 and TTS 26 by providing an interface to either proprietary or non-proprietary speech recognition, text-to-speech and telephony components.
  • The TTS may be replaced by, or supplemented by, recorded voice.
  • The voice control also provides for logging and assessing call quality. The voice control will optimise the performance of the ASR.
  • The voice engine is effectively dumb, as all control comes from the dialogue manager via the voice control.
  • The dialogue manager controls the dialogue across multiple voice servers and other interactive servers (e.g. WAP, Web etc.).
  • As well as controlling dialogue flow, it controls the steps required for a user to complete a task through mixed initiative, by permitting the user to change initiative with respect to specifying a data element (e.g. destination city for travel).
  • The Dialogue Manager may support comprehensive mixed initiative, allowing the user to change the topic of conversation across multiple applications while maintaining state representations of where the user left off in the many domain-specific conversations. Currently, as initiative is changed across two applications, the state of conversation is maintained. Within the system, the dialogue manager controls the workflow.
  • The adaptive learning agent collects user speaking data from call data records. This data, collected from a large domain of calls (thousands), provides the general profile of language usage across the population of speakers. This profile, or mean language model, forms a basis for the first step in adjusting the language model probabilities to improve ASR accuracy.
  • The individual user's profile is generated and adaptively tuned across the user's subsequent calls.
  • The dialogue manager includes a personalisation engine. Given the user demographics (age, sex, dialect), a specific personality tuned to user characteristics for that user's demographic group is invoked.
  • The dialogue manager also allows dialogue structures and applications to be updated or added without shutting the system down. It enables users to move easily between contexts, for example from flight booking to calendar etc., hang up and resume conversation at any point, specify information either step-by-step or in one complex sentence, cut in and direct the conversation, or pause the conversation temporarily.
  • The telephony component includes the physical telephony interface and the software API that controls it.
  • The physical interface controls inbound and outbound calls, handles conferencing, and provides other telephony-related functionality.
  • The Session Manager initiates and maintains user and application sessions. These are persistent in the event of a voluntary or involuntary disconnection. It can re-instate the call at the position it had reached in the system at any time within a given period, for example 24 hours.
  • A major problem in achieving this level of session storage and retrieval relates to retrieving a session in which a conversation is stored after either a dialogue structure, workflow structure or application manager has been upgraded. In the preferred embodiment this problem is overcome through versioning of dialogue structures, workflow structures and application managers. The system maintains a count of active sessions for each version and only retires old versions once the version's count reaches zero.
  • An alternative which may be implemented requires new versions of dialogue structures, workflow structures and application managers to supply upgrade agents. These agents are invoked by the session manager whenever it encounters old versions in the stored session. A log is kept by the system of the most recent version number. It may be beneficial to implement a combination of these solutions: the former for dialogue structures and workflow structures, and the latter for application managers. A sketch of the version-counting scheme follows.
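  • A minimal hypothetical sketch (for illustration only; class and method names are not from the patent) of the version-counting scheme described above, in Python: each version carries a count of active sessions, and an old version is retired only once its count reaches zero.

    class VersionRegistry:
        def __init__(self):
            self.active_sessions = {}   # version id -> count of active sessions
            self.latest = None          # log of the most recent version number

        def register(self, version):
            self.active_sessions.setdefault(version, 0)
            self.latest = version

        def open_session(self, version):
            self.active_sessions[version] += 1

        def close_session(self, version):
            self.active_sessions[version] -= 1
            if version != self.latest and self.active_sessions[version] == 0:
                # no stored session references this old version: retire it
                del self.active_sessions[version]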
  • The notification manager brings events to a user's attention, such as the movement of a share price by a predefined margin. This can be accomplished while the user is online, through interaction with the dialogue manager, or offline. Offline notification is achieved either by the system calling the user and initiating an online session or through other media channels, for example SMS, pager, fax, email or another device.
  • Each application manager (there is one for every content supplier) exposes a set of functions to the dialogue manager to allow business transactions to be realised (e.g. GetEmail(), SendEmail(), BookFlight(), GetNewsItem(), etc.).
  • Functions require the DM to pass the complete set of parameters required to complete the transaction.
  • The AM returns the successful result or an error code to be handled in a predetermined fashion by the DM.
  • An AM is also responsible for handling some stateful information. For example, User A has been passed the first 5 unread emails. Additionally, it stores information relevant to a current user task. For example, flight booking details. It is able to facilitate user access to secure systems, such as banking, email or other. It can also deal with offline events, such as email arriving while a user is offline or notification from a flight reservation system that a booking has been confirmed. In these instances the AM's role is to pass the information to the Notification Manager.
  • An AM also exposes functions to other devices or channels, such as web, WAP, etc. This facilitates the multi-channel conversation discussed earlier.
  • AMs are able to communicate with each other to facilitate aggregation of tasks. For example, booking a flight primarily would involve a flight booking AM, but this would directly utilise a Calendar AM in order to enter flight times into a user's Calendar.
  • Because AMs are discrete components built, for example, as Enterprise JavaBeans (EJBs), they can be added or updated while the system is live. An illustrative sketch of an AM's exposed interface is given below.
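  • Purely as an illustrative sketch (the patent builds AMs as EJBs in Java; the class, method and service names below are hypothetical), an AM exposing email functions to the DM might look as follows. Note the stateful cursor recording how many unread emails have already been passed to the user.

    class EmailApplicationManager:
        def __init__(self, mail_service):
            self.mail_service = mail_service   # hypothetical back-end service
            self.cursor = {}                   # per-user stateful information

        def get_email(self, user_id, batch_size=5):
            # return the next batch of unread emails for this user
            start = self.cursor.get(user_id, 0)
            emails = self.mail_service.fetch_unread(user_id, start, batch_size)
            self.cursor[user_id] = start + len(emails)
            return emails

        def send_email(self, user_id, message):
            # return the successful result or an error code for the DM to handle
            try:
                return self.mail_service.send(user_id, message)
            except Exception:
                return "ERR_SEND_FAILED"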
  • The Transaction and Message Broker records every logical transaction, identifies revenue-generating transactions, routes messages and facilitates system recovery.
  • Help Assistant & Interactive Training
  • The Help Assistant & Interactive Training component allows users to receive real-time interactive assistance and training.
  • The component provides for simultaneous, multi-channel conversation (i.e. the user can talk through a voice interface and at the same time see a visual representation of their interaction through another device, such as the web).
  • The system uses a commercially available database such as Oracle 8i from Oracle Corp.
  • The Central Directory stores information on users, available applications, available devices, locations of servers and other directory-type information.
  • The System Administration - Applications component provides centralised, web-based functionality to administer the custom-built components of the system (e.g. Application Managers, Content Negotiators, etc.).
  • The development suite provides an environment for building spoken language systems incorporating dialogue and prompt design, workflow and business process design, version control and system testing. It is also used to manage deployment of system updates and versioning.
  • Rather than having to laboriously code likely occurring user responses in a cumbersome grammar (e.g. a BNF (Backus-Naur Form) grammar), resulting in time-consuming detailed syntactic specification, the development suite provides an intuitive, hierarchical, graphical display of language, reducing the modelling act to creatively uncovering the precise utterance and the coding act to the simple entry of a data string.
  • The development suite provides a Rapid Application Development (RAD) tool that combines language modelling with business process design (workflow).
  • The grammar coverage tool (GCT) is embodied in the development suite. After a rule is entered into the RAD system, the GCT is invoked. It tests the efficiency of the rule (i.e. that it does not generate any garbage utterances, which a user would not say, due to incorrect syntax) and also evaluates whether the rule generates the necessary linguistic coverage adopted by the house style or conventional wisdom about the way in which users phrase responses.
  • The Dialogue Subsystem manages, controls and provides the interface for human dialogue via speech and sound. Referring to Figure 1, it includes the dialogue manager, spoken language interface repository, session and notification managers, the voice controller 19, the Automatic Speech Recognition unit 22, the Automatic Speech Generation unit 26 and telephony components 20. The subsystem is illustrated in the more detailed architecture of the interface shown in Figure 3.
  • An SLI refers to the hardware, software and data components that allow users to interact with a computer through spoken language.
  • The term "interface" is particularly apt in the context of voice interaction, since the SLI acts as a conversational mediator, allowing information to be exchanged between user and system via speech. In its idealised form, this interface would be "invisible" and the interaction would, from the user's standpoint, appear as seamless and natural as a conversation with another person.
  • One principal aim of most SLI projects is to create a system that is as near as possible to a human-human conversation. If the exchange between user and machine is construed as a dialogue, the objective for the SLI development team is to create the ears, mind and voice of the machine.
  • The ears of the system are created by the Automatic Speech Recognition (ASR) system 22.
  • The voice is created via the Automatic Speech Generation (ASG) software 26, and the mind is made up of the computational power of the hardware and the databases of information contained in the system.
  • The present system uses software developed by other companies for its ASR and ASG. Suitable systems are available from Nuance and Lernout & Hauspie respectively. These systems will not be described further. However, it should be noted that the system allows great flexibility in the selection of these components from different vendors.
  • The basic Text To Speech unit supplied, for example, by Lernout & Hauspie may be supplemented by an audio subsystem which facilitates batch recording of TTS (to reduce system latency and CPU requirements), streaming of audio data from other sources (e.g. music, audio news, etc.) and playing of audio output from standard digital audio file formats.
  • A voice controller 19 and the dialogue manager 24 control and manage the dialogue between the system and the end user.
  • The dialogue is dynamically generated at run time from an SLI repository which is managed by a separate component, the development suite.
  • The ASR unit 22 comprises a plurality of ASR servers.
  • The ASG unit 26 comprises a plurality of speech servers. Both are managed and controlled by the voice controller.
  • The telephony unit 20 comprises a number of telephony board servers and communicates with the voice controller, the ASR servers and the ASG servers.
  • Calls from users, shown as mobile phone 18, are handled initially by the telephony server 20, which makes contact with a free voice controller.
  • The voice controller contacts and locates an available ASR resource.
  • It is the voice controller 19 which identifies the relevant ASR and ASG ports to the telephony server.
  • The telephony server can now stream voice data from the user to the ASR server, and the ASG can stream audio to the telephony server.
  • The voice controller, having established contact with the ASR and ASG servers, now informs the Dialogue Manager, which requests a session on behalf of the user from the session manager. As a security precaution, the user is required to provide authentication information before this step can take place. This request is made to the session manager 28, which is represented logically at 132 in the session layer in Figure 2.
  • The session manager server 28 checks with a dropped session store (not shown) whether the user has a recently dropped session.
  • A dropped session could be caused by, for example, a user on a mobile entering a tunnel. This facility enables the user to be reconnected to a session without having to start over again.
  • The dialogue manager 24 communicates with the application managers 34, which in turn communicate with the internal/external services or applications to which the user has access.
  • The application managers each communicate with a business transaction log 50, which records transactions, and with the notification manager 28b. Communications from the application managers to the notification manager are asynchronous, and communications from the notification manager to the application managers are synchronous.
  • The notification manager also sends communications asynchronously to the dialogue manager 24.
  • The dialogue manager 24 has a synchronous link with the session manager 28a, which has a synchronous link with the notification manager.
  • The dialogue manager 24 communicates with the adaptive learning unit 33 via an event log 52, which records user activity so that the system can learn from the user's interaction. This log also provides a series of debugging and reporting information.
  • The adaptive learning unit is connected to the personalisation module 54, which is in turn connected to the dialogue manager.
  • Workflow 56, Dialogue 58 and Personalisation 60 repositories are also connected to the dialogue manager 24 through the personalisation module 54, so that a personalised view is always handled by the dialogue manager 24. These three repositories make up the SLI Repository referred to earlier.
  • The personalisation module can also write to the personalisation repository 60.
  • The Development Suite 35 is connected to the workflow and dialogue repositories 56, 58 and implements functional specifications of applications, storing the relevant grammars, dialogues, workflow and application manager function references for each application in the repositories. It also facilitates the design and implementation of system, help, navigation and misrecognition grammars, dialogues, workflow and action references in the same repositories.
  • The dialogue manager 24 provides the following key areas of functionality: the dynamic management of task-oriented conversation and dialogue; the management of synchronous conversations across multiple formats; and the management of resources within the dialogue subsystem. Each of these will now be considered in turn.
  • The conversation a user has with the system is determined by a set of dialogue and workflow structures, typically one set for each application.
  • The structures store the speech to which the user listens, the keywords for which the ASR listens and the steps required to complete a task.
  • The DM determines its next contribution to the conversation or action to be carried out by the AMs.
  • The system allows the user to move between applications or contexts using either hotword or natural language navigation.
  • The complex issues relating to managing state as the user moves from one application to the next, or even between multiple instances of the same application, are handled by the DM.
  • This state management allows users to leave an application and return to it at the same point as when they left.
  • This functionality is extended by another component, the session manager, to allow users to leave the system entirely and return to the same point in an application when they log back in.
  • The dialogue manager communicates via the voice controller with both the speech engine (ASG) 26 and the voice recognition engine (ASR) 22.
  • The output from the speech generator 26 is voice data from the dialogue structures, which is played back to the user either as dynamic text-to-speech, as a pre-recorded voice or in another stored audio format.
  • The ASR listens for keywords or phrases that the user might say.
  • In prior art systems, the dialogue structures are predetermined: dialogue structures or grammars are statically generated when the system is inactive. This is acceptable in such systems, as scripts tended to be simple and did not change often once a system was activated. However, in the present system, the dialogue structures can be complex and may be modified frequently while the system is active. To cope with this, the dialogue structure is stored as data in a run-time repository, together with the mappings between recognised conversation points and application functionality. The repository is dynamically accessed and modified by multiple sources, even when active users are on-line.
  • The dialogue subsystem comprises a plurality of voice controllers 19 and dialogue managers 24 (shown as a single server in Figure 3).
  • The ability to update the dialogue and workflow structures dynamically greatly increases the flexibility of the system. In particular, it allows updates of the voice interface and applications without taking the system down, and provides for adaptive learning functionality which enriches the voice experience as the system becomes more responsive and friendly to a user's particular syntax and phraseology over time. Considering each of these two aspects in more detail:
  • Spoken conversational language reflects quite a bit of a user's psychology, socio-economic background, dialect and speech style.
  • One reason an SLI is a challenge is due to these confounding factors.
  • The solution this system provides to this challenge is a method of modelling these features and then tuning the system to effectively listen out for the most likely occurring features: Adaptive Learning. Without discussing in detail the complexity of encoding this knowledge, suffice it to say that a very large vocabulary of phrases encompassing all dialects and speech styles (verbose, terse or declarative) results in a complex listening test for any ASR.
  • User profiling solves the problem of recognition accuracy by tuning the recogniser to listen out for only the likely occurring subset of utterances in a large domain of options.
  • The adaptive learning technique is a stochastic process which first models which types, dialects and styles the entire base of users employ.
  • A profile is created by counting the language most utilised across the population and profiling less likely occurrences. Indeed, the less likely occurring utterances, or those that do not get used at all, can be deleted to improve accuracy. But then a new user, who might employ the deleted, not yet observed, phrase, could come along; he would have a dissatisfying experience, and a system tuned for the average user would not work well for him.
  • A more powerful technique is to profile individual user preferences early on in the transaction, and simply amplify those sets of utterances over those utterances less likely to be employed.
  • The general data of the masses is used initially to set the tuning parameters; during a new phone call, individual stylistic cues, such as phrase usage, are monitored and the model is immediately adapted to suit that caller. Admittedly, those who use the least likely utterances across the mass may initially be asked to repeat what they have said, after which the cue re-assigns the probabilities for the entire vocabulary. The approach, then, embodies statistical modelling across an entire population of users. The stochastic nature of the approach occurs when new observations are made across the average mass, and language modelling weights are adaptively assigned to tune the recogniser. A simple sketch of this re-weighting follows.
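  • A simple hypothetical sketch of this re-weighting in Python (the boost factor and the phrase-probability representation are assumptions, not taken from the patent): probabilities start from the population-wide mean language model and are amplified towards the phrases the individual caller actually uses.

    def adapt_weights(mean_model, observed_phrases, boost=2.0):
        # mean_model: phrase -> probability across the whole user population
        weights = dict(mean_model)
        for phrase in observed_phrases:
            if phrase in weights:
                weights[phrase] *= boost   # amplify the caller's likely utterances
        total = sum(weights.values())
        return {phrase: w / total for phrase, w in weights.items()}   # renormalise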
  • The grammar coverage tool will now be described. It is first useful to define terms used in the generation of grammars:
  • Tag: a label applied to a class of words which play a similar role in terms of syntax.
  • Word class: a set of words which can play a common syntactic role.
  • Grammar: a set of algebraic rules which define a set of sentences.
  • The rules can be expressed in terms of the tags which are used to label word classes.
  • Lexicon: a list of words where each is followed by its tag. In certain cases, a word may have more than one tag in the lexicon, which can make the job of reliably applying tags ambiguous.
  • Coverage: the set of sentences defined by a set of grammar rules.
  • The set thus 'covers' the expressions a user may have recognised by the system.
  • Decision tree: a notation for representing logical rules as a hierarchical structure of nodes and branches.
  • Training set: a set of classified items to be presented to a learning process.
  • Test set: a set of classified items used to evaluate the performance of a trained process.
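  • The following toy Python fragment (invented words and tags, for illustration only) shows a lexicon in which one word carries two tags, and a naive tagger mapping a sentence to its tag sequence.

    LEXICON = {
        "change":  ["VERB"],
        "my":      ["POSS"],
        "flight":  ["NOUN"],
        "booking": ["NOUN", "VERB"],   # two tags: applying tags can be ambiguous
    }

    def tag_sentence(sentence):
        # naive tagger: take the first listed tag for each word
        return [LEXICON[word][0] for word in sentence.lower().split()]

    print(tag_sentence("Change my flight"))   # ['VERB', 'POSS', 'NOUN']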
  • Every sentence in every language has a structure. This structure can be described in more or less detail.
  • The coverage tool to be described applies to a class of sentences which can be described as ordered sequences of tags.
  • A tag is simply a label which refers to a class of components.
  • A component is generally an individual word. For example, consider the following simple example, which relates to an application for purchasing airline tickets:
  • Sentences 6-8 may be quite natural; for example, if the grammar rule is supposed to cover what someone might say at an airline reservation desk after having been asked, "Which of your details do you want to change?" Sentences 3 to 5 are acceptable as natural in most, if not all, circumstances. Sentences 1 and 2 are 'nonsense' or 'garbage' utterances. We explain how the tool can propose new grammar rules which will exclude such expressions.
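  • The patent's numbered example sentences 1-8 are not reproduced in this text; the following toy expansion (invented rule and word classes) illustrates the same point: one over-general rule covers both natural sentences and garbage utterances.

    from itertools import product

    WORDCLASSES = {
        "VERB": ["change", "cancel"],
        "DET":  ["my", "the"],
        "NOUN": ["flight", "destination", "me"],   # 'me' mis-filed as a noun
    }
    RULE = ["VERB", "DET", "NOUN"]   # intended to cover e.g. "change my flight"

    for words in product(*(WORDCLASSES[tag] for tag in RULE)):
        print(" ".join(words))   # also yields garbage such as "cancel the me"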
  • Preference Criteria
  • The collection of sentences listed above comes from expanding the grammar rule. As discussed above, some sentences may be preferred to others, based on one or more preference criteria. It does not matter what the content of such a criterion is, provided that it can be applied consistently. Clearly there is no argument about a frequency-of-usage criterion: no sentence can be both frequently occurring and infrequently occurring. More subjective measures, such as 'intuitive acceptability', may be less consistent. Significant inconsistency in the training data will be disastrous.
  • Possible preference criteria may include: grammatical correctness, frequency of use based on volumes of user data, and consistency with a house style.
  • Defining the preference criterion or criteria to classify each member of a set of sentences is the first step towards improving the coverage defined by the grammar.
  • Tags may be applied in an adaptive way to deal with various topic-specific grammars, like news and/or financially-orientated grammars.
  • A single tag may be applied to a group of words as a unit. For example, clitics such as "I'm" and "gonna", as well as fully formed groups such as "I am" and "going to", may be tagged using respective tags.
  • Tagged parts of speech need not be purely linguistic elements and could in addition, or alternatively, be tagged according to semantic content.
  • The set should reflect as many distinct kinds of preferable sentence as possible. This part of the process may be performed manually.
  • 3. Use a computer to run an algorithm which can induce a decision tree to classify the sentences of the training set into positive and negative examples, on the basis of the sequence of tags which describes each sentence (160) .
  • Standard algorithms, such as the ID3 algorithm described below, or its variant C4.5, which is robust with respect to incomplete data, can be used.
  • The ID3 algorithm is described in the work of J.R. Quinlan.
  • A set of rules constituting a new grammar can be easily obtained from the decision tree. These new rules can be offered to a (human) grammar writer, or they may even be deployed directly.
  • The new rules define a grammar of higher quality, because they include fewer garbage utterances.
  • Inducing a Decision Tree
  • In step 3, the algorithm begins with a training set of good and bad examples of sentences which have previously been given appropriate tag-sequence descriptions.
  • The ID3 algorithm is used to build a decision tree, given a set of non-categorical attributes C1, C2, ..., Cn, the categorical attribute C, and a training set T of records.
  • function ID3(R: a set of non-categorical attributes, C: the categorical attribute, S: a training set) returns a decision tree;
    begin
      If S consists of records all with the same value for the categorical attribute, return a single node with that value;
      If R is empty, then return a single node with as value the most frequent of the values of the categorical attribute that are found in records of S [note that then there will be errors, that is, records that will be improperly classified];
      Let D be the attribute with largest Gain(D,S) among attributes in R;
      Let {dj | j = 1, 2, ..., m} be the values of attribute D;
      Let {Sj | j = 1, 2, ..., m} be the subsets of S consisting respectively of records with value dj for attribute D;
      Return a tree with root labelled D and arcs labelled d1, d2, ..., dm going respectively to the trees ID3(R - {D}, C, S1), ID3(R - {D}, C, S2), ..., ID3(R - {D}, C, Sm);
    end
  • A 1 indicates that the tag is present in that example, a 0 that it is not.
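  • A sketch of this encoding in Python (tag names invented): every tag occurring in the training set becomes an attribute, and each example becomes a binary vector over those attributes, paired with its positive/negative label.

    def to_vectors(tagged_sentences):
        # tagged_sentences: list of (tag_sequence, label) pairs
        attributes = sorted({t for tags, _ in tagged_sentences for t in tags})
        matrix = [([1 if a in tags else 0 for a in attributes], label)
                  for tags, label in tagged_sentences]
        return attributes, matrix

    attributes, matrix = to_vectors([(["VERB", "DET", "NOUN"], "good"),
                                     (["VERB", "DET"], "bad")])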
  • The classification tree is then made as follows: 1) If, at a given node, all the examples are in the same class (all positive or all negative), then label the node as a leaf and terminate. 2) If there are both positive and negative examples, then select the most informative feature (in our case, one of the tags) and partition into branches according to the value of that tag (i.e. present, or null). 3) Use Shannon's information heuristic to identify the probable most informative feature: this is given by -Σ p log p for each feature, where p is the probability of a particular value occurring in the feature slot across all the examples at this node. 4) Keep iterating the tree from step 1 until all bottom-level nodes are leaf nodes. The decision tree is shown in Figure 6. A compact sketch of this induction is given below.
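  • A compact ID3-style induction over such binary vectors, following steps 1 to 4 above (a hedged sketch, not the patent's implementation): the entropy -Σ p log p selects the most informative tag at each node.

    import math
    from collections import Counter

    def entropy(labels):
        counts = Counter(labels)
        return -sum((c / len(labels)) * math.log2(c / len(labels))
                    for c in counts.values())

    def build_tree(rows, attributes):
        # rows: list of (binary vector, label); attributes: one tag name per slot
        labels = [label for _, label in rows]
        if len(set(labels)) == 1:          # step 1: pure node becomes a leaf
            return labels[0]
        if not attributes:
            return Counter(labels).most_common(1)[0][0]
        def remainder(i):                  # step 3: weighted entropy after split
            split = {0: [], 1: []}
            for vec, label in rows:
                split[vec[i]].append(label)
            return sum(len(s) / len(rows) * entropy(s)
                       for s in split.values() if s)
        best = min(range(len(attributes)), key=remainder)
        branches = {}
        for value in (0, 1):               # step 2: partition on the chosen tag
            subset = [(v[:best] + v[best + 1:], label)
                      for v, label in rows if v[best] == value]
            sub_attrs = attributes[:best] + attributes[best + 1:]
            branches[value] = (build_tree(subset, sub_attrs) if subset
                               else Counter(labels).most_common(1)[0][0])
        return (attributes[best], branches)   # step 4: iterate until all leaves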
  • Each path through the tree corresponds to a rule.
  • The ID3 algorithm produces the following set of rules:
  • The matrix for this example is shown in Figure 7. In the example of Figure 7, the induced rules give 100% correct classification on the test set (i.e. a 0% error rate).
  • The induced rules identify which sentences are good, and which are bad, in terms of the preference criterion.
  • The rules express structural patterns which determine whether a sentence is a good example or not.
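  • Continuing the sketch above (hypothetical helper functions, using the tree format of the previous fragment): each root-to-leaf path is read off as a rule, and applying the tree to the test vectors gives the percentage error rate, as in the Figure 7 example.

    def extract_rules(tree, conditions=()):
        if not isinstance(tree, tuple):            # a leaf: the classification
            return [(conditions, tree)]
        tag, branches = tree
        rules = []
        for value, subtree in branches.items():
            rules += extract_rules(subtree, conditions + ((tag, value),))
        return rules

    def classify(tree, vector, attributes):
        # attributes: the full attribute list; vector: a full binary vector
        while isinstance(tree, tuple):
            tag, branches = tree
            tree = branches[vector[attributes.index(tag)]]
        return tree

    def error_rate(tree, test_rows, attributes):
        wrong = sum(1 for vec, label in test_rows
                    if classify(tree, vec, attributes) != label)
        return 100.0 * wrong / len(test_rows)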
  • Figure 8 shows a computer system 800 that may be used to implement embodiments of the invention.
  • The computer system 800 may be used to implement a grammar formulation mechanism and/or a grammar coverage tool.
  • The computer system 800 may be used to provide at least one component of a spoken language interface.
  • The computer system 800 may also be used by a grammar writer to formulate grammars separately from any part of a spoken language interface.
  • The computer system 800 comprises various data processing resources such as a processor (CPU) 830 coupled to a bus structure 838. Also connected to the bus structure 838 are further data processing resources such as read-only memory 832 and random access memory 834.
  • A display adapter 836 connects a display device 818, having screen 820, to the bus structure 838.
  • One or more user-input device adapters 840 connect the user-input devices, including the keyboard 822 and mouse 824, to the bus structure 838.
  • An adapter 841 for the connection of a printer 821 may also be provided.
  • One or more media drive adapters 842 can be provided for connecting the media drives, for example the optical disk drive 814, the floppy disk drive 816 and hard disk drive 819, to the bus structure 838.
  • One or more communications adapters 844 can be provided for connecting the computer system to one or more networks or to other computer systems or devices, providing a processing resource interface between the computer system and external resources.
  • The communications adapters 844 could include a local area network adapter, a modem and/or ISDN terminal adapter, or serial or parallel port adapter etc., as required.
  • The processor 830 will execute computer program instructions that may be stored in one or more of the read-only memory 832, random access memory 834, the hard disk drive 819, a floppy disk in the floppy disk drive 816 and an optical disc, for example a compact disc (CD) or digital versatile disc (DVD), in the optical disc drive, or dynamically loaded via adapter 844.
  • The results of the processing performed may be displayed to a user via the display adapter 836 and display device 818.
  • User inputs for controlling the operation of the computer system 800 may be received via the user-input device adapters 840 from the user-input devices.
  • A computer program for implementing various functions or conveying various information can be written in a variety of different computer languages and can be supplied on carrier media.
  • A program or program element may be supplied on one or more CDs, DVDs and/or floppy disks and then stored on a hard disk, for example.
  • A program may also be embodied as an electronic signal supplied on a telecommunications medium, for example over a telecommunications network.
  • Where the invention is implemented using a software-controlled programmable processing device, such as a Digital Signal Processor, microprocessor, other processing device, data processing apparatus or computer system, a computer program for configuring the programmable device, apparatus or system to implement the foregoing described methods is envisaged as an aspect of the present invention.
  • The computer program may be embodied as source code and undergo compilation for implementation on a processing device, apparatus or system, or may be embodied as object code, for example.
  • The term computer system, in its most general sense, encompasses programmable devices such as those referred to above, data processing apparatus and firmware-embodied equivalents, whether part of a distributed computer system or not.
  • Software components may be implemented as plug-ins, modules and/or objects, for example, and may be provided as a computer program product stored on a carrier medium in machine or device readable form.
  • A computer program may be stored, for example, in solid-state memory, magnetic memory such as disc or tape, optically or magneto-optically readable memory such as compact disc read-only or read-write memory (CD-ROM, CD-RW), digital versatile disc (DVD) etc., and the processing device utilises the program or a part thereof to configure it for operation.
  • The computer program product may be supplied from a remote source embodied on a communications medium such as an electronic signal, radio-frequency carrier wave or optical carrier wave.
  • Such carrier media are also envisaged as aspects of the present invention.
  • Any communication link between a user and a mechanism, interface, tool and/or system according to aspects of the invention may be implemented using any available mechanisms, including mechanisms using one or more of: wired, WWW, LAN, Internet, WAN, wireless, optical, satellite, TV, cable, microwave, telephone, cellular etc.
  • The communication link may also be a secure link.
  • For example, the communication link can be a secure link created over the Internet using public key cryptographic encryption techniques or as an SSL link.
  • Embodiments of the invention may also employ voice recognition techniques for identifying a user.

Abstract

A grammar coverage tool operates on sets of sentences which can be expressed as ordered sequences of word class tags. A set of sentences is submitted to the tool. Previously, each sentence has been marked as either a positive or negative example with respect to some previously determined preference criterion. The entire set has been tagged so that each word is assigned its appropriate word class tag. The entire set is divided into two parts: a training set and a test set. The training set is re-written as a set of binary vectors, where each tag which occurs in the training set is an attribute. The order of tags is important. The binary vectors are then submitted to an algorithm which will induce a decision tree which classifies the training set into positive and negative examples. Since each possible path through the tree corresponds to a rule, the tree yields a set of rules for distinguishing positive and negative sentences. The rules are expressed in terms of the attributes, i.e. the word class tags. These rules can be applied to the test set, in order to reclassify each member sentence as positive or negative. A comparison is then made with the original and authentic classification for the test set. Thus, the percentage error rate can be calculated. If the error rate is too high, the process can be re-run with a larger volume of training data, until an acceptably low error rate is achieved. The induction of decision trees may use either the ID3 algorithm or the C4.5 algorithm.

Description

SYSTEM FOR GENERATING THE GRAMMAR OF A SPOKEN DIALOGUE SYSTEM
This invention relates to spoken language interfaces which enable a user to" interact using voice with a computer system. It is more specifically concerned with the writing of grammars for use in such interfaces.
A spoken language interface involves a two way dialogue between the user and the computer system. An automatic speech recognition system is used to comprehend the speech delivered by the user and an automatic speech generation system is used to generate speech to be played out to the user. This may be, for example, a mixture of speech synthesis and recorded voice. Spoken language interfaces are well known. One example is described in our earlier application GB 01 05 005.3 filed on 28 February 2001.
Spoken language interfaces rely on grammars to interpret user's commands and formulate responses. The grammar defines the sequences of words which it is possible for the interface to recognise. A grammar is a set of rules which de ines a set of sentences . Typically the rules are normally expressed in some kind of algebraic notation. The set of grammar rules usually define a much larger set of sentences; few rules cover many sentences. Size constraints apply to these grammars. Automatic speech recognisers can only recognise with high accuracy for grammars of a limited size. This means that there is a very strong motivation to remove any useless expressions from a grammar.
In any computer related application, there will be a limited number of expressions that a us.er is likely to say. If one considers an on-line flight reservation service it can easily be seen that it is possible to predict the majority of expressions that a user is likely to use. It can further be seen that a grammar is specific to a given application. In other words, for each application, a new grammar has to be written for the Spoken Language Interface (SLI) .
When constructing a grammar, some sentences may be preferable over others, according to one or more criteria. For example, expressions which are grammatically correct
(in the old fashioned sense of the term) might be preferred over those which are not . Alternatively, expressions which users use frequently, based on significant volumes of user-data, might be preferred over those which occur less often. A particular SLI provider might develop a consistent style across applications and favour expressions in that style.
In the ideal case, a grammar should cover all and only the expressions users are likely to say. At present, grammar writing is a time consuming operation involving the expertise of grammar writers to encapsulate manually the preferred sentences within each grammar and 'requiring painstaking care. This makes it hard to bring new applications to market quickly and also very expensive.
We have appreciated that there is a need for a tool which can assist in the process of grammar writing. The invention aims to provide such a tool . The invention is defined by the independent claims to which reference should be made . Preferred features are set out in the dependent claims .
According to a first aspect of the invention, there is provided a grammar formulation mechanism for assisting a grammar writer when formulating a new grammar. The grammar formulation mechanism is operable to apply an adaptive learning algorithm to a predetermined set of language components to determine a set of grammar rules and to apply the set of grammar rules to the new grammar. The grammar formulation mechanism derives the grammar rules from the set of language components .
Language components may represent any number of audible sounds, words, sentences, phrases, noises, utterances etc., and may be tagged so as to identify one or more of them during the adaptive learning process . The language components may be, for example, a scored set of sentences (e.g. composed of a sequence of words represented electronically) that give examples of sentences that are good and/or bad and/or intermediate (i.e. somewhere in between good and bad) which the grammar formulation mechanism learns and can subsequently apply to other language components and/or grammars either as they are generated or after their generation. The preference criteria, i.e. good/bad/intermediate, are applied consistently to the set of language components during the adaptive learning process. In one example, the preference criteria are based on frequency of use of language components .
The first aspect of the invention enables the grammar formulation mechanism to monitor new grammars as they are written, and propose alternative, usually better, grammars to a grammar writer. The grammar formulation mechanism may be configured automatically to re-write new grammars where better alternatives are found. In this way, the writing of new grammars is greatly facilitated, even if the grammar writer is not experienced in the art of writing new grammars . The grammar formulation mechanism may use an adaptive learning algorithm that is trained on a training set of language components obtained from the predetermined set of language components . By using a training set that may be varied, the grammar formulation mechanism can be trained to formulate grammar rules in dependence upon the application and/or system for which a new set of grammars is being formulated. This allows the grammar formulation mechanism to, for example, be trained, or re-trained, to adapt where it is to be used in different geographic regions having different language/dialog requirements. Grammar rules may be tested on a test set of language components obtained from the predetermined set of language components in order that the efficacy of the grammar rules can be tested.
The adaptive learning algorithm may use an inductive classification scheme, such as, for example, that used by the ID3 algorithm, to classify the language components during a training phase. This has the advantage of providing a compact set of grammar rules . A compact set of grammar rules can lead to quicker processing in a spoken language interface. An information-based heuristic may also be used by the adaptive learning algorithm to select the language components for classification during the training phase. Use of an information-based heuristic, such as, for example, one based on Shannon's theory, allows for improved selection of the language components for classification during the training phase, and this in turn can lead to a compact and efficient set of grammar rules .
The predetermined set of language components may form part of at least one sentence pre-identified as being good and/or bad and/or intermediate. Such a predetermined set of language components may form a training set for an expert system written by a grammar expert. The language components may have associated identifiers that indicate the grammar expert's opinion of their applicability. Where the identifiers identify language components as being intermediate, they may be selected according to a hierarchical ranking, e.g. those closest to being "good" being selected in preference to any others. Grammar rules derived from intermediate language components may themselves be scored and retained or rejected based upon the scores. This feature allows a fixed-size set of grammar rules to be produced. Such a fixed-size set of grammar rules may possibly contain one or more less-preferred grammar rules, but this feature makes optimal use of any storage available for such grammar rules, as it enables all the available storage space to be filled with grammar rules. The grammar formulation mechanism may be implemented using one or more of software, hardware and firmware. It may be provided as a software module that can act as a plug-in to other software to impart the functionality of the grammar formulation mechanism to that other software. For example, the grammar formulation mechanism may be provided as a software module provided on a carrier medium to a user, or the grammar formulation mechanism may be supplied as a component of a larger software package, such as, for example, as part of a grammar coverage tool as herein described.
According to a second aspect of the invention, there is provided a grammar coverage tool comprising a grammar formulation mechanism as herein described. The grammar coverage tool may be a software tool that provides an approval mechanism operable to propose one or more grammar rules to a grammar writer during the formulation of the new grammar. This allows the grammar coverage tool to suggest grammars to the grammar writer as new grammars are being written, providing the grammar writer with an interactive tool that aids him/her in the task of writing a new grammar. Further, the grammar coverage tool may automatically modify any new grammar(s) as they are being written if, for example, a poor grammar rule written by the grammar writer is detected.
Examples of carrier media on which a grammar coverage tool and/or a grammar formulation mechanism may be supplied include at least one of the following set of media: a radio-frequency signal, an optical signal, an electronic signal, a magnetic disc or tape, solid-state memory, an optical disc, a magneto-optical disc, a compact disc and a digital versatile disc.
According to a third aspect of the invention, there is provided a computer system configured to provide a grammar formulation mechanism and/or grammar coverage tool according to any of the aspects of the invention herein described. According to a fourth aspect of the invention, there is provided a method for assisting a grammar writer when formulating a new grammar, the method comprising applying an adaptive learning algorithm to a predetermined set of language components to determine a set of grammar rules, said predetermined set of language components comprising at least one sentence composed of at least one further language component, at least one said sentence being pre-identified as being good, bad and/or intermediate, and applying the set of grammar rules to the new grammar to determine whether the new grammar is good or bad. For example, the sentence(s) that form the predetermined set of language components may comprise a logical group of one or more of the following: audible sounds, words, conventional sentences in any language, phrases, noises, utterances etc.
The method may additionally comprise method steps corresponding to any component of any of the other aspects of the invention. For example, the method may comprise one or more of: a) training the adaptive learning algorithm on a training set of language components obtained from the predetermined set of language components; b) testing the grammar rules on a test set of language components obtained from the predetermined set of language components; c) classifying the language components during a training phase by applying an inductive classification scheme; and d) selecting the language components for classification during a training phase according to an information-based heuristic measurement.
A grammar coverage tool embodying the invention has the advantage of enabling automatic elimination of unfavoured sentences, on the basis of one or more preference criteria. Unfavoured sentences include those generated by incorrectly written grammar rules, those which are syntactically undesirable, and those which fail to match the preference criteria. During the process of writing a grammar, a grammar writer may write a rule that is incorrectly structured.
That is to say, a rule which generates nonsense sentences. Embodiments of the invention automatically detect poorly written grammar rules and correct them. In this sense, a Grammar Coverage Tool embodying the invention is an automatic debugging tool that saves a significant amount of time in the grammar writing process.
It has the further advantage of assessing the quality of a grammar by assuring that all forms of the likely occurring utterances are captured by the grammar.
In this sense, the Grammar Coverage Tool embodying the invention is a design enhancement tool that has the advantage of further reducing the time it takes the grammar writer to design a well-written grammar.
An embodiment of the present invention will now be described, by way of example, with reference to the accompanying drawings, in which:
Figure 1 is a schematic overview of a Spoken Language Interface; Figure 2 is a logical model of the interface architecture;
Figure 3 is a more detailed view of the interface architecture;
Figure 4 is a flow chart illustrating steps in a process embodying the present invention;
Figure 5 is a table showing a matrix of vectors representing tag sequences;
Figure 6 shows a decision tree generated from the matrix of Figure 5; Figure 7 shows a vector matrix for a second example; and
Figure 8 shows a computer system that can be used to implement aspects of the invention.
The system schematically outlined in Figure 1 is a spoken language interface intended for communication with applications via mobile, satellite, or landline telephone.
In the example shown communication is via a mobile telephone 18 but any other voice telecommunications device such as a conventional telephone can be utilised. Calls to the system are handled by a telephony unit 20. Connected to the telephony unit are a Voice Controller 19, an Automatic Speech Recognition System (ASR) 22 and an automatic speech generation (ASG) system 26. The ASR 22 and ASG systems are each connected to the voice controller 19. A dialogue manager 24 is connected to the voice controller 19 and also to a spoken language interface (SLI) repository 30, a personalisation and adaptive learning unit 32 which is also attached to the SLI repository 30, and a session and notification manager 28. The Dialogue Manager is also connected to a plurality of Application Managers (AM) 34, each of which is connected to an application which may be content provision external to the system. In the example shown, the content layer includes e-mail, news, travel, information, diary, banking etc. The nature of the content provided is not important to the principles of the invention. The SLI repository is also connected to a development suite 35.
Figure 2 provides a more detailed overview of the architecture of the system. The automatic speech generation unit 26 of Figure 1 includes a basic text-to-speech (TTS) unit, a batch TTS unit 120, connected to a prompt cache 124 and an audio player 122. It will be appreciated that instead of using generated speech, pre-recorded speech may be played to the user under the control of the voice control 19. In the embodiment illustrated a mixture of pre-recorded voice and TTS is used. The system then comprises three levels: session level 120, application level 122 and non-application level 124. The session level comprises a location manager 126 and a dialogue manager 128. The session level also includes an interactive device control 130 and a session manager 132 which includes the functions of user identification and Help Desk.
The application layer comprises the application framework 134 under which an application manager controls an application. Many application managers and applications will be provided, such as UMS (Unified Messaging Service), Call connect & conferencing, e-Commerce, Dictation etc. The non-application level 124 comprises a back office subsystem 140 which includes functions such as reporting, billing, account management, system administration, "push" advertising and current user profile. A transaction subsystem 142 includes a transaction log together with a transaction monitor and message broker.
In the final subsystem, an activity log 144 and a user profile repository 146 communicate with an adaptive learning unit 148. The adaptive learning unit also communicates with the dialogue manager 128. A personalisation module 150 also communicates with the user profiles repository 146 and the dialogue manager 128.
Referring back to Figure 1, the various functional components are briefly described as follows:
Voice Control 19
This allows the system to be independent of the ASR 22 and TTS 26 by providing an interface to either proprietary or non-proprietary speech recognition, text to speech and telephony components. The TTS may be replaced by, or supplemented by, recorded voice. The voice control also provides for logging and assessing call quality. The voice control will optimise the performance of the ASR.
Spoken Language Interface Repository 30

In contrast to the prior art, grammars (that is, constructs and user utterances for which the system listens), prompts and workflow descriptors are stored as data in a database rather than written in time-consuming ASR/TTS-specific scripts. As a result, multiple languages can be readily supported with greatly reduced development time, a multi-user development environment is facilitated and the database can be updated at any time to reflect new or updated applications without taking the system down. The data is stored in a notation independent form. The data is converted or compiled between the repository and the voice control to the optimal notation for the ASR being used. This enables the system to be ASR independent.
ASR & ASG (Voice Engine) 22, 26
The voice engine is effectively dumb as all control comes from the dialogue manager via the voice control.
Dialogue Manager 24
The dialogue manager controls the dialogue across multiple voice servers and other interactive servers (eg WAP, Web etc). As well as controlling dialogue flow it controls the steps required for a user to complete a task through mixed initiative - by permitting the user to change initiative with respect to specifying a data element (e.g. destination city for travel). The Dialogue Manager may support comprehensive mixed initiative, allowing the user to change the topic of conversation across multiple applications while maintaining state representations of where the user left off in the many domain-specific conversations. Currently, as initiative is changed across two applications, the state of conversation is maintained. Within the system, the dialogue manager controls the workflow. It is also able to dynamically weight the user's language model, by adaptively controlling, in real time, the probabilities associated with the likely speaking style that the individual user employs as a function of the current state of the conversation with the user; this is the chief responsibility of the Adaptive Learning Engine. The method by which the adaptive learning agent was conceived is to collect user speaking data from call data records. This data, collected from a large domain of calls (thousands), provides the general profile of language usage across the population of speakers. This profile, or mean language model, forms a basis for the first step in adjusting the language model probabilities to improve ASR accuracy. Within a conversation, the individual user's profile is generated and adaptively tuned across the user's subsequent calls. Early in the process, key linguistic cues are monitored and, based on individual user modelling, the elicitation of a particular language utterance dynamically invokes the modified language model profile tailored to the user, thereby adaptively tuning the user's language model profile and individually increasing the ASR accuracy for that user. Finally, the dialogue manager includes a personalisation engine. Given the user demographics (age, sex, dialect), a specific personality tuned to user characteristics for that user's demographic group is invoked. The dialogue manager also allows dialogue structures and applications to be updated or added without shutting the system down. It enables users to move easily between contexts, for example from flight booking to calendar etc; hang up and resume conversation at any point; specify information either step-by-step or in one complex sentence; and cut in and direct the conversation or pause the conversation temporarily.
Telephony

The telephony component includes the physical telephony interface and the software API that controls it. The physical interface controls inbound and outbound calls, handles conferencing, and provides other telephony-related functionality.
Session and Notification Management 28
The Session Manager initiates and maintains user and application sessions. These are persistent in the event of a voluntary or involuntary disconnection. They can re-instate the call at the position it had reached in the system at any time within a given period, for example 24 hours. A major problem in achieving this level of session storage and retrieval relates to retrieving a session in which a conversation is stored after either a dialogue structure, workflow structure or an application manager has been upgraded. In the preferred embodiment this problem is overcome through versioning of dialogue structures, workflow structures and application managers. The system maintains a count of active sessions for each version and only retires old versions once the version's count reaches zero. An alternative, which may be implemented, requires new versions of dialogue structures, workflow structures and application managers to supply upgrade agents. These agents are invoked by the session manager whenever it encounters old versions in the stored session. A log is kept by the system of the most recent version number. It may be beneficial to implement a combination of these solutions: the former for dialogue structures and workflow structures and the latter for application managers.
The notification manager brings events to a user's attention, such as the movement of a share price by a predefined margin. This can be accomplished while the user is online, through interaction with the dialogue manager, or offline. Offline notification is achieved either by the system calling the user and initiating an online session or through other media channels, for example, SMS, Pager, fax, email or other device.
Application Managers 34
Application Managers (AM) are components that provide the interface between the SLI and one or more of its content suppliers (i.e. other systems, services or applications). Each application manager (there is one for every content supplier) exposes a set of functions to the dialogue manager to allow business transactions to be realised (e.g. GetEmail(), SendEmail(), BookFlight(), GetNewsItem(), etc). Functions require the DM to pass the complete set of parameters required to complete the transaction. The AM returns the successful result or an error code to be handled in a predetermined fashion by the DM.
An AM is also responsible for handling some stateful information. For example, User A has been passed the first 5 unread emails. Additionally, it stores information relevant to a current user task. For example, flight booking details. It is able to facilitate user access to secure systems, such as banking, email or other. It can also deal with offline events, such as email arriving while a user is offline or notification from a flight reservation system that a booking has been confirmed. In these instances the AM's role is to pass the information to the Notification Manager.
An AM also exposes functions to other devices or channels, such as web, WAP, etc. This facilitates the multi channel conversation discussed earlier.
AMs are able to communicate with each other to facilitate aggregation of tasks. For example, booking a flight primarily would involve a flight booking AM, but this would directly utilise a Calendar AM in order to enter flight times into a user's Calendar.
AMs are discrete components built, for example, as Enterprise JavaBeans (EJBs), and they can be added or updated while the system is live.
Transaction & Message Broker 142 (Fig. 2)
The Transaction and Message Broker records every logical transaction, identifies revenue-generating transactions, routes messages and facilitates system recovery.
Adaptive Learning & Personalisation 32; 148, 150 (Fig.2)
Spoken conversational language reflects quite a bit of a user's psychology, socio-economic background, dialect and speech style. The reason an SLI is a challenge, which is met by embodiments of the invention, is due to these confounding factors. Embodiments of the invention provide a method of modelling these features and then tuning the system to effectively listen out for the most likely occurring features. Before discussing in detail the complexity of encoding this knowledge, it is noted that a very large vocabulary of phrases encompassing all dialects and speech styles (verbose, terse or declarative) results in a complex listening test for any recogniser. User profiling, in part, solves the problem of recognition accuracy by tuning the recogniser to listen out for only the likely occurring subset of utterances in a large domain of options.
The adaptive learning technique is a stochastic (statistical) process which first models which types, dialects and styles the entire base of users employs.
By monitoring the spoken language of many hundreds of calls, a profile is created by counting the language most utilised across the population and profiling less likely occurrences. Indeed, the less likely occurring utterances, or those that do not get used at all, could be deleted to improve accuracy. But then a new user, who might employ the deleted phrase, not yet observed, could come along and he would have a dissatisfying experience; a system tuned for the average user would not work well for him. A more powerful technique is to profile individual user preferences early on in the transaction, and simply amplify those sets of utterances over those utterances less likely to be employed. The general data of the masses is used initially to set a set of tuning parameters and, during a new phone call, individual stylistic cues are monitored, such as phrase usage, and the model is immediately adapted to suit that caller. It is true that those that use the least likely utterances across the mass may initially be asked to repeat what they have said, after which the cue re-assigns the probabilities for the entire vocabulary. The approach, then, embodies statistical modelling across an entire population of users. The stochastic nature of the approach occurs when new observations are made across the average mass, and language modelling weights are adaptively assigned to tune the recogniser.

Help Assistant & Interactive Training
The Help Assistant & Interactive Training component allows users to receive real-time interactive assistance and training. The component provides for simultaneous, multi channel conversation (i.e. the user can talk through a voice interface and at the same time see visual representation of their interaction through another device, such as the web) .
Databases
The system uses a commercially available database such as Oracle 8i from Oracle Corp.
Central Directory
The Central Directory stores information on users, available applications, available devices, locations of servers and other directory type information.
System Administration - Infrastructure
The System Administration - Applications component provides centralised, web-based functionality to administer the custom-built components of the system (e.g. Application Managers, Content Negotiators, etc.).
Development Suite (35)
This provides an environment for building spoken language systems incorporating dialogue and prompt design, workflow and business process design, version control and system testing. It is also used to manage deployment of system updates and versioning.
Rather than having to laboriously code likely occurring user responses in a cumbersome grammar (e.g. a BNF - Backus-Naur Form - grammar), resulting in time-consuming detailed syntactic specification, the development suite provides an intuitive, hierarchical, graphical display of language, leaving the modelling act to creatively uncover the precise utterance but reducing the coding act to a simple entry of a data string. The development suite provides a Rapid Application Development (RAD) tool that combines language modelling with business process design (workflow).
The grammar coverage tool (GCT) is embodied in the development suite. After a rule is entered into the RAD system, the GCT is invoked. It tests the efficiency of the rule (i.e. that it does not generate any garbage utterances which a user would not say, due to incorrect syntax) and also evaluates whether the rule generates the necessary linguistic coverage adopted by the house style or conventional wisdom about the way in which users phrase responses.
Dialogue Subsystem

The Dialogue Subsystem manages, controls and provides the interface for human dialogue via speech and sound. Referring to Figure 1, it includes the dialogue manager, spoken language interface repository, session and notification managers, the voice controller 19, the Automatic Speech Recognition unit 22, the Automatic Speech Generation unit 26 and telephony components 20. The subsystem is illustrated in the more detailed architecture of the interface shown in Figure 3.
Before describing the dialogue subsystem in more detail, it is appropriate first to discuss what a Spoken Language Interface (SLI) is.
A SLI refers to the hardware, software and data components that allow users to interact with a computer through spoken language. The term "interface" is particularly apt in the context of voice interaction, since the SLI acts as a conversational mediator, allowing information to be exchanged between user and system via speech. In its idealised form, this interface would be "invisible" and the interaction would, from the user's standpoint, appear as seamless and natural as a conversation with another person. In fact, one principal aim of most SLI projects is to create a system that is as near as possible to a human-human conversation. If the exchange between user and machine is construed as a dialogue, the objective for the SLI development team is to create the ears, mind and voice of the machine. In computational terms, the ears of the system are created by the Automatic Speech Recognition (ASR) System 22. The voice is created via the Automatic Speech Generation (ASG) software 26, and the mind is made up of the computational power of the hardware and the databases of information contained in the system. The present system uses software developed by other companies for its ASR and ASG. Suitable systems are available from Nuance and Lernout & Hauspie respectively. These systems will not be described further. However, it should be noted that the system allows great flexibility in the selection of these components from different vendors. Additionally, the basic Text To Speech unit supplied, for example, by Lernout & Hauspie may be supplemented by an audio subsystem which facilitates batch recording of TTS (to reduce system latency and CPU requirements), streaming of audio data from other sources (e.g. music, audio news, etc) and playing of audio output from standard digital audio file formats.
One implementation of the system is given in Figure 3. It should be noted that this is a simplified description. A voice controller 19 and the dialogue manager 24 control and manage the dialogue between the system and the end user. The dialogue is dynamically generated at run time from a SLI repository which is managed by a separate component, the development suite. The ASR unit 22 comprises a plurality of ASR servers.
The ASG unit 26 comprises a plurality of speech servers. Both are managed and controlled by the voice controller.
The telephony unit 20 comprises a number of telephony board servers and communicates with the voice controller, the ASR servers and the ASG servers.
Calls from users, shown as mobile phone 18, are handled initially by the telephony server 20, which makes contact with a free voice controller. The voice controller contacts and locates an available ASR resource. The voice controller 19 identifies the relevant ASR and ASG ports to the telephony server. The telephony server can now stream voice data from the user to the ASR server, and the ASG can stream audio to the telephony server. The voice controller, having established contacts with the ASR and ASG servers, now informs the Dialogue Manager, which requests a session on behalf of a user in the session manager. As a security precaution, the user is required to provide authentication information before this step can take place. This request is made to the session manager 28, which is represented logically at 132 in the session layer in Figure 2. The session manager server 28 checks with a dropped session store (not shown) whether the user has a recently dropped session. A dropped session could be caused by, for example, a user on a mobile entering a tunnel. This facility enables the user to be reconnected to a session without having to start over again.
The dialogue manager 24 communicates with the application managers 34 which in turn communicate with the internal/external services or applications to which the user has access. The application managers each communicate with a business transaction log 50, which records transactions and with the notification manager 28b. Communications from the application manager to the notification manager are asynchronous and communications from the notification manager to the application managers are synchronous. The notification manager also sends communications asynchronously to the dialogue manager 24. The dialogue manager 24 has a synchronous link with the session manager 28a, which has a synchronous link with the notification manager.
The dialogue manager 24 communicates with the adaptive learning unit 33 via an event log 52 which records user activity so that the system can learn from the user's interaction. This log also provides a series of debugging and reporting information. The adaptive learning unit is connected to the personalisation module 54 which is in turn connected to the dialogue manager. Workflow 56, Dialogue 58 and Personalisation 60 repositories are also connected to the dialogue manager 24 through the personalisation module 54 so that a personalised view is always handled by the dialogue manager 24. These three repositories make up the SLI Repository referred to earlier.
As well as receiving data from the workflow, dialogue and personalisation repositories, the personalisation module can also write to the personalisation repository 60. The Development Suite 35 is connected to the workflow and dialogue repositories 56, 58 and implements functional specifications of applications, storing the relevant grammars, dialogues, workflow and application manager function references for each application in the repositories. It also facilitates the design and implementation of system, help, navigation and misrecognition grammars, dialogues, workflow and action references in the same repositories.
The dialogue manager 24 provides the following key areas of functionality: the dynamic management of task oriented conversation and dialogue; the management of synchronous conversations across multiple formats; and the management of resources within the dialogue subsystem. Each of these will now be considered in turn.
Dynamic Management of Task Oriented Conversation and Dialogue
The conversation a user has with a system is determined by a set of dialogue and workflow structures, typically one set for each application. The structures store the speech to which the user listens, the keywords for which the ASR listens and the steps required to complete a task
(workflow). By analysing what the users say, which is returned by the ASR, and combining this with what the DM knows about the current context of the conversation, based on the current state of the dialogue structure, workflow structure, and application & system notifications, the DM determines its next contribution to the conversation or the action to be carried out by the AMs. The system allows the user to move between applications or contexts using either hotword or natural language navigation. The complex issues relating to managing state as the user moves from one application to the next, or even between multiple instances of the same application, are handled by the DM. This state management allows users to leave an application and return to it at the same point as when they left. This functionality is extended by another component, the session manager, to allow users to leave the system entirely and return to the same point in an application when they log back in.
The dialogue manager communicates via the voice controller with both the speech engine (ASG) 26 and the voice recognition engine (ASR) 22. The output from the speech generator 26 is voice data from the dialogue structures, which is played back to the user either as dynamic text to speech, as a pre-recorded voice or other stored audio format. The ASR listens for keywords or phrases that the user might say.
Typically, the dialogue structures are predetermined
(but stochastic language models could be employed in an implementation of the system, or hybrids of the two).
Predetermined dialogue structures or grammars are statically generated when the system is inactive. This is acceptable in prior art systems as scripts tended to be simple and did not change often once a system was activated. However, in the present system, the dialogue structures can be complex and may be modified frequently when the system is activated. To cope with this, the dialogue structure is stored as data in a run time repository, together with the mappings between recognised conversation points and application functionality. The repository is dynamically accessed and modified by multiple sources even when active users are on-line.
The dialogue subsystem comprises a plurality of voice controllers 19 and dialogue managers 24 (shown as a single server in Figure 3). The ability to update the dialogue and workflow structures dynamically greatly increases the flexibility of the system. In particular, it allows updates of the voice interface and applications without taking the system down; and provides for adaptive learning functionality which enriches the voice experience for the user as the system becomes more responsive and friendly to a user's particular syntax and phraseology with time. Considering each of these two aspects in more detail:
Updates
Today we are accustomed to having access to services 24 hours a day, and for mobile professionals this is even more the case given the difference in time zones. This means the system must run non-stop, 24 hours a day, 7 days a week. Therefore an architecture and system that allows new applications and services, or merely improvements in interface design, to be added with no effect on the serviceability of the system has a competitive advantage in the market place.
Adaptive Learning Functionality
Spoken conversational language reflects quite a bit of a user's psychology, socio-economic background, dialect and speech style. One reason an SLI is a challenge is due to these confounding factors. The solution this system provides to this challenge is a method of modelling these features and then tuning the system to effectively listen out for the most likely occurring features - Adaptive Learning. Without discussing in detail the complexity of encoding this knowledge, suffice it to say that a very large vocabulary of phrases encompassing all dialects and speech styles (verbose, terse or declarative) results in a complex listening test for any ASR. User profiling, in part, solves the problem of recognition accuracy by tuning the recogniser to listen out for only the likely occurring subset of utterances in a large domain of options. The adaptive learning technique is a stochastic process which first models which types, dialects and styles the entire base of users employs. By monitoring the spoken language of many hundreds of calls, a profile is created by counting the language most utilised across the population and profiling less likely occurrences. Indeed, the less likely occurring utterances, or those that do not get used at all, can be deleted to improve accuracy. But then a new user, who might employ the deleted phrase, not yet observed, could come along and he would have a dissatisfying experience; a system tuned for the average user would not work well for him. A more powerful technique is to profile individual user preferences early on in the transaction, and simply amplify those sets of utterances over those utterances less likely to be employed. The general data of the masses is used to initially set a set of tuning parameters and, during a new phone call, individual stylistic cues are monitored, such as phrase usage, and the model is immediately adapted to suit that caller. It is true that those that use the least likely utterances across the mass may initially be asked to repeat what they have said, after which the cue re-assigns the probabilities for the entire vocabulary. The grammar coverage tool will now be described. It is first useful to define terms used in the generation of grammars:
Definitions
Tag: a label applied to a class of words which play a similar role in terms of syntax.
Word class: a word class is a set of words which can play a common syntactic role.
Grammar: a set of algebraic rules which define a set of sentences. The rules can be expressed in terms of the tags which are used to label word classes.
Lexicon: a list of words where each is followed by its tag. In certain cases, a word may have more than one tag in the lexicon, which can make the job of reliably applying tags ambiguous.
Coverage: the set of sentences defined by a set of grammar rules. The set thus 'covers' the expressions from a user which may be recognised by the system.
Decision tree: a notation for representing logical rules as a hierarchical structure of nodes and branches.
Binary vector: a bracketed sequence of 1s and 0s.
Training set: a set of classified items to be presented to a learning process.
Test set: a set of classified items used to evaluate the performance of a trained process.
Sentence Structure
Every sentence in every language has a structure. This structure can be described in more or less detail. The coverage tool to be described applies to a class of sentences which can be described as ordered sequences of tags. A tag is simply a label which refers to a class of components. A component is generally an individual word. For example, consider the following simple example which relates to an application for purchasing airline tickets:
Example (1)
Let x = one of [Please, Kindly]
Let y = one of [change, modify]
Let t = the
Let z = one of [<destination city>, <airline>]
Let ? denote that a component is optional, so that a sentence may or may not contain it. The question mark applies to bracketed items as if they were a single indivisible unit. Then the rule ?x ?y ?t z defines a set of sentences which includes the following:
1) Please <destination city>
2) Please the <destination city>
3) Please change the <destination city>
4) Please change <destination city>
5) Change <destination city>
6) Change the <destination city>
7) The <destination city>
8) <destination city>
There is something awkward about sentences 1) and 2). In an appropriate context, sentences 6-8 may be quite natural; for example, if the grammar rule is supposed to cover what someone might say at an airline reservation desk, after having been asked, "Which of your details do you want to change?" Sentences 3 to 5 are acceptable as natural in most, if not all, circumstances. Sentences 1 and 2 are 'nonsense' or 'garbage' utterances. We explain how the tool can propose new grammar rules which will exclude such expressions.
One useful aspect of algebraic notation for grammar rules is that it enables large sets of sentences to be specified by a very compact set of rules. The downside is that these compact rules may not be as precise as we would like, and thus they cover awkward sentences too. Even sentences that are grammatically incorrect might be included even though they would never be spoken. This awkwardness provides us with one possible preference criterion.
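The expansion of such a rule is mechanical. The following sketch (illustrative only, and no part of the original specification; Python is an arbitrary choice of language here) expands the rule ?x ?y ?t z of Example (1) above by treating each optional component as a choice between its word class and the empty string:

from itertools import product

# Word classes from Example (1); '?' marks a component as optional.
x = ["Please", "Kindly"]
y = ["change", "modify"]
t = ["the"]
z = ["<destination city>", "<airline>"]

def expand(rule):
    # rule is a list of (word_class, optional) pairs
    alternatives = []
    for word_class, optional in rule:
        choices = list(word_class)
        if optional:
            choices.append("")  # the empty string models an omitted component
        alternatives.append(choices)
    for combination in product(*alternatives):
        yield " ".join(word for word in combination if word)

for sentence in expand([(x, True), (y, True), (t, True), (z, False)]):
    print(sentence)
# Four components already yield 3 * 3 * 2 * 2 = 36 sentences, including
# garbage such as "Please the <destination city>".

This makes concrete how few rules cover many sentences, and why imprecise rules inflate the coverage with garbage.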
Preference Criteria

The collection of sentences listed above comes from expanding the grammar rule. As discussed above, some sentences may be preferred to others, based on one or more preference criteria. It does not matter what the content of such a criterion is, provided that it can be applied consistently. Clearly there is no argument about a frequency of usage criterion: no sentence can be both frequently occurring and infrequently occurring. More subjective measures, such as 'intuitive acceptability', may be less consistent. Significant inconsistency in the training data will be disastrous.
In the following descriptions, it does not really matter what preference criterion is used, as long as it can be applied consistently. The point is that, for any set of sentences obtained by expanding a grammar rule, for example sentences 1-8 above, some can be categorised as positive examples and some as negative examples.
Possible preference criteria may include:
Frequency of usage;
House style;
Disfavouring 'garbage' - sentences with poor syntax;
Intuitive acceptability.
Other criteria, as yet unspecified, may also be selected.
Defining the preference criterion or criteria to classify each member of a set of sentences is the first step towards improving the coverage defined by the grammar.
The Tagging Procedure
The discussion above introduced the notion of sentence structure, and defined a simple scheme for describing that structure. It is axiomatic for the invention here described that it is possible to assign tags accurately to each word in a set of sentences. We need reasonably high levels of accuracy, but since the GCT can cope with a small amount of ambiguity, it need not be 100% in every case. There are standard tools and techniques available which can be easily adapted to work well with corpora from particular contexts and domains. For example, the Brill tagger is a program freely available in the public domain, which is readily usable with various different types of context and domain.
Those skilled in the art will realise that any tagging system or method can be used and there need be no restriction on the tag set used or to the parts of speech (or other audible sound) to which any particular tag set is applied. For example, tags may be applied in an adaptive way to deal with various topic specific grammars like news and/or financially-orientated grammars. Moreover, a single tag may be applied to a group of words as a unit. For example, clitics such as "I'm" and "gonna" as well as fully formed groups such as "I am" and "going to" may be tagged using respective tags. Furthermore, tagged parts of speech need not be purely linguistic elements and could in addition, or alternatively, be tagged according to semantic content.
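By way of illustration, the following sketch reduces a sentence to a tag sequence using the NLTK library's default part-of-speech tagger. The patent text names only the Brill tagger; the use of NLTK, and the model downloads, are assumptions made purely for this example:

import nltk

# One-off model downloads (an assumption about the local NLTK installation).
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

sentence = "Please change the destination city"
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
# tagged is a list of (word, tag) pairs; the exact tag set and the tags
# assigned depend on the tagger model used.
tag_sequence = [tag for _, tag in tagged]
print(tag_sequence)

Any tagger with adequate accuracy could be substituted, since the GCT tolerates a small amount of tagging ambiguity.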
The Grammar Coverage Tool
Given a set of sentences, divided into positive and negative examples and tagged as per our method above, the procedure below can be implemented. This is illustrated in the flow chart of Figure 4.
1. Take a set of sentences relating to a particular grammar. According to a defined preference criterion, categorise each sentence as 'good' or 'bad' (120). Ideally the set should reflect as many distinct kinds of preferable sentence as possible. This part of the process may be performed manually.
2. Divide the set into a training set and a test set (140). As a guide, use around 20% of the cases for training and 80% for testing. Other proportions may be used.
3. Use a computer to run an algorithm which can induce a decision tree to classify the sentences of the training set into positive and negative examples, on the basis of the sequence of tags which describes each sentence (160). Standard algorithms such as the ID3 algorithm, described below, or its variant C4.5, which is robust with respect to incomplete data, can be used. The ID3 algorithm is described in J.R. Quinlan's 1985 paper: The Induction of Decision Trees (described fully in Machine Learning and Data Mining by Michalski et al., Pub. J. Wiley).
4. Use the decision tree to propose a classification, either positive or negative with respect to the preference criterion, for the sentences in the test set (180).
5. Calculate the % error (200), E, of the predicted classification for the test set, given by the new rules, against the actual classification originally given to each member of the test set.
6. If E is high (for example over 5%) (220), consider repeating the procedure from (100) with a larger set. Otherwise the algorithm terminates.
7. A set of rules constituting a new grammar can be easily obtained from the decision tree. These new rules can be offered to a (human) grammar writer, or they may even be deployed directly.
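A minimal sketch of steps 1 to 6 follows, assuming the sentences have already been encoded as binary tag vectors (see 'Data Representation' below) and substituting scikit-learn's entropy-based decision tree for a hand-written ID3. The tiny data set and the 50/50 split are purely illustrative; a real run would use far more sentences and the proportions suggested above:

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Step 1: sentences encoded as binary tag vectors [x, y, t, d] and
# classified good (1) or bad (0) by the preference criterion.
vectors = [
    [1, 0, 0, 1],  # Please <destination city>            -> bad
    [1, 0, 1, 1],  # Please the <destination city>        -> bad
    [1, 1, 1, 1],  # Please change the <destination city> -> good
    [1, 1, 0, 1],  # Please change <destination city>     -> good
    [0, 1, 0, 1],  # Change <destination city>            -> good
    [0, 1, 1, 1],  # Change the <destination city>        -> good
    [0, 0, 1, 1],  # The <destination city>               -> good
    [0, 0, 0, 1],  # <destination city>                   -> good
]
labels = [0, 0, 1, 1, 1, 1, 1, 1]

# Step 2: divide into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    vectors, labels, train_size=0.5, stratify=labels, random_state=0)

# Step 3: induce a decision tree; criterion="entropy" gives ID3-like splits.
tree = DecisionTreeClassifier(criterion="entropy").fit(X_train, y_train)

# Steps 4 and 5: classify the test set and calculate the % error E.
E = 100.0 * (1.0 - tree.score(X_test, y_test))
print(f"E = {E:.1f}%")

# Steps 6 and 7: if E is acceptably low, grammar rules can be read off
# the tree paths.
print(export_text(tree, feature_names=["x", "y", "t", "d"]))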
The new rules define a grammar of higher quality, because they include fewer garbage utterances.

Inducing a Decision Tree
In step 3 above, the algorithm begins with a training set of good and bad examples of sentences which have previously been given appropriate tag-sequence descriptions.
These examples are used to produce a classification scheme, called a decision tree.

The Algorithms
1. ID3

The ID3 algorithm is used to build a decision tree, given a set of non-categorical attributes C1, C2, .., Cn, the categorical attribute C, and a training set T of records.

function ID3 (R: a set of non-categorical attributes, C: the categorical attribute, S: a training set) returns a decision tree;
begin
If S is empty, return a single node with value Failure;
If S consists of records all with the same value for the categorical attribute, return a single node with that value;
If R is empty, then return a single node with as value the most frequent of the values of the categorical attribute that are found in records of S; [note that then there will be errors, that is, records that will be improperly classified];
Let D be the attribute with largest Gain(D,S) among attributes in R;
Let {dj | j=1,2, .., m} be the values of attribute D;
Let {Sj | j=1,2, .., m} be the subsets of S consisting respectively of records with value dj for attribute D;
Return a tree with root labeled D and arcs labeled d1, d2, .., dm going respectively to the trees ID3(R-{D}, C, S1), ID3(R-{D}, C, S2), .., ID3(R-{D}, C, Sm);
end ID3;
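The pseudocode above can be rendered as a short, runnable Python function. This is a sketch only: the dictionary-based record format, the gain helper and the arbitrary tie-breaking are implementation choices, not part of the algorithm as quoted:

import math
from collections import Counter

def entropy(labels):
    # Shannon entropy, in bits, of a list of class labels.
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def gain(records, labels, attr):
    # Information gain of splitting the records on attribute attr.
    total = len(records)
    remainder = 0.0
    for value in set(record[attr] for record in records):
        subset = [lab for record, lab in zip(records, labels)
                  if record[attr] == value]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder

def id3(attrs, records, labels):
    # Returns a class label at a leaf, or (attribute, {value: subtree}).
    if not records:
        return "Failure"
    if len(set(labels)) == 1:            # all records in the same class
        return labels[0]
    if not attrs:                        # no attributes left: majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(records, labels, a))
    tree = {}
    for value in set(record[best] for record in records):
        keep = [i for i, record in enumerate(records)
                if record[best] == value]
        tree[value] = id3([a for a in attrs if a != best],
                          [records[i] for i in keep],
                          [labels[i] for i in keep])
    return (best, tree)

# e.g. id3(["x", "y", "t"],
#          [{"x": 1, "y": 0, "t": 0}, {"x": 1, "y": 1, "t": 1}],
#          ["bad", "good"])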
2. C4.5 Extensions

C4.5 introduces a number of extensions of the original ID3 algorithm.
In building a decision tree we can deal with training sets that have records with unknown attribute values by evaluating the gain, or the gain ratio, for an attribute by considering only the records where that attribute is defined.
In using a decision tree, we can classify records that have unknown attribute values by estimating the probability of the various possible results.
Cases of attributes with continuous ranges are dealt with as follows. Say that attribute Ci has a continuous range. Examine the values for this attribute in the training set. Say they are, in increasing order, A1, A2, .., Am. Then for each value Aj, j=1,2,..,m, partition the records into those that have Ci values up to and including Aj, and those that have values greater than Aj. For each of these partitions compute the gain, or gain ratio, and choose the partition that maximizes the gain.
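A sketch of the threshold search just described, with its own entropy helper and invented values for a continuous attribute Ci (the data here is illustrative only):

import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

# Illustrative values of a continuous attribute Ci, with class labels.
values = [1.2, 2.0, 2.7, 3.5, 4.1]
labels = ["bad", "bad", "good", "good", "good"]

base = entropy(labels)
best_threshold, best_gain = None, -1.0
for threshold in sorted(set(values))[:-1]:   # candidate cut points Aj
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    g = (base
         - (len(left) / len(labels)) * entropy(left)
         - (len(right) / len(labels)) * entropy(right))
    if g > best_gain:
        best_threshold, best_gain = threshold, g
print(best_threshold, best_gain)   # the partition that maximises the gain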
Data Representation
In order to be used by the algorithm(s) outlined herein, the sequences of tags which represent the sentence structures need to be encoded as binary vectors. A matrix is used as follows. Consider again the following simple example (1):
Let x = one of [Please, Kindly]
Let y = one of [change, modify]
Let t = the
Let z = one of [<destination city>, <airline>]
Let ? denote that a component is optional, so that a sentence may or may not contain it. Then the rule ?x ?y ?t z defines a set of sentences which includes these examples (the training set):
1) Please <destination city>
2) Please the <destination city>
3) Please change the <destination city>
4) Please change <destination city>
5) Change <destination city>
6) Change the <destination city>
7) The <destination city>
8) <destination city>
In the binary representation scheme of figure 5, a 1 indicates that the tag is present in that example, a 0 that it is not.
The classification tree is then made as follows:
1) If, at a given node, all the examples are in the same class (all positive or all negative), then label the node as a leaf and terminate.
2) If there are both positive and negative examples, then select the most informative feature (in our case, one of the tags) and partition into branches according to the value of that tag (i.e. present, or null).
3) Use Shannon's information heuristic to identify the probable most informative feature: this is given by the entropy -∑ p log p for each feature, where p is the probability of a particular value occurring in the feature slot across all the examples at this node.
4) Keep iterating the tree from step 1 until all bottom level nodes are leaf nodes.
The decision tree is shown in Figure 6.
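The heuristic of step 3 can be verified directly for this example. The following self-contained sketch computes the information gain of each tag over the eight training examples, using the good/bad classification implied by the discussion of Example (1):

import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

# Rows of the Figure 5 matrix: presence (1) or absence (0) of tags x, y, t
# (tag d is present in every example), with the preference classification.
examples = [
    ((1, 0, 0), "bad"),  ((1, 0, 1), "bad"),  ((1, 1, 1), "good"),
    ((1, 1, 0), "good"), ((0, 1, 0), "good"), ((0, 1, 1), "good"),
    ((0, 0, 1), "good"), ((0, 0, 0), "good"),
]
labels = [label for _, label in examples]
base = entropy(labels)               # about 0.81 bits at the root node

for i, tag in enumerate(["x", "y", "t"]):
    groups = {}
    for vector, label in examples:
        groups.setdefault(vector[i], []).append(label)
    remainder = sum((len(g) / len(examples)) * entropy(g)
                    for g in groups.values())
    print(tag, "gain =", round(base - remainder, 3))
# x and y are equally informative (gain 0.311) while t carries no
# information (gain 0.0), which is consistent with the rules read off
# the tree in the text (an x must be followed by a y).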
Each path through the tree corresponds to a rule. For the example matrix above, the ID3 algorithm produces the following set of rules:
1) ?x ?y ?t d => Good sentence
2) x ?t d => Bad sentence (garbage)
3) y ?t d => Good sentence
In simple terms, these new rules capture our intuition that if an x (Please) is present, it must be followed by a y (change) if it is to be a good sentence on our selected preference criterion. Consider further the following test set of sentences, drawn from the same rule which produced the training set.

Example 2:
1) Kindly airline (Bad)
2) Kindly modify the airline (Good)
3) Modify airline (Good)
4) Please airline (Bad)
5) Please modify the airline (Good)
6) Kindly destination (Bad)
7) Change airline (Good)
8) Change the airline (Good)
We can apply the rules derived from the decision tree induced for the training set to these test sentences:
1) ?x ?y ?t d => Good sentence
2) x ?t d => Bad sentence
3) y ?t d => Good sentence
The matrix for this example is shown in Figure 7. So, in the example of Figure 7, the induced rules give 100% correct classification on the test set (i.e. a 0% error rate). The induced rules identify which sentences are good, and which bad, in terms of the preference criterion. The rules express structural patterns which determine whether a sentence is a good example or not.
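Applying the induced rules is purely mechanical. The sketch below encodes rule 2 (an x occurring without a y implies a bad sentence; everything else matches rule 1 or rule 3 and is good) and checks it against the test set of Example 2, using the tag assignments given in the text:

# Tag presence (x, y) for each test sentence of Example 2; tag d is always
# present. Rule 2 (x ?t d => Bad) fires when x occurs without y; every
# other sentence matches rule 1 or rule 3 and is classified Good.
def classify(x_present, y_present):
    return "Bad" if x_present and not y_present else "Good"

test_set = [
    ("Kindly airline",            1, 0, "Bad"),
    ("Kindly modify the airline", 1, 1, "Good"),
    ("Modify airline",            0, 1, "Good"),
    ("Please airline",            1, 0, "Bad"),
    ("Please modify the airline", 1, 1, "Good"),
    ("Kindly destination",        1, 0, "Bad"),
    ("Change airline",            0, 1, "Good"),
    ("Change the airline",        0, 1, "Good"),
]
errors = sum(classify(x, y) != expected for _, x, y, expected in test_set)
print(f"error rate: {100.0 * errors / len(test_set):.0f}%")   # 0% here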
In the description discussed above, for the class of sentences which can be represented as ordered sequences of tags, a procedure has been described for identifying which sequences of tags define good examples, and which define the bad. Good and bad are determined on the basis of a consistently applied preference criterion. Most of this procedure is implemented in software, thus providing a tool which can improve the coverage quality of a grammar, with respect to the preference criterion. The improvement is manifested in a modified set of grammar rules which is applied by a spoken language interface. For example, rules can be adopted which identify the good examples as the new grammar for an ASR. The ASR thus no longer risks reduced accuracy by having to listen out for 'bad' sentences. The embodiment described enables unfavourable elements to be eliminated automatically in contrast to the prior art in which all processing is manual.
In the embodiment described, the steps of selecting the set of sentences, choosing the preference criterion, separating the sentences into a test and a training set, and applying the preference criterion are performed manually. However, these steps may also be automated given a sufficiently large database of sentences to select from. Figure 8 shows a computer system 800 that may be used to implement embodiments of the invention. The computer system 800 may be used to implement a grammar formulation mechanism and/or a grammar coverage tool. The computer system 800 may be used to provide at least one component of a spoken language interface. The computer system 800 may be used by a grammar writer to formulate grammars separately from any part of a spoken language interface.
The computer system 800 comprises various data processing resources such as a processor (CPU) 830 coupled to a bus structure 838. Also connected to the bus structure 838 are further data processing resources such as read only memory 832 and random access memory 834. A display adapter 836 connects a display device 818 having screen 820 to the bus structure 838. One or more user-input device adapters 840 connect the user-input devices, including the keyboard 822 and mouse 824, to the bus structure 838. An adapter 841 for the connection of a printer 821 may also be provided. One or more media drive adapters 842 can be provided for connecting the media drives, for example the optical disk drive 814, the floppy disk drive 816 and hard disk drive 819, to the bus structure 838. One or more telecommunications adapters
844 can be provided thereby providing processing resource interface means for connecting the computer system to one or more networks or to other computer systems or devices .
The communications adapters 844 could include a local area network adapter, a modem and/or ISDN terminal adapter, or serial or parallel port adapter etc, as required. In operation the processor 830 will execute computer program instructions that may be stored in one or more of the read only memory 832, the random access memory 834, the hard disk drive 819, a floppy disk in the floppy disk drive 816 and an optical disc, for example a compact disc (CD) or digital versatile disc (DVD), in the optical disc drive, or dynamically loaded via adapter 844. The results of the processing performed may be displayed to a user via the display adapter 836 and display device 818. User inputs for controlling the operation of the computer system 800 may be received via the user-input device adapters 840 from the user-input devices.
A computer program for implementing various functions or conveying various information can be written in a variety of different computer languages and can be supplied on carrier media. A program or program element may be supplied on one or more CDs, DVDs and/or floppy disks and then stored on a hard disk, for example. A program may also be embodied as an electronic signal supplied on a telecommunications medium, for example over a telecommunications network.
It will be appreciated that the architecture of a computer system could vary considerably, and that Figure 8 is only one example. Although the invention has been described in relation to one or more mechanisms, interfaces, tools and/or systems, those skilled in the art will realise that any one or more such mechanism, interface and/or system, or any component thereof, may be implemented using one or more of hardware, firmware and/or software. Such mechanisms, interfaces and/or systems may, for example, form part of a distributed mechanism, interface and/or system providing functionality at a plurality of different physical locations. Insofar as embodiments of the invention described above are implementable, at least in part, using a software-controlled programmable processing device such as a Digital Signal Processor, microprocessor, other processing devices, data processing apparatus or computer system, it will be appreciated that a computer program for configuring a programmable device, apparatus or system to implement the foregoing described methods is envisaged as an aspect of the present invention. The computer program may be embodied as source code and undergo compilation for implementation on a processing device, apparatus or system, or may be embodied as object code, for example. The skilled person would readily understand that the term computer system in its most general sense encompasses programmable devices such as referred to above, and data processing apparatus and firmware embodied equivalents, whether part of a distributed computer system or not.
Software components may be implemented as plug-ins, modules and/or objects, for example, and may be provided as a computer program product stored on a carrier medium in machine or device readable form. Such a computer program may be stored, for example, in solid-state memory, magnetic memory such as disc or tape, optically or magneto-optically readable memory, such as compact disc read-only or read-write memory (CD-ROM, CD-RW), digital versatile disc (DVD) etc., and the processing device utilises the program or a part thereof to configure it for operation.
The computer program product may be supplied from a remote source embodied on a communications medium such as an electronic signal, radio frequency carrier wave or optical carrier wave. Such carrier media are also envisaged as aspects of the present invention.
Although the invention has been described in relation to the preceding example embodiments, it will be understood by those skilled in the art that the invention is not limited thereto, and that many variations are possible falling within the scope of the invention. For example, methods for performing operations in accordance with any one or combination of the embodiments and aspects described herein are intended to fall within the scope of the invention. As another example, those skilled in the art will understand that any communication link between a user and a mechanism, interface, tool and/or system according to aspects of the invention may be implemented using any available mechanisms, including mechanisms using one or more of: wired, WWW, LAN, Internet, WAN, wireless, optical, satellite, TV, cable, microwave, telephone, cellular etc. The communication link may also be a secure link. For example, the communication link can be a secure link created over the Internet using public key cryptographic encryption techniques or as an SSL link. Embodiments of the invention may also employ voice recognition techniques for identifying a user.
The scope of the present disclosure includes any novel feature or combination of features disclosed therein, either explicitly or implicitly, or any generalisation thereof, irrespective of whether or not it relates to the claimed invention or mitigates any or all of the problems addressed by the present invention. The applicant hereby gives notice that new claims may be formulated to such features during the prosecution of this application or of any such further application derived therefrom. In particular, with reference to the appended claims, features and sub-features from the claims may be combined with those of any other of the claims in any appropriate manner and not merely in the specific combinations enumerated in the claims. For the avoidance of doubt, the term "comprising", as used herein throughout the description and claims, is not to be construed solely as meaning "consisting only of".

Claims

1. A grammar formulation mechanism for assisting a grammar writer when formulating a new grammar, wherein said grammar formulation mechanism is operable to apply an adaptive learning algorithm to a predetermined set of language components to determine a set of grammar rules and to apply the set of grammar rules to the new grammar.
2. The grammar formulation mechanism of Claim 1, wherein the adaptive learning algorithm is trained on a training set of language components obtained from said predetermined set of language components.
3. The grammar formulation mechanism of Claim 1 or Claim 2, wherein said grammar rules are tested on a test set of language components obtained from said predetermined set of language components.
4. The grammar formulation mechanism of any one of Claims 1 to 3, wherein the adaptive learning algorithm uses an inductive classification scheme to classify the language components during a training phase.
5. The grammar formulation mechanism of any one of Claims 1 to 4, wherein the adaptive learning algorithm uses an information-based heuristic to select the language components for classification during a/the training phase.
6. The grammar formulation mechanism of any one of Claims 1 to 5, wherein the predetermined set of language components form part of at least one sentence pre-identified as being good or bad or somewhere between good and bad.
7. A grammar coverage tool comprising a grammar formulation mechanism according to any one of Claims 1 to 6, and an approval mechanism operable to propose at least one of said grammar rules to a grammar writer during the formulation of the new grammar.
8. The grammar coverage tool of Claim 7, operable to provide one of a plurality of predetermined sets of language components to the grammar formulation mechanism in dependence upon the application and/or system for which a new set of grammars is being formulated.
9. A computer program product comprising a computer usable medium having computer readable program code embodied in said computer usable medium, said computer readable program code comprising computer readable program code for causing at least one computer to provide the grammar formulation mechanism according to any one of Claims 1 to 6.
10. A computer program product comprising a computer usable medium having computer readable program code embodied in said computer usable medium, said computer readable program code comprising computer readable program code for causing at least one computer to provide the grammar coverage tool according to Claim 7 or Claim 8.
11. The computer program product according to Claim 9 or Claim 10, wherein the computer usable medium includes at least one of the following media: a radio-frequency signal, an optical signal, an electronic signal, a magnetic disc or tape, solid-state memory, an optical disc, a magneto-optical disc, a compact disc and a digital versatile disc.
12. A computer system configured to provide the grammar formulation mechanism according to any one of Claims 1 to 6 or the grammar coverage tool according to Claim 7 or Claim 8.
13. A method for assisting a grammar writer when formulating a new grammar, the method comprising: applying an adaptive learning algorithm to a predetermined set of language components to determine a set of grammar rules, said predetermined set of language components comprising at least one sentence composed of at least one further language component, at least one said sentence being pre-identified as being good, bad and/or intermediate; and applying the set of grammar rules to the new grammar to determine whether the new grammar is good or bad.
14. The method of Claim 13, comprising training the adaptive learning algorithm on a training set of language components obtained from said predetermined set of language components.
15. The method of Claim 13 or Claim 14, comprising testing said grammar rules on a test set of language components obtained from said predetermined set of language components.
16. The method of any one of Claims 13 to 15, comprising classifying the language components during a training phase by applying an inductive classification scheme.
17. The method of any one of Claims 13 to 16, comprising selecting the language components for classification during a/the training phase according to an information-based heuristic measurement.
18. A grammar formulation mechanism substantially as hereinbefore described, and with reference to the accompanying drawings.
19. A grammar coverage tool substantially as hereinbefore described, and with reference to the accompanying drawings.
20. A computer program element substantially as hereinbefore described.
21. A computer program product substantially as hereinbefore described.
22. A computer system substantially as hereinbefore described, and with reference to the accompanying drawings.
23. A method for assisting a grammar writer when formulating a new grammar substantially as hereinbefore described, and with reference to the accompanying drawings.
24. A method of writing a grammar for a class of sentences expressed as an ordered sequence of tags, comprising the steps of: a) acquiring a set of sentences or candidate sentences in a grammar; b) selecting a preference criterion and defining definitive acceptable and unacceptable sentences according to that criterion within the acquired set of sentences; c) dividing the set of sentences into a training set of sentences and a test set of sentences; d) assigning each word of each sentence a descriptor reflecting its syntactic class to form a tag sequence for each sentence; e) inducing a set of grammar rules by running an algorithm on a computer system to form a decision tree on the basis of the tag sequences; f) applying the set of induced grammar rules to the test set of sentences to yield a proposed classification of sentences as acceptable or unacceptable examples with respect to the preference criterion; g) comparing the proposed classification by application of the set of grammar rules to the test set of sentences with the definitive classification of the test set; and, depending on the result of the comparison, either accepting the rules from the decision tree, or repeating steps a) to g) with an expanded set of sentences or a different division between test and training sets.
25. A method according to claim 24, comprising encoding the sequence of tags representing sentence structures as binary vectors prior to running the algorithm.
26. A method according to claim 24 or 25, wherein the algorithm is an ID3 or C4.5 algorithm.
27. A method according to claim 24, 25 or 26, wherein the preference criterion is selected from the group comprising frequency of usage, house style and/or intuitive acceptability.
28. A method according to any of claims 24 to 27, wherein the formation of the decision tree comprises the steps of: a) labelling a node as a leaf and terminating if all examples have the same classification; b) where both acceptable and unacceptable examples are present, selecting the most informative tag and partitioning into branches according to the value of that tag; and c) repeating steps a) and b) until all bottom-level nodes are leaf nodes.
29. A method according to claim 28, wherein the step of selecting the most informative tag comprises identifying the probable most informative tag.
30. A method according to claim 29, comprising applying Shannon's information heuristic to identify the probable most informative tag.
31. Apparatus for writing a grammar for a class of sentences expressed as an ordered sequence of tags, in which a set of sentences or candidate sentences in a grammar is acquired, a preference criterion is selected to define definitive acceptable and unacceptable sentences according to that criterion within the acquired set of sentences, the sentences are divided into a training set and a test set, and each word of each sentence is assigned a descriptor reflecting its syntactic class to form a tag sequence for each sentence; the apparatus comprising: means for inducing a set of grammar rules, comprising an algorithm for running on a computer system to form a decision tree on the basis of the tag sequence; means for applying the set of induced grammar rules to the test set of sentences to yield a proposed classification of sentences as acceptable or unacceptable examples with respect to the preference criterion; and means for comparing the proposed classification by application of the set of grammar rules to the test set of sentences with the definitive classification of the test set, and for accepting or rejecting the rules depending on the result of the comparison.
32. Apparatus according to claim 31, comprising means for automatically acquiring the set of sentences or candidate sentences in the grammar; means for automatically selecting the preference criterion and for defining definitive acceptable and unacceptable sentences according to that criterion; means for automatically dividing the sentence set into training and test sets; and means for automatically forming a tag sequence for each sentence.
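The method of Claims 24 to 26 above can be illustrated in outline. The following is a minimal sketch in Python, assuming scikit-learn is available; the tag inventory, the toy tagged sentences and their acceptability labels are invented for illustration, and DecisionTreeClassifier with the "entropy" criterion stands in for the ID3/C4.5 algorithms named in Claim 26, which it approximates rather than reproduces exactly.

```python
# Sketch of Claims 24-26: induce grammar rules as a decision tree over
# tag sequences. The tag set and example data are illustrative assumptions.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

TAGS = ["DET", "NOUN", "VERB", "ADJ", "PREP"]  # assumed tag inventory
MAX_LEN = 4                                    # assumed maximum sentence length

def encode(tag_seq):
    """Encode a tag sequence as a fixed-length binary vector (Claim 25):
    one one-hot block of len(TAGS) bits per sentence position."""
    vec = []
    for i in range(MAX_LEN):
        tag = tag_seq[i] if i < len(tag_seq) else None
        vec.extend(1 if tag == t else 0 for t in TAGS)
    return vec

# Steps a), b), d): tagged sentences pre-classified as definitively
# acceptable (1) or unacceptable (0) under some preference criterion.
examples = [
    (["DET", "NOUN", "VERB"], 1),
    (["DET", "ADJ", "NOUN", "VERB"], 1),
    (["VERB", "DET", "NOUN"], 0),
    (["NOUN", "NOUN", "VERB"], 0),
]
X = [encode(seq) for seq, _ in examples]
y = [label for _, label in examples]

# Step c): divide the sentences into a training set and a test set.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.5, random_state=0, stratify=y)

# Step e): induce a decision tree from the training set.
tree = DecisionTreeClassifier(criterion="entropy").fit(X_tr, y_tr)

# Steps f), g): apply the induced rules to the test set and compare the
# proposed classification with the definitive one; poor agreement would
# prompt repeating with an expanded sentence set or a different division.
print(export_text(tree))  # the induced rules in human-readable form
print("agreement with definitive classification:", tree.score(X_te, y_te))
```

In practice the acquired sentence set would be far larger, and the decision in step g) to accept the rules or repeat the cycle would be driven by an agreement threshold rather than by inspection.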
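Claims 29 and 30 select the probable most informative tag using Shannon's information heuristic. A self-contained sketch of that selection step follows; the example data are invented, and all tag sequences are assumed to be of equal length for simplicity.

```python
# Sketch of Claims 29-30: choose the tag position whose values best
# separate acceptable from unacceptable examples, by information gain.
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H = -sum(p * log2(p)) over the label distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, position):
    """Expected reduction in entropy from partitioning the examples
    (tag_sequence, label) on the tag at the given position."""
    base = entropy([label for _, label in examples])
    partitions = {}
    for seq, label in examples:
        partitions.setdefault(seq[position], []).append(label)
    remainder = sum(len(part) / len(examples) * entropy(part)
                    for part in partitions.values())
    return base - remainder

def most_informative_position(examples):
    """Claim 29: the probable most informative tag is taken to be the
    position of highest information gain."""
    length = len(examples[0][0])
    return max(range(length), key=lambda i: information_gain(examples, i))

# Invented example: position 0 separates the two classes perfectly.
data = [(("DET", "NOUN"), 1), (("DET", "ADJ"), 1),
        (("VERB", "NOUN"), 0), (("VERB", "ADJ"), 0)]
print(most_informative_position(data))  # -> 0
```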
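The tree-formation steps of Claim 28 can then be written as a short recursion. This sketch continues the previous one, reusing its most_informative_position function and data; it omits safeguards a production implementation would need, such as handling splits that make no progress.

```python
# Sketch of Claim 28: recursive tree formation over tagged examples.
# Reuses most_informative_position and data from the preceding sketch.
def build_tree(examples):
    labels = {label for _, label in examples}
    # Step a): all examples share one classification -> label a leaf.
    if len(labels) == 1:
        return labels.pop()
    # Step b): both classes present -> select the most informative tag
    # and partition into branches according to its value.
    pos = most_informative_position(examples)
    branches = {}
    for seq, label in examples:
        branches.setdefault(seq[pos], []).append((seq, label))
    # Step c): repeat until every bottom-level node is a leaf.
    return (pos, {value: build_tree(sub) for value, sub in branches.items()})

print(build_tree(data))  # -> (0, {'DET': 1, 'VERB': 0})
```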
PCT/GB2002/001962 2001-04-30 2002-04-30 System for generating the grammar of a spoken dialogue system WO2002089113A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB0326763A GB2391993B (en) 2001-04-30 2002-04-30 System for generating the grammar of a spoken dialogue system

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB0110532A GB2375210B (en) 2001-04-30 2001-04-30 Grammar coverage tool for spoken language interface
GB0110532.9 2001-04-30

Publications (1)

Publication Number Publication Date
WO2002089113A1 (en) 2002-11-07

Family

ID=9913726

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2002/001962 WO2002089113A1 (en) 2001-04-30 2002-04-30 System for generating the grammar of a spoken dialogue system

Country Status (2)

Country Link
GB (2) GB2375210B (en)
WO (1) WO2002089113A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111435408B (en) * 2018-12-26 2023-04-18 阿里巴巴集团控股有限公司 Dialog error correction method and device and electronic equipment
CN112992128B (en) * 2021-02-04 2023-06-06 北京淇瑀信息科技有限公司 Training method, device and system of intelligent voice robot

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0813185A2 (en) * 1996-06-14 1997-12-17 Lucent Technologies Inc. Compilation of weighted finite-state transducers from decision trees
US5937385A (en) * 1997-10-20 1999-08-10 International Business Machines Corporation Method and apparatus for creating speech recognition grammars constrained by counter examples
WO2000078022A1 (en) * 1999-06-11 2000-12-21 Telstra New Wave Pty Ltd A method of developing an interactive system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LAWRENCE S ET AL: "Natural language grammatical inference: a comparison of recurrent neural networks and machine learning methods", CONNECTIONIST, STATISTICAL AND SYMBOLIC APPROACHES TO LEARNING FOR NATURAL LANGUAGE PROCESSING, MONTREAL, QUE., CANADA, AUG. 1995, 1996, Berlin, Germany, Springer-Verlag, pages 33-47, XP002204936, ISBN: 3-540-60925-3 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2115735A1 (en) * 2007-02-27 2009-11-11 Nuance Communications, Inc. Presenting supplemental content for digital media using a multimodal application
WO2012039686A1 (en) * 2010-09-24 2012-03-29 National University Of Singapore Methods and systems for automated text correction
US9733901B2 (en) 2011-07-26 2017-08-15 International Business Machines Corporation Domain specific language design
US10120654B2 (en) 2011-07-26 2018-11-06 International Business Machines Corporation Domain specific language design
CN105845134A (en) * 2016-06-14 2016-08-10 科大讯飞股份有限公司 Spoken language evaluation method through freely read topics and spoken language evaluation system thereof
EP3502923A1 (en) * 2017-12-22 2019-06-26 SoundHound, Inc. Natural language grammars adapted for interactive experiences
US11900928B2 (en) 2017-12-23 2024-02-13 Soundhound Ai Ip, Llc System and method for adapted interactive experiences
CN112069797A (en) * 2020-09-03 2020-12-11 阳光保险集团股份有限公司 Voice quality inspection method and device based on semantics
CN112069797B (en) * 2020-09-03 2023-09-01 阳光保险集团股份有限公司 Voice quality inspection method and device based on semantics

Also Published As

Publication number Publication date
GB2391993B (en) 2005-04-06
GB2391993A (en) 2004-02-18
GB2375210A (en) 2002-11-06
GB0110532D0 (en) 2001-06-20
GB2375210B (en) 2005-03-23
GB0326763D0 (en) 2003-12-24


Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ OM PH PL PT RO RU SD SE SG SI SK SL TJ TM TN TR TT TZ UA UG US UZ VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

ENP Entry into the national phase

Ref document number: 0326763

Country of ref document: GB

Kind code of ref document: A

Free format text: PCT FILING DATE = 20020430

Format of ref document f/p: F

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP

WWW Wipo information: withdrawn in national office

Country of ref document: JP