WO2002089113A1 - System for generating the grammar of a spoken dialogue system - Google Patents
- Publication number
- WO2002089113A1 PCT/GB2002/001962
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- grammar
- sentences
- rules
- sentence
- language components
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/183—Speech classification or search using natural language modelling using context dependencies, e.g. language models
- G10L15/19—Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules
- G10L15/193—Formal grammars, e.g. finite state automata, context free grammars or word networks
Definitions
- This invention relates to spoken language interfaces which enable a user to interact using voice with a computer system. It is more specifically concerned with the writing of grammars for use in such interfaces.
- A spoken language interface involves a two-way dialogue between the user and the computer system.
- An automatic speech recognition system is used to comprehend the speech delivered by the user and an automatic speech generation system is used to generate speech to be played out to the user. This may be, for example, a mixture of speech synthesis and recorded voice.
- Spoken language interfaces are well known. One example is described in our earlier application GB 01 05 005.3 filed on 28 February 2001.
- Spoken language interfaces rely on grammars to interpret users' commands and formulate responses.
- The grammar defines the sequences of words which it is possible for the interface to recognise.
- A grammar is a set of rules which defines a set of sentences. The rules are typically expressed in some kind of algebraic notation. A set of grammar rules usually defines a much larger set of sentences; few rules cover many sentences. Size constraints apply to these grammars: automatic speech recognisers can only recognise with high accuracy for grammars of a limited size. This means that there is a very strong motivation to remove any useless expressions from a grammar.
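- The way a small set of rules defines a much larger set of sentences can be sketched as follows. This is a minimal illustration, not the patent's own notation; the rule names and words are invented for the example.

```python
from itertools import product

# Hypothetical rules: each non-terminal expands into alternative
# sequences of terminals and/or non-terminals.
RULES = {
    "REQUEST": [["please", "VERB", "OBJECT"], ["VERB", "OBJECT"]],
    "VERB": [["book"], ["cancel"]],
    "OBJECT": [["a", "flight"], ["the", "meeting"]],
}

def expand(symbol):
    """Return every sentence (as a list of words) derivable from `symbol`."""
    if symbol not in RULES:                      # a terminal word
        return [[symbol]]
    sentences = []
    for alternative in RULES[symbol]:
        # Cartesian product of the expansions of each symbol in the alternative
        for parts in product(*(expand(s) for s in alternative)):
            sentences.append([w for part in parts for w in part])
    return sentences

sentences = [" ".join(s) for s in expand("REQUEST")]
# 3 rules yield 2 templates x 2 verbs x 2 objects = 8 distinct sentences
```

The combinatorial growth shown here is exactly why a handful of badly written rules can flood a recogniser with useless sentences.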
- Some sentences may be preferable over others, according to one or more criteria. For example, expressions which are grammatically correct may be preferred over those which are not.
- Grammar writing is a time-consuming operation involving the expertise of grammar writers to encapsulate manually the preferred sentences within each grammar, and requiring painstaking care. This makes it hard to bring new applications to market quickly and also very expensive.
- According to one aspect of the invention there is provided a grammar formulation mechanism for assisting a grammar writer when formulating a new grammar.
- The grammar formulation mechanism is operable to apply an adaptive learning algorithm to a predetermined set of language components to determine a set of grammar rules, and to apply the set of grammar rules to the new grammar.
- The grammar formulation mechanism derives the grammar rules from the set of language components.
- Language components may represent any number of audible sounds, words, sentences, phrases, noises, utterances etc., and may be tagged so as to identify one or more of them during the adaptive learning process.
- The language components may be, for example, a scored set of sentences (e.g. composed of a sequence of words represented electronically) that give examples of sentences that are good and/or bad and/or intermediate (i.e. somewhere in between good and bad), which the grammar formulation mechanism learns and can subsequently apply to other language components and/or grammars, either as they are generated or after their generation.
- The preference criteria, i.e. good/bad/intermediate, are applied consistently to the set of language components during the adaptive learning process. In one example, the preference criteria are based on frequency of use of language components.
- The first aspect of the invention enables the grammar formulation mechanism to monitor new grammars as they are written, and propose alternative, usually better, grammars to a grammar writer.
- The grammar formulation mechanism may be configured automatically to re-write new grammars where better alternatives are found. In this way, the writing of new grammars is greatly facilitated, even if the grammar writer is not experienced in the art of writing new grammars.
- The grammar formulation mechanism may use an adaptive learning algorithm that is trained on a training set of language components obtained from the predetermined set of language components. By using a training set that may be varied, the grammar formulation mechanism can be trained to formulate grammar rules in dependence upon the application and/or system for which a new set of grammars is being formulated.
- Grammar rules may be tested on a test set of language components obtained from the predetermined set of language components in order that the efficacy of the grammar rules can be tested.
- The adaptive learning algorithm may use an inductive classification scheme, such as, for example, that used by the ID3 algorithm, to classify the language components during a training phase.
- This has the advantage of providing a compact set of grammar rules.
- A compact set of grammar rules can lead to quicker processing in a spoken language interface.
- An information-based heuristic may also be used by the adaptive learning algorithm to select the language components for classification during the training phase.
- Use of an information-based heuristic such as, for example, one based on Shannon's theory, allows for improved selection of the language components for classification during the training phase, and this in turn can lead to a compact and efficient set of grammar rules.
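- The ID3-style combination of an inductive classification scheme with a Shannon-based heuristic can be sketched as below. The feature names and training examples are purely illustrative, not taken from the patent.

```python
from collections import Counter
from math import log2

# Hypothetical training data: each sentence is reduced to a feature dict
# and labelled good/bad by the grammar expert.
EXAMPLES = [
    ({"starts_with_verb": True,  "has_politeness": True},  "good"),
    ({"starts_with_verb": True,  "has_politeness": False}, "good"),
    ({"starts_with_verb": False, "has_politeness": True},  "bad"),
    ({"starts_with_verb": False, "has_politeness": False}, "bad"),
]

def entropy(examples):
    """Shannon entropy of the good/bad labels."""
    counts = Counter(label for _, label in examples)
    total = len(examples)
    return -sum((n / total) * log2(n / total) for n in counts.values())

def information_gain(examples, attribute):
    """Entropy reduction from splitting on `attribute` - ID3's selection heuristic."""
    total = len(examples)
    remainder = 0.0
    for value in {feats[attribute] for feats, _ in examples}:
        subset = [(f, l) for f, l in examples if f[attribute] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(examples) - remainder

# ID3 splits on the most informative attribute first, which is what
# keeps the resulting rule set (decision tree) compact.
best = max(EXAMPLES[0][0], key=lambda a: information_gain(EXAMPLES, a))
```

Here `starts_with_verb` separates good from bad sentences perfectly (gain 1.0 bit), so ID3 would split on it first; `has_politeness` carries no information (gain 0) and would never generate a rule.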
- The predetermined set of language components may form part of at least one sentence pre-identified as being good and/or bad and/or intermediate.
- Such a predetermined set of language components may form a training set for an expert system written by a grammar expert.
- The language components may have associated identifiers that indicate the grammar expert's opinion of their applicability. Where the identifiers identify language components as being intermediate, they may be selected according to a hierarchical ranking, e.g. those closest to being "good" being selected in preference to any others.
- Grammar rules derived from intermediate language components may themselves be scored and retained or rejected based upon the scores. This feature allows a fixed-size set of grammar rules to be produced.
- The grammar formulation mechanism may be implemented using one or more of software, hardware and firmware. It may be provided as a software module that can act as a plug-in to other software to impart the functionality of the grammar formulation mechanism to that other software.
- The grammar formulation mechanism may be provided as a software module provided on a carrier medium to a user, or may be supplied as a component of a larger software package, such as, for example, as part of a grammar coverage tool as herein described.
- According to another aspect of the invention there is provided a grammar coverage tool comprising a grammar formulation mechanism as herein described.
- The grammar coverage tool may be a software tool that provides an approval mechanism operable to propose one or more grammar rules to a grammar writer during the formulation of the new grammar. This allows the grammar coverage tool to suggest grammars to the grammar writer as new grammars are being written, providing the grammar writer with an interactive tool that aids him/her in the task of writing a new grammar. Further, the grammar coverage tool may automatically modify any new grammar(s) as they are being written if, for example, a poor grammar rule written by the grammar writer is detected.
- Examples of carrier media on which a grammar coverage tool and/or a grammar formulation mechanism may be supplied include at least one of the following set of media: a radio-frequency signal, an optical signal, an electronic signal, a magnetic disc or tape, solid-state memory, an optical disc, a magneto-optical disc, a compact disc and a digital versatile disc.
- According to a further aspect of the invention there is provided a computer system configured to provide a grammar formulation mechanism and/or grammar coverage tool according to any of the aspects of the invention herein described.
- According to a further aspect of the invention there is provided a method for assisting a grammar writer when formulating a new grammar, comprising applying an adaptive learning algorithm to a predetermined set of language components to determine a set of grammar rules, said predetermined set of language components comprising at least one sentence composed of at least one further language component, at least one said sentence being pre-identified as being good, bad and/or intermediate, and applying the set of grammar rules to the new grammar to determine whether the new grammar is good or bad.
- The sentence(s) that form the predetermined set of language components may comprise a logical group of one or more of the following: audible sounds, words, conventional sentences in any language, phrases, noises, utterances etc.
- The method may additionally comprise method steps corresponding to any component of any of the other aspects of the invention.
- The method may comprise one or more of: a) training the adaptive learning algorithm on a training set of language components obtained from the predetermined set of language components; b) testing the grammar rules on a test set of language components obtained from the predetermined set of language components; c) classifying the language components during a training phase by applying an inductive classification scheme; and d) selecting the language components for classification during a training phase according to an information-based heuristic measurement.
- A grammar coverage tool embodying the invention has the advantage of enabling automatic elimination of unfavoured sentences, on the basis of one or more preference criteria.
- Unfavoured sentences include those generated by incorrectly written grammar rules, those which are syntactically undesirable, and those which fail to match the preference criteria.
- A grammar writer may write a rule that is incorrectly structured.
- A Grammar Coverage Tool embodying the invention is an automatic debugging tool that saves a significant amount of time in the grammar writing process.
- The Grammar Coverage Tool embodying the invention is a design enhancement tool that has the advantage of further reducing the time it takes the grammar writer to design a well-written grammar.
- Figure 1 is a schematic overview of a Spoken Language Interface;
- Figure 2 is a logical model of the interface architecture;
- Figure 3 is a more detailed view of the interface architecture;
- Figure 4 is a flow chart illustrating steps in a process embodying the present invention;
- Figure 5 is a table showing a matrix of vectors representing tag sequences;
- Figure 6 shows a decision tree generated from the matrix of Figure 5;
- Figure 7 shows a vector matrix for a second example;
- Figure 8 shows a computer system that can be used to implement aspects of the invention.
- The system schematically outlined in Figure 1 is a spoken language interface intended for communication with applications via mobile, satellite, or landline telephone.
- Communication is via a mobile telephone 18, but any other voice telecommunications device such as a conventional telephone can be utilised.
- Calls to the system are handled by a telephony unit 20.
- Connected to the telephony unit are a Voice Controller 19, an Automatic Speech Recognition (ASR) system 22 and an Automatic Speech Generation (ASG) system 26.
- The ASR 22 and ASG systems are each connected to the voice controller 19.
- A dialogue manager 24 is connected to the voice controller 19 and also to a spoken language interface (SLI) repository 30, a personalisation and adaptive learning unit 32 which is also attached to the SLI repository 30, and a session and notification manager 28.
- The Dialogue Manager is also connected to a plurality of Application Managers (AM) 34, each of which is connected to an application, which may be content provision external to the system.
- The content layer includes e-mail, news, travel, information, diary, banking etc.
- The nature of the content provided is not important to the principles of the invention.
- The SLI repository is also connected to a development suite 35.
- Figure 2 provides a more detailed overview of the architecture of the system.
- The automatic speech generation unit 26 of Figure 1 includes a basic text-to-speech (TTS) unit, a batch TTS unit 120 connected to a prompt cache 124, and an audio player 122.
- Pre-recorded speech may be played to the user under the control of the voice control 19. In the embodiment illustrated, a mixture of pre-recorded voice and TTS is used.
- The system then comprises three levels: session level 120, application level 122 and non-application level 124.
- The session level comprises a location manager 126 and a dialogue manager 128.
- The session level also includes an interactive device control 130 and a session manager 132 which includes the functions of user identification and Help Desk.
- The application layer comprises the application framework 134 under which an application manager controls an application. Many application managers and applications will be provided, such as UMS (Unified Messaging Service), Call connect & conferencing, e-Commerce, Dictation etc.
- The non-application level 124 comprises a back office subsystem 140 which includes functions such as reporting, billing, account management, system administration, "push" advertising and current user profile.
- A transaction subsystem 142 includes a transaction log together with a transaction monitor and message broker.
- An activity log 144 and a user profile repository 146 communicate with an adaptive learning unit 148.
- The adaptive learning unit also communicates with the dialogue manager 128.
- A personalisation module 150 also communicates with the user profiles repository 146 and the dialogue manager 128.
- The voice control allows the system to be independent of the ASR 22 and TTS 26 by providing an interface to either proprietary or non-proprietary speech recognition, text-to-speech and telephony components.
- The TTS may be replaced by, or supplemented by, recorded voice.
- The voice control also provides for logging and assessing call quality. The voice control will optimise the performance of the ASR.
- The voice engine is effectively dumb, as all control comes from the dialogue manager via the voice control.
- The dialogue manager controls the dialogue across multiple voice servers and other interactive servers (e.g. WAP, Web etc.).
- As well as controlling dialogue flow, it controls the steps required for a user to complete a task through mixed initiative - by permitting the user to change initiative with respect to specifying a data element (e.g. destination city for travel).
- The Dialogue Manager may support comprehensive mixed initiative, allowing the user to change the topic of conversation across multiple applications while maintaining state representations of where the user left off in the many domain-specific conversations. Currently, as initiative is changed across two applications, the state of conversation is maintained. Within the system, the dialogue manager controls the workflow.
- The adaptive learning agent collects user speaking data from call data records. This data, collected from a large domain of calls (thousands), provides the general profile of language usage across the population of speakers. This profile, or mean language model, forms the basis for the first step in adjusting the language model probabilities to improve ASR accuracy.
- The individual user's profile is generated and adaptively tuned across the user's subsequent calls.
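- The two steps above - building a population-level mean language model from call records, then adaptively tuning it for an individual caller - can be sketched as follows. The utterances and the boost factor are invented for illustration; a real system would operate on the recogniser's n-gram or grammar weights.

```python
from collections import Counter

# Hypothetical call-data records across the population of speakers.
population_calls = [
    "book a flight", "book a flight", "flight please",
    "book a flight", "cancel my meeting",
]

def phrase_profile(utterances):
    """Relative frequency of each utterance - the 'mean language model'."""
    counts = Counter(utterances)
    total = sum(counts.values())
    return {phrase: n / total for phrase, n in counts.items()}

def adapt(profile, user_utterance, boost=2.0):
    """Amplify a phrase the current caller actually used, then renormalise."""
    adapted = dict(profile)
    adapted[user_utterance] = adapted.get(user_utterance, 1e-3) * boost
    norm = sum(adapted.values())
    return {p: w / norm for p, w in adapted.items()}

mean_model = phrase_profile(population_calls)    # "book a flight" dominates
user_model = adapt(mean_model, "flight please")  # terse caller: boost that phrase
```

After adaptation, the terse caller's preferred phrasing carries more probability mass than the population average assigned it, which is the tuning effect described for subsequent calls.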
- The dialogue manager includes a personalisation engine. Given the user demographics (age, sex, dialect), a specific personality tuned to the characteristics of that user's demographic group is invoked.
- The dialogue manager also allows dialogue structures and applications to be updated or added without shutting the system down. It enables users to move easily between contexts, for example from flight booking to calendar etc.; hang up and resume conversation at any point; specify information either step-by-step or in one complex sentence; cut in and direct the conversation; or pause the conversation temporarily.
- The telephony component includes the physical telephony interface and the software API that controls it.
- The physical interface controls inbound and outbound calls, handles conferencing, and other telephony-related functionality.
- The Session Manager initiates and maintains user and application sessions. These are persistent in the event of a voluntary or involuntary disconnection. They can re-instate the call at the position it had reached in the system at any time within a given period, for example 24 hours.
- A major problem in achieving this level of session storage and retrieval relates to retrieving a session in which a conversation is stored when either a dialogue structure, workflow structure or application manager has been upgraded. In the preferred embodiment this problem is overcome through versioning of dialogue structures, workflow structures and application managers. The system maintains a count of active sessions for each version and only retires old versions once the version's count reaches zero.
- An alternative which may be implemented requires new versions of dialogue structures, workflow structures and application managers to supply upgrade agents. These agents are invoked by the session manager whenever it encounters old versions in a stored session. A log is kept by the system of the most recent version number. It may be beneficial to implement a combination of these solutions: the former for dialogue structures and workflow structures, and the latter for application managers.
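- The reference-counted version-retirement policy described above can be sketched as below. The class and method names are invented for illustration; the point is simply that an old version survives until its last live session closes.

```python
# Minimal sketch of version retirement by active-session counting.
class VersionRegistry:
    def __init__(self):
        self.active = {}          # version -> number of live sessions
        self.latest = None

    def deploy(self, version):
        """Deploy a new version while the system is live."""
        self.active.setdefault(version, 0)
        self.latest = version

    def open_session(self):
        """New sessions always attach to the latest version."""
        self.active[self.latest] += 1
        return self.latest

    def close_session(self, version):
        self.active[version] -= 1

    def retirable(self):
        """Old versions with no remaining sessions can be removed."""
        return [v for v, n in self.active.items()
                if n == 0 and v != self.latest]

registry = VersionRegistry()
registry.deploy("v1")
s = registry.open_session()      # a caller is mid-conversation on v1
registry.deploy("v2")            # upgrade while the system is live
# v1 cannot be retired yet - its session count is still 1
registry.close_session(s)        # caller hangs up; v1 is now retirable
```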
- The notification manager brings events to a user's attention, such as the movement of a share price by a predefined margin. This can be accomplished while the user is online, through interaction with the dialogue manager, or offline. Offline notification is achieved either by the system calling the user and initiating an online session, or through other media channels, for example SMS, pager, fax, email or another device.
- Each application manager (there is one for every content supplier) exposes a set of functions to the dialogue manager to allow business transactions to be realised (e.g. GetEmail(), SendEmail(), BookFlight(), GetNewsItem(), etc.).
- Functions require the DM to pass the complete set of parameters required to complete the transaction.
- The AM returns the successful result or an error code to be handled in a predetermined fashion by the DM.
- An AM is also responsible for handling some stateful information, for example that User A has been passed the first 5 unread emails. Additionally, it stores information relevant to a current user task, for example flight booking details. It is able to facilitate user access to secure systems, such as banking, email or other. It can also deal with offline events, such as email arriving while a user is offline or notification from a flight reservation system that a booking has been confirmed. In these instances the AM's role is to pass the information to the Notification Manager.
- An AM also exposes functions to other devices or channels, such as web, WAP, etc. This facilitates the multi-channel conversation discussed earlier.
- AMs are able to communicate with each other to facilitate aggregation of tasks. For example, booking a flight would primarily involve a flight booking AM, but this would directly utilise a Calendar AM in order to enter flight times into a user's Calendar.
- AMs are discrete components built, for example, as Enterprise JavaBeans (EJBs), so they can be added or updated while the system is live.
- The Transaction and Message Broker records every logical transaction, identifies revenue-generating transactions, routes messages and facilitates system recovery.
- Spoken conversational language reflects a good deal of a user's psychology, socio-economic background, dialect and speech style. These confounding factors are what make an SLI a challenge, one which is met by embodiments of the invention.
- Embodiments of the invention provide a method of modelling these features and then tuning the system to listen out, in effect, for the most likely occurring features.
- A very large vocabulary of phrases encompassing all dialects and speech styles (verbose, terse or declarative) results in a complex listening task for any recogniser.
- User profiling solves the problem of recognition accuracy by tuning the recogniser to listen out for only the likely occurring subset of utterances in a large domain of options.
- The adaptive learning technique is a stochastic one.
- A profile is created by counting the language most utilised across the population and profiling less likely occurrences. Indeed, the less likely occurring utterances, or those that do not get used at all, could be deleted to improve accuracy. But a new user might then come along and employ a deleted phrase not yet observed; he would have a dissatisfying experience, and a system tuned for the average user would not work well for him.
- A more powerful technique is to profile individual user preferences early on in the transaction, and simply amplify those sets of utterances over the utterances less likely to be employed.
- The general data of the masses is used initially to set a set of tuning parameters, and during a new phone call individual stylistic cues, such as phrase usage, are monitored and the model is immediately adapted to suit that caller. Admittedly, those that use the least likely utterances across the mass may initially be asked to repeat what they have said, after which the cue re-assigns the probabilities for the entire vocabulary.
- The approach, then, embodies statistical modelling across an entire population of users. The stochastic nature of the approach arises when new observations are made across the average mass and language modelling weights are adaptively assigned to tune the recogniser.
Help Assistant & Interactive Training
- The Help Assistant & Interactive Training component allows users to receive real-time interactive assistance and training.
- The component provides for simultaneous, multi-channel conversation (i.e. the user can talk through a voice interface and at the same time see a visual representation of their interaction through another device, such as the web).
- The system uses a commercially available database such as Oracle 8i from Oracle Corp.
- The Central Directory stores information on users, available applications, available devices, locations of servers and other directory-type information.
- The System Administration - Applications component provides centralised, web-based functionality to administer the custom-built components of the system (e.g. Application Managers, Content Negotiators, etc.).
- The development suite provides an environment for building spoken language systems, incorporating dialogue and prompt design, workflow and business process design, version control and system testing. It is also used to manage deployment of system updates and versioning.
- Rather than having to laboriously code likely occurring user responses in a cumbersome grammar (e.g. a BNF - Backus-Naur Form - grammar), resulting in time-consuming detailed syntactic specification, the development suite provides an intuitive, hierarchical, graphical display of language, reducing the modelling act to creatively uncovering the precise utterances and the coding act to a simple entry of a data string.
- The development suite provides a Rapid Application Development (RAD) tool that combines language modelling with business process design (workflow).
- The grammar coverage tool (GCT) is embodied in the development suite. After a rule is entered into the RAD system, the GCT is invoked. It tests the efficiency of the rule (i.e. that it does not generate any garbage utterances which a user would not say, due to incorrect syntax) and also evaluates whether the rule generates the necessary linguistic coverage adopted by the house style or conventional wisdom about the way in which users phrase responses.
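- The kind of check the GCT performs on a newly entered rule can be sketched as below. The rule format, word lists and "garbage" examples are all invented for illustration: a rule is expanded into every sentence it licenses, and any sentence a user would not actually say is flagged back to the grammar writer.

```python
from itertools import product

# Hypothetical rule: a sequence of slots, one alternative chosen per slot.
rule = [["i want to"], ["book", "reserve"], ["a"], ["flight", "flights"]]

def sentences_from(rule):
    """All sentences the rule generates (one alternative per slot)."""
    return [" ".join(words) for words in product(*rule)]

# House-style examples of sentences users would never say (bad agreement).
GARBAGE = {"i want to book a flights", "i want to reserve a flights"}

generated = sentences_from(rule)
bad = [s for s in generated if s in GARBAGE]
# The rule licenses 4 sentences, 2 of which are garbage the writer should fix.
```

In practice the garbage test would be driven by the learned grammar rules rather than a fixed blacklist, but the debugging loop - expand the rule, score its sentences, report the unfavoured ones - is the same.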
- The Dialogue Subsystem manages, controls and provides the interface for human dialogue via speech and sound. Referring to Figure 1, it includes the dialogue manager, spoken language interface repository, session and notification managers, the voice controller 19, the Automatic Speech Recognition unit 22, the Automatic Speech Generation unit 26 and telephony components 20. The subsystem is illustrated in the more detailed architecture of the interface shown in Figure 3.
- An SLI refers to the hardware, software and data components that allow users to interact with a computer through spoken language.
- The term "interface" is particularly apt in the context of voice interaction, since the SLI acts as a conversational mediator, allowing information to be exchanged between user and system via speech. In its idealised form, this interface would be "invisible" and the interaction would, from the user's standpoint, appear as seamless and natural as a conversation with another person.
- One principal aim of most SLI projects is to create a system that is as near as possible to a human-human conversation. If the exchange between user and machine is construed as a dialogue, the objective for the SLI development team is to create the ears, mind and voice of the machine.
- The ears of the system are created by the Automatic Speech Recognition (ASR) system 22.
- The voice is created via the Automatic Speech Generation (ASG) software 26, and the mind is made up of the computational power of the hardware and the databases of information contained in the system.
- The present system uses software developed by other companies for its ASR and ASG. Suitable systems are available from Nuance and Lernout & Hauspie respectively. These systems will not be described further. However, it should be noted that the system allows great flexibility in the selection of these components from different vendors.
- The basic Text To Speech unit supplied, for example, by Lernout & Hauspie may be supplemented by an audio subsystem which facilitates batch recording of TTS (to reduce system latency and CPU requirements), streaming of audio data from other sources (e.g. music, audio news, etc.) and playing of audio output from standard digital audio file formats.
- A voice controller 19 and the dialogue manager 24 control and manage the dialogue between the system and the end user.
- The dialogue is dynamically generated at run time from an SLI repository which is managed by a separate component, the development suite.
- The ASR unit 22 comprises a plurality of ASR servers.
- The ASG unit 26 comprises a plurality of speech servers. Both are managed and controlled by the voice controller.
- The telephony unit 20 comprises a number of telephony board servers and communicates with the voice controller, the ASR servers and the ASG servers.
- Calls from users, shown as mobile phone 18, are handled initially by the telephony server 20, which makes contact with a free voice controller.
- The voice controller contacts and locates an available ASR resource.
- The voice controller 19 identifies the relevant ASR and ASG ports to the telephony server.
- The telephony server can now stream voice data from the user to the ASR server, and the ASG can stream audio to the telephony server.
- The voice controller, having established contact with the ASR and ASG servers, now informs the Dialogue Manager, which requests a session on behalf of a user from the session manager. As a security precaution, the user is required to provide authentication information before this step can take place. This request is made to the session manager 28, which is represented logically at 132 in the session layer in Figure 2.
- The session manager server 28 checks with a dropped session store (not shown) whether the user has a recently dropped session.
- A dropped session could be caused by, for example, a user on a mobile entering a tunnel. This facility enables the user to be reconnected to a session without having to start over again.
- The dialogue manager 24 communicates with the application managers 34, which in turn communicate with the internal/external services or applications to which the user has access.
- The application managers each communicate with a business transaction log 50, which records transactions, and with the notification manager 28b. Communications from the application managers to the notification manager are asynchronous, and communications from the notification manager to the application managers are synchronous.
- The notification manager also sends communications asynchronously to the dialogue manager 24.
- The dialogue manager 24 has a synchronous link with the session manager 28a, which has a synchronous link with the notification manager.
- The dialogue manager 24 communicates with the adaptive learning unit 33 via an event log 52 which records user activity so that the system can learn from the user's interaction. This log also provides a series of debugging and reporting information.
- The adaptive learning unit is connected to the personalisation module 34, which is in turn connected to the dialogue manager.
- Workflow 56, Dialogue 58 and Personalisation 60 repositories are also connected to the dialogue manager 24 through the personalisation module 54, so that a personalised view is always handled by the dialogue manager 24. These three repositories make up the SLI Repository referred to earlier.
- The personalisation module can also write to the personalisation repository 60.
- The Development Suite 35 is connected to the workflow and dialogue repositories 56, 58 and implements functional specifications of applications, storing the relevant grammars, dialogues, workflow and application manager function references for each application in the repositories. It also facilitates the design and implementation of system, help, navigation and misrecognition grammars, dialogues, workflow and action references in the same repositories.
- the dialogue manager 24 provides the following key areas of functionality: the dynamic management of task oriented conversation and dialogue; the management of synchronous conversations across multiple formats; and the management of resources within the dialogue subsystem. Each of these will now be considered in turn.
- the conversation a user has with a system is determined by a set of dialogue and workflow structures, typically one set for each application.
- the structures store the speech to which the user listens, the keywords for which the ASR listens, and the steps required to complete a task.
- the DM determines its next contribution to the conversation, or the action to be carried out by the AMs.
- the system allows the user to move between applications or context using either hotword or natural language navigation.
- the complex issues relating to managing state as the user moves from one application to the next, or even between multiple instances of the same application, are handled by the DM.
- This state management allows users to leave an application and return to it at the same point as when they left.
- This functionality is extended by another component, the session manager, to allow users to leave the system entirely and return to the same point in an application when they log back in.
- the dialogue manager communicates via the voice controller with both the speech engine (ASG) 26 and the voice recognition engine (ASR) 22.
- the output from the speech generator 26 is voice data from the dialogue structures, which is played back to the user either as dynamic text to speech, as a pre-recorded voice or other stored audio format.
- the ASR listens for keywords or phrases that the user might say.
- the dialogue structures are predetermined
- Predetermined dialogue structures or grammars are statically generated when the system is inactive. This is acceptable in prior art systems as scripts tended to be simple and did not change often once a system was activated. However, in the present system, the dialogue structures can be complex and may be modified frequently when the system is activated. To cope with this, the dialogue structure is stored as data in a run time repository, together with the mappings between recognised conversation points and application functionality. The repository is dynamically accessed and modified by multiple sources even when active users are on-line.
- the dialogue subsystem comprises a plurality of voice controllers 19 and dialogue managers 24 (shown as a single server in Figure 3).
- the ability to update the dialogue and workflow structures dynamically greatly increases the flexibility of the system. In particular, it allows updates of the voice interface and applications without taking the system down; and provides for adaptive learning functionality which enriches the voice experience for the user as the system becomes more responsive and friendly to a user's particular syntax and phraseology over time. Considering each of these two aspects in more detail:
- Spoken conversational language reflects a good deal of a user's psychology, socio-economic background, dialect and speech style.
- these confounding factors are one reason why building an SLI is a challenge.
- the solution this system provides to this challenge is a method of modelling these features and then tuning the system to listen out for the most likely occurring features: Adaptive Learning. Without discussing in detail the complexity of encoding this knowledge, suffice it to say that a very large vocabulary of phrases encompassing every dialect and speech style (verbose, terse or declarative) results in a complex listening test for any ASR.
- User profiling solves the problem of recognition accuracy by tuning the recogniser to listen out for only the likely occurring subset of utterances in a large domain of options.
- the adaptive learning technique is a stochastic process which first models which phrase types, dialects and styles the user base as a whole employs.
- a profile is created by counting the language most used across the population and profiling the less likely occurrences. Indeed, the least likely utterances, or those that are not used at all, can be deleted to improve accuracy. However, a new user who employs a deleted phrase, not yet observed, would then have a dissatisfying experience: a system tuned for the average user does not work well for him.
- a more powerful technique is to profile individual user preferences early in the transaction, and simply to amplify those sets of utterances over those less likely to be employed.
- the general data of the masses is used to set initial tuning parameters; during a new phone call, individual stylistic cues such as phrase usage are monitored and the model is immediately adapted to suit that caller. Admittedly, callers who use the least likely utterances across the mass may initially be asked to repeat what they have said, after which the cue re-assigns the probabilities for the entire vocabulary.
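By way of illustration only (this sketch, including all function names, utterance strings and the boost parameter, is invented and not part of the specification), the reweighting described above can be modelled as follows: population-wide usage counts seed a probability distribution over candidate utterance styles, and a style observed from the current caller is amplified, with the distribution renormalised so recognition weights remain probabilities.

```python
from collections import Counter

def seed_weights(population_counts):
    """Initial weights from usage counts across the whole user base."""
    total = sum(population_counts.values())
    return {u: c / total for u, c in population_counts.items()}

def adapt(weights, observed_utterance, boost=2.0):
    """Amplify the weight of an utterance style the caller has used,
    then renormalise so the weights remain a probability distribution."""
    w = dict(weights)
    if observed_utterance in w:
        w[observed_utterance] *= boost
    total = sum(w.values())
    return {u: v / total for u, v in w.items()}

# Hypothetical population data: most callers are terse, a few verbose.
population = Counter({"check my balance": 90,
                      "I would like to check my balance please": 10})
weights = seed_weights(population)
# The current caller uses the verbose style, so it is boosted for the
# remainder of the call.
weights = adapt(weights, "I would like to check my balance please")
```

A real system would of course track many stylistic cues at once; the single-utterance boost here merely shows the shape of the update.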
- the grammar coverage tool will now be described. It is first useful to define terms used in the generation of grammars:
- Tag: a label applied to a class of words which play a similar role in terms of syntax.
- Word class: a set of words which can play a common syntactic role.
- Grammar: a set of algebraic rules which define a set of sentences.
- the rules can be expressed in terms of the tags which are used to label word classes.
- Lexicon: a list of words where each is followed by its tag. In certain cases, a word may have more than one tag in the lexicon, which can make the job of reliably applying tags ambiguous.
- Coverage: the set of sentences defined by a set of grammar rules.
- the set thus 'covers' the expressions from a user which may be recognised by the system.
- Decision tree: a notation for representing logical rules as a hierarchical structure of nodes and branches.
- Training set: a set of classified items to be presented to a learning process.
- Test set: a set of classified items used to evaluate the performance of a trained process.
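As a purely hypothetical illustration of the terms defined above (the tag names, words and lexicon entries are invented, not taken from the specification), a sentence can be reduced to an ordered sequence of tags by looking each word up in a lexicon:

```python
# Invented lexicon: each word maps to a single word-class tag.
LEXICON = {
    "i": "PRON", "want": "VERB", "to": "TO",
    "change": "VERB", "my": "POSS", "flight": "NOUN", "name": "NOUN",
}

def tag_sequence(sentence, lexicon):
    """Describe a sentence as the ordered sequence of its word-class tags."""
    return [lexicon[w] for w in sentence.lower().split()]

print(tag_sequence("I want to change my flight", LEXICON))
# ['PRON', 'VERB', 'TO', 'VERB', 'POSS', 'NOUN']
```

Where a word carries more than one tag in the lexicon, as noted above, a real tagger would have to disambiguate; this sketch assumes one tag per word.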
- Every sentence in every language has a structure. This structure can be described in more or less detail.
- the coverage tool to be described applies to a class of sentences which can be described as ordered sequences of tags.
- a tag is simply a label which refers to a class of components.
- a component is generally an individual word. For example, consider the following simple example which relates to an application for purchasing airline tickets:
- sentences 6-8 may be quite natural; for example, if the grammar rule is supposed to cover what someone might say at an airline reservation desk after having been asked, "Which of your details do you want to change?" Sentences 3 to 5 are acceptable as natural in most, if not all, circumstances. Sentences 1 and 2 are 'nonsense' or 'garbage' utterances. We explain how the tool can propose new grammar rules which will exclude such expressions.
Preference Criteria
- The collection of sentences listed above comes from expanding the grammar rule. As discussed above, some sentences may be preferred to others, based on one or more preference criteria. It does not matter what the content of such a criterion is, provided that it can be applied consistently. Clearly there is no argument about a frequency-of-usage criterion: no sentence can be both frequently occurring and infrequently occurring. More subjective measures, such as 'intuitive acceptability', may be less consistent. Significant inconsistency in the training data will be disastrous.
- Possible preference criteria may include:
- Defining the preference criterion or criteria to classify each member of a set of sentences is the first step towards improving the coverage defined by the grammar .
- tags may be applied in an adaptive way to deal with various topic specific grammars like news and/or financially-orientated grammars.
- a single tag may be applied to a group of words as a unit. For example, clitics such as "I'm” and “gonna” as well as fully formed groups such as "I am” and "going to” may be tagged using respective tags.
- tagged parts of speech need not be purely linguistic elements and could in addition, or alternatively, be tagged according to semantic content.
- the set should reflect as many distinct kinds of preferred sentence as possible. This part of the process may be performed manually.
- 3. Use a computer to run an algorithm which can induce a decision tree to classify the sentences of the training set into positive and negative examples, on the basis of the sequence of tags which describes each sentence (160) .
- Standard algorithms such as the ID3 algorithm, described below, or its variant C4.5, which is robust with respect to incomplete data, can be used.
- the ID3 algorithm is described in J.R. Quinlan, "Induction of Decision Trees", Machine Learning, 1(1), 1986.
- a set of rules constituting a new grammar can be easily obtained from the decision tree. These new rules can be offered to a (human) grammar writer, or they may even be deployed directly.
- the new rules define a grammar of higher quality, because they include fewer garbage utterances.

Inducing a Decision Tree
- in step 3, the algorithm begins with a training set of good and bad examples of sentences which have previously been given appropriate tag-sequence descriptions.
- ID3: the ID3 algorithm is used to build a decision tree, given a set of non-categorical attributes C1, C2, ..., Cn, the categorical attribute C, and a training set T of records:

function ID3 (R: a set of non-categorical attributes,
              C: the categorical attribute,
              S: a training set) returns a decision tree;
begin
  If S is empty, return a single node with value Failure;
  If S consists of records all with the same value for the categorical
  attribute, return a single node with that value;
  If R is empty, then return a single node with as value the most frequent
  of the values of the categorical attribute that are found in records of S
  [note that then there will be errors, that is, records that will be
  improperly classified];
  Let D be the attribute with largest Gain(D,S) among attributes in R;
  Let {dj | j = 1, 2, ..., m} be the values of attribute D;
  Let {Sj | j = 1, 2, ..., m} be the subsets of S consisting respectively
  of records with value dj for attribute D;
  Return a tree with root labelled D and arcs labelled d1, d2, ..., dm
  going respectively to the trees ID3(R-{D}, C, S1), ..., ID3(R-{D}, C, Sm);
end
- a 1 indicates that the tag is present in that example, a 0 that it is not.
- the classification tree is then made as follows:
1) If, at a given node, all the examples are of the same class (all positive or all negative), then label the node as a leaf and terminate.
2) If there are both positive and negative examples, then select the most informative feature (in our case, one of the tags) and partition into branches according to the value of that tag (i.e. present, or null).
3) Use Shannon's information heuristic to identify the probable most informative feature: this is given by -Σ p log p for each feature, where p is the probability of a particular value occurring in the feature slot across all the examples at this node.
4) Keep iterating the tree from step 1 until all bottom-level nodes are leaf nodes.
The decision tree is shown in Figure 6.
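A minimal sketch of steps 1 to 4 above, assuming binary tag-presence features (1 = tag present, 0 = absent, as in the matrix described earlier). The training rows, feature indices and labels here are invented for illustration and are not taken from the specification:

```python
import math

def entropy(rows):
    """Shannon information -sum(p * log2 p) over class labels in rows."""
    n = len(rows)
    counts = {}
    for _, label in rows:
        counts[label] = counts.get(label, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def id3(rows, features):
    """Induce a decision tree over binary tag-presence features."""
    labels = {label for _, label in rows}
    if len(labels) == 1:                       # step 1: pure node -> leaf
        return labels.pop()
    if not features:                           # no features left: majority vote
        return max(labels,
                   key=lambda l: sum(1 for _, lab in rows if lab == l))

    def gain(f):                               # step 3: information gain
        split = {}
        for row in rows:
            split.setdefault(row[0][f], []).append(row)
        return entropy(rows) - sum(
            len(s) / len(rows) * entropy(s) for s in split.values())

    best = max(features, key=gain)
    tree = {"feature": best, "branches": {}}
    for value in (0, 1):                       # step 2: partition on the tag
        subset = [r for r in rows if r[0][best] == value]
        if subset:                             # step 4: recurse until leaves
            tree["branches"][value] = id3(
                subset, [f for f in features if f != best])
    return tree

# Invented examples: feature 0 and feature 1 are presence of two tags.
train = [((1, 1), "good"), ((1, 0), "garbage"),
         ((0, 1), "good"), ((0, 0), "garbage")]
tree = id3(train, [0, 1])
```

In this toy data, feature 1 perfectly separates good from garbage sentences, so it is chosen at the root and both branches are immediately leaf nodes.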
- Each path through the tree corresponds to a rule.
- the ID3 algorithm produces the following set of rules:
- the matrix for this example is shown in Figure 7. So, in the example of Figure 7, the induced rules give 100% correct classification on the test set (i.e. a 0% error rate).
- the induced rules identify which sentences are good, and which bad, in terms of the preference criterion.
- the rules express structural patterns which determine whether a sentence is a good example or not.
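The evaluation step can be illustrated as follows. The rule, tag names and test set below are hypothetical stand-ins (an induced rule is written as a simple predicate over a tag sequence); the 0% error rate simply mirrors the outcome reported for Figure 7:

```python
def rule_good(tags):
    """Invented example of an induced rule: a sentence is classified
    'good' if and only if its tag sequence contains NOUN."""
    return "NOUN" in tags

# Held-out test set: (tag sequence, True = good / False = garbage).
test_set = [(["PRON", "VERB", "NOUN"], True),
            (["PRON", "VERB"], False),
            (["POSS", "NOUN"], True)]

errors = sum(1 for tags, label in test_set if rule_good(tags) != label)
error_rate = errors / len(test_set)
print(f"error rate: {error_rate:.0%}")
```

Counting disagreements between the rule's verdict and the preference-criterion label over the test set gives exactly the error-rate figure quoted for the matrix of Figure 7.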
- Figure 8 shows a computer system 800 that may be used to implement embodiments of the invention.
- the computer system 800 may be used to implement a grammar formulation mechanism and/or a grammar coverage tool .
- the computer system 800 may be used to provide at least one component of a spoken language interface.
- the computer system 800 may be used by a grammar writer to formulate grammars separately from any part of a spoken language interface.
- the computer system 800 comprises various data processing resources such as a processor (CPU) 830 coupled to a bus structure 838. Also connected to the bus structure 838 are further data processing resources such as read only memory 832 and random access memory 834.
- a display adapter 836 connects a display device 818 having screen 820 to the bus structure 838.
- One or more user-input device adapters 840 connect the user-input devices, including the keyboard 822 and mouse 824, to the bus structure 838.
- An adapter 841 for the connection of a printer 821 may also be provided.
- One or more media drive adapters 842 can be provided for connecting the media drives, for example the optical disk drive 814, the floppy disk drive 816 and hard disk drive 819, to the bus structure 838.
- One or more communications adapters 844 can also be provided; these act as processing resource interface means for connecting the computer system to one or more networks or to other computer systems or devices.
- the communications adapters 844 could include a local area network adapter, a modem and/or ISDN terminal adapter, or serial or parallel port adapter etc, as required.
- the processor 830 will execute computer program instructions that may be stored in one or more of the read only memory 832, the random access memory 834, the hard disk drive 819, a floppy disk in the floppy disk drive 816 and an optical disc, for example a compact disc (CD) or digital versatile disc (DVD), in the optical disc drive, or dynamically loaded via adapter 844.
- the results of the processing performed may be displayed to a user via the display adapter 836 and display device 818.
- User inputs for controlling the operation of the computer system 800 may be received via the user-input device adapters 840 from the user-input devices.
- a computer program for implementing various functions or conveying various information can be written in a variety of different computer languages and can be supplied on carrier media.
- a program or program element may be supplied on one or more CDs, DVDs and/or floppy disks and then stored on a hard disk, for example.
- a program may also be embodied as an electronic signal supplied on a telecommunications medium, for example over a telecommunications network.
- the described methods may be implemented using a software-controlled programmable processing device such as a Digital Signal Processor, microprocessor, other processing device, data processing apparatus or computer system.
- a computer program for configuring a programmable device, apparatus or system to implement the foregoing described methods is envisaged as an aspect of the present invention.
- the computer program may be embodied as source code and undergo compilation for implementation on a processing device, apparatus or system, or may be embodied as object code, for example.
- the term computer system in its most general sense encompasses programmable devices such as those referred to above, and data processing apparatus and firmware-embodied equivalents, whether part of a distributed computer system or not.
- Software components may be implemented as plug-ins, modules and/or objects, for example, and may be provided as a computer program product stored on a carrier medium in machine or device readable form.
- a computer program may be stored, for example, in solid-state memory, magnetic memory such as disc or tape, optically or magneto-optically readable memory, such as compact disc read-only or read-write memory (CD-ROM, CD-RW) , digital versatile disc (DVD) etc., and the processing device utilises the program or a part thereof to configure it for operation.
- the computer program product may be supplied from a remote source embodied on a communications medium such as an electronic signal, radio frequency carrier wave or optical carrier wave.
- Such carrier media are also envisaged as aspects of the present invention.
- any communication link between a user and a mechanism, interface, tool and/or system according to aspects of the invention may be implemented using any available mechanisms, including mechanisms using one or more of: wired, WWW, LAN, Internet, WAN, wireless, optical, satellite, TV, cable, microwave, telephone, cellular etc.
- the communication link may also be a secure link.
- the communication link can be a secure link created over the Internet using public key cryptographic encryption techniques, or as an SSL link.
- Embodiments of the invention may also employ voice recognition techniques for identifying a user.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB0326763A GB2391993B (en) | 2001-04-30 | 2002-04-30 | System for generating the grammar of a spoken dialogue system |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB0110532A GB2375210B (en) | 2001-04-30 | 2001-04-30 | Grammar coverage tool for spoken language interface |
GB0110532.9 | 2001-04-30 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2002089113A1 true WO2002089113A1 (en) | 2002-11-07 |
Family
ID=9913726
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/GB2002/001962 WO2002089113A1 (en) | 2001-04-30 | 2002-04-30 | System for generating the grammar of a spoken dialogue system |
Country Status (2)
Country | Link |
---|---|
GB (2) | GB2375210B (en) |
WO (1) | WO2002089113A1 (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111435408B (en) * | 2018-12-26 | 2023-04-18 | 阿里巴巴集团控股有限公司 | Dialog error correction method and device and electronic equipment |
CN112992128B (en) * | 2021-02-04 | 2023-06-06 | 北京淇瑀信息科技有限公司 | Training method, device and system of intelligent voice robot |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP0813185A2 (en) * | 1996-06-14 | 1997-12-17 | Lucent Technologies Inc. | Compilation of weighted finite-state transducers from decision trees |
US5937385A (en) * | 1997-10-20 | 1999-08-10 | International Business Machines Corporation | Method and apparatus for creating speech recognition grammars constrained by counter examples |
WO2000078022A1 (en) * | 1999-06-11 | 2000-12-21 | Telstra New Wave Pty Ltd | A method of developing an interactive system |
-
2001
- 2001-04-30 GB GB0110532A patent/GB2375210B/en not_active Expired - Fee Related
-
2002
- 2002-04-30 GB GB0326763A patent/GB2391993B/en not_active Expired - Lifetime
- 2002-04-30 WO PCT/GB2002/001962 patent/WO2002089113A1/en not_active Application Discontinuation
Non-Patent Citations (1)
Title |
---|
LAWRENCE S ET AL: "Natural language grammatical inference: a comparison of recurrent neural networks and machine learning methods", CONNECTIONIST, STATISTICAL AND SYMBOLIC APPROACHES TO LEARNING FOR NATURAL LANGUAGE PROCESSING, CONNECTIONIST, STATISTICAL AND SYMBOLIC APPROACHES TO LEARNING FOR NATURAL LANGUAGE PROCESSING, MONTREAL, QUE., CANADA, AUG. 1995, 1996, Berlin, Germany, Springer-Verlag, Germany, pages 33 - 47, XP002204936, ISBN: 3-540-60925-3 * |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP2115735A1 (en) * | 2007-02-27 | 2009-11-11 | Nuance Communications, Inc. | Presenting supplemental content for digital media using a multimodal application |
WO2012039686A1 (en) * | 2010-09-24 | 2012-03-29 | National University Of Singapore | Methods and systems for automated text correction |
US9733901B2 (en) | 2011-07-26 | 2017-08-15 | International Business Machines Corporation | Domain specific language design |
US10120654B2 (en) | 2011-07-26 | 2018-11-06 | International Business Machines Corporation | Domain specific language design |
CN105845134A (en) * | 2016-06-14 | 2016-08-10 | 科大讯飞股份有限公司 | Spoken language evaluation method through freely read topics and spoken language evaluation system thereof |
EP3502923A1 (en) * | 2017-12-22 | 2019-06-26 | SoundHound, Inc. | Natural language grammars adapted for interactive experiences |
US11900928B2 (en) | 2017-12-23 | 2024-02-13 | Soundhound Ai Ip, Llc | System and method for adapted interactive experiences |
CN112069797A (en) * | 2020-09-03 | 2020-12-11 | 阳光保险集团股份有限公司 | Voice quality inspection method and device based on semantics |
CN112069797B (en) * | 2020-09-03 | 2023-09-01 | 阳光保险集团股份有限公司 | Voice quality inspection method and device based on semantics |
Also Published As
Publication number | Publication date |
---|---|
GB2391993B (en) | 2005-04-06 |
GB2391993A (en) | 2004-02-18 |
GB2375210A (en) | 2002-11-06 |
GB0110532D0 (en) | 2001-06-20 |
GB2375210B (en) | 2005-03-23 |
GB0326763D0 (en) | 2003-12-24 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US7606714B2 (en) | Natural language classification within an automated response system | |
EP1602102B1 (en) | Management of conversations | |
AU2019376649B2 (en) | Semantic artificial intelligence agent | |
US10623572B1 (en) | Semantic CRM transcripts from mobile communications sessions | |
US20050033582A1 (en) | Spoken language interface | |
US20040260543A1 (en) | Pattern cross-matching | |
US20210350384A1 (en) | Assistance for customer service agents | |
US20220101835A1 (en) | Speech recognition transcriptions | |
WO2002089112A1 (en) | Adaptive learning of language models for speech recognition | |
WO2002089113A1 (en) | System for generating the grammar of a spoken dialogue system | |
CN114283810A (en) | Improving speech recognition transcription | |
US11132695B2 (en) | Semantic CRM mobile communications sessions |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AK | Designated states |
Kind code of ref document: A1 Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ OM PH PL PT RO RU SD SE SG SI SK SL TJ TM TN TR TT TZ UA UG US UZ VN YU ZA ZM ZW |
|
AL | Designated countries for regional patents |
Kind code of ref document: A1 Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG |
|
ENP | Entry into the national phase |
Ref document number: 0326763 Country of ref document: GB Kind code of ref document: A Free format text: PCT FILING DATE = 20020430 Format of ref document f/p: F |
|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | ||
DFPE | Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101) | ||
REG | Reference to national code |
Ref country code: DE Ref legal event code: 8642 |
|
122 | Ep: pct application non-entry in european phase | ||
NENP | Non-entry into the national phase |
Ref country code: JP |
|
WWW | Wipo information: withdrawn in national office |
Country of ref document: JP |