ON-LINE ENVIRONMENTAL AND SPEAKER MODEL ADAPTATION
The present invention relates to an on-line environmental and speaker model adaptation arrangement particularly suited for use with an automated speech recognition system.
BACKGROUND
Automated speech recognition is a complex task in itself. Automated speech understanding sufficient to provide automated dialogue with a user adds a further layer of complexity.
In this specification the term "automated speech recognition system" will refer to automated or substantially automated systems which perform automated speech recognition and also attempt automated speech understanding, at least to predetermined levels sufficient to provide a capability for at least limited automated dialogue with a user. A generalized diagram of a commercial grade automated speech recognition system as can be used for example in call centres and the like is illustrated in Fig. 1.
With advances in digital computers and a significant lowering in cost per unit of computing capacity there have been a number of attempts in the commercial marketplace to install such automated speech recognition systems implemented substantially by means of digital computers.
However, to date, there remain problems in achieving 100% recognition and/or 100% understanding in real time.
A particular problem in hosted speech recognition solutions is that, where multiple calls are being handled simultaneously, it is not possible to adapt a single model set concurrently. One solution is to retrain the models periodically so that they better represent new data; however, this approach has not been particularly successful.
It is an object of the present invention to address or ameliorate one or more of the abovementioned disadvantages.
BRIEF DESCRIPTION OF INVENTION
Accordingly, in one broad form of the invention there is provided in a speech recognition system of the type adapted to process utterances from a caller or user by way of a recogniser, an utterance processing system and a dialogue processing system so as to produce responses to said utterances, an on-line environmental and speaker model adaptation arrangement wherein a plurality of models operating on a plurality of respective conversations each adapt differently to each conversation over a predetermined period of time subsequent to which the adapted models are tested and the best model according to a predetermined criterion is selected as the initial model to be applied commencing at a second predetermined period of time.
Preferably adaptation is performed by a Maximum Likelihood Linear Regression (MLLR) process.
BRIEF DESCRIPTION OF DRAWINGS
Embodiments of the present invention will now be described with reference to the accompanying drawings wherein:
Fig. 1 is a generalized block diagram of a prior art automated speech recognition system;
Fig. 2 is a generalized block diagram of an automated speech recognition system suited for use in conjunction with an embodiment of the present invention;
Fig. 3 is a more detailed block diagram of the utterance processing and dialogue processing portions of the system of Fig. 2;
Fig. 4 is a block diagram of the recogniser portion of the system of Fig. 2 incorporating an arrangement in accordance with a first preferred embodiment of the present invention.
DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS
With reference to Fig. 2 there is illustrated a generalized block diagram of an automated speech recognition system 10 adapted to receive human speech derived from user 11, and to process that speech with a view to recognizing and understanding the speech to a
sufficient level of accuracy that a response 12 can be returned to user 11 by system 10. In the context of systems to which embodiments of the present invention are applicable the response 12 can take the form of an auditory communication, a written or visual communication or any other form of communication intelligible to user 11 or a combination thereof.
In all cases input from user 11 is in the form of a plurality of utterances 13 which are received by transducer 14 (for example a microphone) and converted into an electronic representation 15 of the utterances 13. In one exemplary form the electronic representation 15 comprises a digital representation of the utterances 13 in .WAV format. Each electronic representation 15 represents an entire utterance 13. The electronic representations 15 are processed through front end processor 16 to produce a stream of vectors 17, one vector for example for each 10ms segment of utterance 13. The vectors 17 are matched against knowledge base vectors 18 derived from knowledge base 19 by back end processor 20 so as to produce ranked results 1-N in the form of N best results 21. The results can comprise, for example, subwords, words or phrases, depending on the application. N can vary from 1 to very high values, again depending on the application. An utterance processing system 26 receives the N best results 21 and begins the task of assembling the results
into a meaning representation 25 for example based on the data contained in language knowledge database 31.
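The 10ms segmentation performed by front end processor 16 can be sketched as follows. This is a toy illustration only: the 8 kHz sample rate and the single log-energy feature are assumptions, and a real front end would emit richer multi-dimensional vectors (for example cepstral coefficients) per segment.

```python
import math

def frame_features(samples, sample_rate=8000, frame_ms=10):
    """Split a waveform into fixed 10 ms frames and compute one toy
    feature (log energy) per frame.  Illustrative only: production
    front ends emit multi-dimensional vectors (e.g. MFCCs)."""
    frame_len = sample_rate * frame_ms // 1000  # samples per frame
    vectors = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(s * s for s in frame) / frame_len  # mean power
        vectors.append(math.log(energy + 1e-10))  # avoid log(0)
    return vectors
```

Twenty milliseconds of audio at 8 kHz (160 samples) would thus yield two feature values, one per 10ms segment, forming the stream of vectors 17 passed to back end processor 20.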
The utterance processing system 26 orders the resulting tokens or words 23 contained in N-best results 21 into a meaning representation 25 of token or word candidates which are passed to the dialogue processing system 27 where sufficient understanding is attained so as to permit functional utilization of speech input 15 from user 11 for the task to be performed by the automated speech recognition system 10. In this case the functionality includes attaining sufficient understanding to permit at least a limited dialogue to be entered into with user/caller 11 by means of response 12 in the form of prompts so as to elicit further speech input from the user 11. In the alternative or in addition, the functionality can include, for example, sufficient understanding to permit interaction with extended databases for data identification.
Fig. 3 illustrates further detail of the system of Fig. 2 including listing of further functional components which make up the utterance processing system 26 and the dialogue processing system 27 and their interaction. Like components are numbered as for the arrangement of Fig. 2.
The utterance processing system 26 and the dialogue processing system 27 together form a natural language processing system. The utterance processing system 26 is
event-driven and processes each of the utterances 13 of caller/user 11 individually. The dialogue processing system 27 puts any given utterance 13 of caller/user 11 into the context of the current conversation (usually a telephone conversation). Broadly, in a telephone answering context, it will try to resolve the caller's query and decide on an appropriate answer to be provided by way of response 12.
The utterance processing system 26 takes as input the output of the acoustic or speech recogniser 30 and produces a meaning representation 25 for passing to dialogue processing system 27.
In a typical, but not limiting, form the meaning representation 25 can take the form of value pairs. For example, the utterance "I want to go from Melbourne to Sydney on Wednesday" may be presented to the dialogue processing system 27 in the form of three value pairs, comprising:
1. Start; Melbourne
2. Destination; Sydney
3. Date; Wednesday
where, in this instance, the components Melbourne, Sydney and Wednesday of the value pairs 24 comprise tokens or words 23. With particular reference to Fig. 3, the recogniser 30 provides as output N best results 21 usually in the form of
tokens or words 23 to the utterance processing system 26 where they are first disambiguated by language model 32. In one form the language model 32 is based on trigrams with cut-off. Analyser 33 specifies how words derived from language model 32 can be grouped together to form meaningful phrases which are used to interpret utterance 13. In one form the analyser is based on a series of simple finite state automata which produce robust parses of phrasal chunks, for example noun phrases for entities and concepts and WH-phrases for questions and dates. Analyser 33 is driven by grammars such as meta-grammar 34. The grammars themselves must be tailored for each application and can be thought of as data created for a specific customer. The resolver 35 then uses semantic information associated with the words of the phrases recognized as relevant by the analyser 33 to refine the meaning representation 25 into its final form for passing to the dialogue flow controller 36 within dialogue processing system 27.
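By way of a non-authoritative sketch, the step of turning recognised tokens 23 into value pairs of the kind shown above might look as follows. The trigger-word table is a made-up stand-in for the grammars and semantic information actually used by the analyser and resolver, which in practice are tailored per application.

```python
# Hypothetical trigger table standing in for application grammars:
# the word before a value signals which slot the value fills.
SLOT_TRIGGERS = {"from": "Start", "to": "Destination", "on": "Date"}

def extract_value_pairs(tokens):
    """Toy extraction of slot/value pairs from a token stream.
    Later matches overwrite earlier ones, so in "to go ... to Sydney"
    the spurious pair (Destination, go) is superseded by Sydney."""
    pairs = {}
    for i, tok in enumerate(tokens):
        slot = SLOT_TRIGGERS.get(tok.lower())
        if slot and i + 1 < len(tokens):
            pairs[slot] = tokens[i + 1]
    return pairs
```

Applied to the sample utterance above, this yields the three value pairs Start/Melbourne, Destination/Sydney and Date/Wednesday.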
The dialogue processing system 27, in this instance with reference to Fig. 3, receives meaning representation 25 from resolver 35 and processes the dialogue according to the appropriate dialogue models. Again, dialogue models will be specific to different applications but some may be reusable. For example a protocol model may handle
greetings, closures, interruptions, errors and the like across a number of different applications.
The dialogue flow controller 36 uses the dialogue history to keep track of the interactions. The logic engine 37, in this instance, creates SQL queries based on the meaning representation 25. Again it will be dependent on the specific application and its domain knowledge base.
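A minimal sketch of how logic engine 37 might map the meaning representation 25 onto an SQL query follows. The table and column names are invented for illustration; in practice they depend on the specific application and its domain knowledge base.

```python
def build_query(pairs):
    """Hypothetical logic-engine step: turn slot/value pairs into a
    parameterised SQL query.  Named placeholders (:origin etc.) avoid
    interpolating caller-supplied values into the SQL text."""
    columns = {"Start": "origin", "Destination": "dest", "Date": "day"}
    conds = " AND ".join(
        f"{columns[s]} = :{columns[s]}" for s in sorted(pairs))
    params = {columns[s]: v for s, v in pairs.items()}
    return f"SELECT service FROM timetable WHERE {conds}", params
```

For example, the pairs Start/Melbourne and Date/Wednesday would produce a query over the assumed `timetable` table with two conditions and a matching parameter dictionary.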
The generator 38 produces responses 12 (for example, speech output). In the simplest form the generator 38 can utilize a generic text to speech (TTS) system to produce a voiced response.
Language knowledge database 31 comprises, in the instance of Fig. 3, a lexicon 39 operating in conjunction with database 40. The lexicon 39 and database 40, operating in conjunction with knowledge base mapping tools 41 and, as appropriate, language model 32 and grammars 34, together constitute a language knowledge database 31 or knowledge base which deals with domain specific data. The structure and grouping of data is modeled in the knowledge base 31.
Database 40 comprises raw data provided by a customer. In one instance this data may comprise names, addresses, places, dates and is usually organised in a way that logically relates to the way it will be used. The database 40 may remain unchanged or it may be updated throughout the lifetime of an application. Functional implementation can
be by way of database servers such as MySQL, Oracle or Postgres.
As will be observed particularly with reference to
Fig. 3, interaction between a number of components in the system can be quite complex with lexicon 39, in particular, being used by and interacting with multiple components of
System 10.
With reference to Fig. 4 there is illustrated in schematic form the basic procedure adopted by an on-line environmental and speaker model adaptation arrangement 910 operable, in this instance, from back end 20 of recogniser
30.
Broadly, where system 10 is handling a plurality of telephone conversations 911, in this instance conversations 1-N, each has a respective speech model 912, comprising models 1-N, to each of which, over a first predetermined period of time 913, an adaptation process is applied, in this instance an MLLR adaptation process, thereby to improve recognition. Because the conversations 911 are running concurrently, each respective model 912 will adapt differently over first period 913. At the end of first period 913 all models 912 are tested by test and select means 914 thereby to select the best performing model of the models 912 in accordance with a predetermined criterion. The best model 912A then
becomes the initial model for the start of the next predetermined period 915.
In this instance, procedure 910 is based on a technique called MLLR (Maximum Likelihood Linear Regression).
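The core of MLLR is a linear transform of the Gaussian mean vectors of the acoustic models: each mean mu is replaced by A*mu + b, where the matrix A and bias b are estimated by maximum likelihood from the adaptation data. The following sketch shows only the mean update step; the estimation of A and b itself is omitted, and pure-Python matrix arithmetic is used for clarity.

```python
def mllr_adapt_means(means, A, b):
    """Apply an MLLR mean transform: each Gaussian mean vector mu
    becomes A*mu + b.  In a real system A and b would be estimated
    from the new speech data; here they are simply supplied."""
    def matvec(M, v):
        # plain matrix-vector product, no external dependencies
        return [sum(M[i][j] * v[j] for j in range(len(v)))
                for i in range(len(M))]
    return [[x + bi for x, bi in zip(matvec(A, mu), b)] for mu in means]
```

With A set to the identity, the transform reduces to a pure bias shift of every mean, which is the simplest special case of the adaptation.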
System 10 can perform speech recognition in a hosted environment. In a given day it handles numerous phone calls for different applications. During each call, speech models are adapted using the MLLR algorithm. However, because the system handles several calls simultaneously, it ends up with several adapted model sets. The recombination of these sets is difficult, yet it must be done.
The arrangement 910 recombines these in the following fashion:
At the end of each day (or any predetermined time 913) all adapted model sets 912 are collected, and each is tested with a standard test algorithm. This algorithm tests the performance of each adapted model set 912. The model set that gives the best performance is then used as the new initial model 912A for the next day.
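Under stated assumptions (a per-call `adapt` step and an `evaluate` criterion such as word accuracy on standard test utterances, both supplied by the application), the end-of-period recombination described above can be sketched as:

```python
def end_of_period_select(initial_model, calls, adapt, evaluate):
    """Sketch of arrangement 910: each concurrent call adapts its own
    copy of the initial model; at the end of the predetermined period
    every adapted model set is scored and the best-scoring set becomes
    the initial model for the next period."""
    adapted = [adapt(dict(initial_model), call) for call in calls]
    return max(adapted, key=evaluate)
```

The selected model set then plays the role of model 912A, seeding all conversations in the following period.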
During this process, if any calls come in, recognition is performed as normal, but no adaptation takes place.
The main benefit in this solution is that the speech models will improve over time, giving better speech recognition performance.
Other benefits are:
• Adaptation to a noisy environment
• Adaptation to individual speaker styles
In this instance, the adaptation is based on MLLR, which is a well-known adaptation process for hidden Markov models (HMMs). It is essentially a method of determining the distribution of speech features in new speech data, and then modifying the parameters of the old models to better fit the new speech data.

The above describes only some embodiments of the present invention and modifications, obvious to those skilled in the art, can be made thereto without departing from the scope and spirit of the present invention.