GB2465384A - A speech recognition based method and system for retrieving data - Google Patents

A speech recognition based method and system for retrieving data

Info

Publication number
GB2465384A
GB2465384A GB0820912A
Authority
GB
United Kingdom
Prior art keywords
information store
speech
grapheme
matches
database
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
GB0820912A
Other versions
GB2465384B (en)
GB0820912D0 (en)
Inventor
Matthew Stuttle
Catherine Breslin
Kate Knill
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Toshiba Europe Ltd
Original Assignee
Toshiba Research Europe Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Toshiba Research Europe Ltd filed Critical Toshiba Research Europe Ltd
Priority to GB0820912A priority Critical patent/GB2465384B/en
Publication of GB0820912D0 publication Critical patent/GB0820912D0/en
Publication of GB2465384A publication Critical patent/GB2465384A/en
Application granted granted Critical
Publication of GB2465384B publication Critical patent/GB2465384B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L15/187 Phonemic context, e.g. pronunciation rules, phonotactical constraints or phoneme n-grams
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/242 Query formulation
    • G06F16/243 Natural language query formulation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M1/00 Substation equipment, e.g. for use by subscribers
    • H04M1/26 Devices for calling a subscriber
    • H04M1/27 Devices whereby a plurality of signals may be stored simultaneously
    • H04M1/271 Devices whereby a plurality of signals may be stored simultaneously controlled by voice recognition
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M3/00 Automatic or semi-automatic exchanges
    • H04M3/42 Systems providing special services or facilities to subscribers
    • H04M3/487 Arrangements for providing information services, e.g. recorded voice services or time announcements
    • H04M3/493 Interactive information services, e.g. directory enquiries; Arrangements therefor, e.g. interactive voice response [IVR] systems or voice portals
    • H04M3/4936 Speech interaction details
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/025 Phonemes, fenemes or fenones being the recognition units

Abstract

A speech based method for retrieving data from an information store, the method comprising: receiving a speech input; providing said speech input to a speech processor, said speech processor comprising an acoustic model using a grapheme loop to produce a plurality of likely matches to said speech input; outputting a plurality of likely matches from said speech processor to an information store to be searched; searching said information store using said plurality of matches from said acoustic model in said information store and outputting the results from said search. The grapheme loop works by looping through the words in the input speech: it repeatedly scores each grapheme in each word to identify the possible graphemes which could contribute to each word, and assigns a confidence value to each possibility.

Description

A Speech Based Method and System for Retrieving Data

The present invention is concerned with the general field of speech recognition. More specifically, the present invention is concerned with a speech recognition system which can be used to retrieve data from an information store in response to a voice command.
The ability to access information using vocal commands is now becoming commonplace in many types of electronic equipment, for example mobile telephones, automated switchboard services, automated music retrieval, automated car navigation guides etc. Traditional ASR systems have been developed for these applications which use phonetic units (phonemes) to model the speech, together with a phonetic dictionary to map from the input words to the phonetic characters. A recognition grammar/network is used together with the dictionary to constrain the possible recognition results.
For very large databases and low-resource applications, some systems offer open/unlimited vocabulary access by using a phoneme loop (any phoneme can appear in any order) and then use a phonetic dictionary to find database entries in the output results.
Figure 6 is a schematic showing phoneme loop grammar, where the speaker speaks the word "tide" and the acoustic model operating the phoneme loop grammar returns the scores.
A phonetic dictionary is then used to find database entries in the output results. It is possible to improve the phonetic scoring further by implementing an "n-gram based phone/language model".
This type of system suffers from the problem that it requires a phonetic dictionary in order to convert words to phonemes. Words not covered (out of vocabulary) will require an automatic pronunciation generation system. This is sometimes called LTS (letter to sound) or G2P (grapheme to phoneme). Any unusual or irregular words will cause problems during recognition.
For a phoneme loop system, all possible database keys must have a pronunciation generated before the output of the ASR system can be used for the search, as shown in figures 7a and 7b. This requires that either the application has a phonetic dictionary interface, or that a list of keys be provided to the speech recognition system (or that the application has a G2P system). Hence, the functions of the speech recognizer and the target application become tightly linked, and a large amount of information must be shared between the two systems.
The present invention uses a grapheme based system to address the above problems.
Grapheme based systems are known. However, in the past grapheme based systems have mainly been used for spoken word detection, in which a large audio database is searched. Prior to any search being performed, the audio database is indexed using graphemes. This is an offline process, and grapheme loops are not used in real time.
In Schillo et al., "Grapheme Based Speech Recognition for Large Vocabularies", ICSLP 2000, a grapheme based system for an in-car unit is described which operates using city names etc. A single graphemic chain is output, and this single result is compared with the database to find the n-best matches.
Although the system is grapheme based, it still suffers from the problem that the matching between the search results and the inputted text is inherently done within the recognizer. Schillo uses a grapheme language model, but it must be trained on task-specific data.
The present invention has been developed to address at least some of the above problems and in a first aspect provides a speech based method for retrieving data from an information store, the method comprising: receiving a speech input; providing said speech input to a speech processor, said speech processor comprising an acoustic model using a grapheme loop to produce a plurality of likely matches to said speech input; outputting a plurality of likely matches from said speech processor to an information store to be searched; searching said information store using said plurality of matches from said acoustic model in said information store and outputting the results from said search.
The above provides the n best matches to an application where the application will return information in response to said matches. By giving the application a plurality of matches, errors in the decoding of the input speech and misspellings can be corrected by the application.
Thus, the speech processor does not require information about the information store to be searched. Instead the speech processor supplies more information to the application to allow the application itself to find the best matches.
In one embodiment, the plurality of likely matches is provided to said information store with a confidence score related to each match.
To allow a match which comprises a plurality of words (i.e. conventional words), said grapheme loop preferably comprises a between word symbol such that a likely match may comprise a plurality of separate words.
The use of the grapheme loop means that the acoustic model has no information concerning the specific contents of the information store to be searched.
Preferably a grapheme language model is applied to the output of said acoustic model to introduce language constraints when determining the plurality of likely matches.
Unlike the prior art, the language model is not pre-trained using information from said information store to be searched. However, feedback from the information store can be provided to the language model on-line during searching. For example, keywords which remain fixed in time may be fed back. Supplying a small number of words to the speech processor does not degrade the performance. However, in some embodiments there is no feedback from the application.
Preferably, the acoustic model is a context dependent grapheme loop.
The search results output from the information store may take into account the best match to the data in the information store as well as the likelihood of a match determined by the speech processor.
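By way of illustration, one simple way to combine these two signals is a weighted blend of the recognizer's confidence and a similarity derived from the edit distance to a database key. The following Python sketch is illustrative only; the weighting scheme, parameter names and default weight are assumptions rather than part of the described system.

    def combined_score(posterior: float, edit_distance: int, key_length: int,
                       lam: float = 0.5) -> float:
        """Blend recognizer confidence with closeness to a database key."""
        # Normalise edit distance into a [0, 1] similarity (illustrative choice).
        similarity = 1.0 - edit_distance / max(key_length, 1)
        return lam * posterior + (1.0 - lam) * similarity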
In a second aspect, the present invention provides a system for retrieving information from an information store using a speech input, the system comprising: a speech input configured to receive speech; a speech processor configured to receive said speech input, said speech processor comprising an acoustic model using a grapheme loop to produce a plurality of likely matches to said speech input; said speech processor being further adapted to output a plurality of likely matches to an information store to be searched; the system further comprising a processor adapted to search said information store using said plurality of matches from said acoustic model in said information store and outputting the results from said search.
The present invention will now be described with reference to the following non-limiting embodiments in which:
Figure 1 is a schematic of a basic speech recognition system interfacing to an application in accordance with an embodiment of the present invention;
Figure 2 is a schematic of the standard components of a speech processor;
Figure 3 is a schematic of a grapheme loop for use in accordance with an embodiment of the present invention;
Figure 4 is a flow chart showing a method in accordance with an embodiment of the present invention;
Figure 5 is a schematic of a loop using feedback;
Figure 6 is a schematic of a known phoneme loop; and
Figures 7a and 7b are schematics of systems where the pronunciation of the database keys is generated.
Figure 1 is a schematic of a very basic speech recognition system. A user (not shown) speaks into microphone 1 or other collection device for an audio system. The device 1 could be substituted by a memory which contains audio data previously recorded, or the device 1 may be a network connection for receiving audio data from a remote location.
The speech signal is then directed into a speech processor 3 which will be described in more detail with reference to figure 2.
The speech processor 3 takes the speech signal and turns it into text corresponding to the speech signal. The output of the speech processor 3 is then directed into retrieval unit 5, which allows information to be retrieved from an information store 7. Examples of information stores are the internet, music retrieval databases, automated telephone directories, car navigation systems, mobile telephone data such as e-mails and contacts, and electronic program guides.
The speech processor outputs the n-best matches into the retrieval system 5.
Figure 2 is a block diagram of the standard components of a speech recognition processor 3 of the type shown in figure 1. The speech signal received from a microphone, through a network or from a recording medium 1 is directed into front-end unit 11.
The front end unit 11 digitises the received speech signal and splits it into frames of equal length. The speech signals are then subjected to a spectral analysis to determine various parameters which are plotted in an "acoustic space".
The front end unit 11 also removes signals which are believed not to be speech signals and other irrelevant information. Popular front end units comprise apparatus which use filter bank (FBANK) parameters, Mel-Frequency Cepstral Coefficients (MFCC) and Perceptual Linear Predictive (PLP) parameters. The output of the front end unit is in the form of an input vector in n-dimensional acoustic space.
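As an illustrative sketch of such a front end, the following Python fragment uses the librosa library to convert audio into a sequence of MFCC vectors, one per frame. The 25 ms window and 10 ms hop shown are common defaults, not values specified in this document.

    import librosa

    def extract_features(wav_path: str, sr: int = 16000, n_mfcc: int = 13):
        """Return one n_mfcc-dimensional feature vector per ~10 ms frame."""
        audio, sr = librosa.load(wav_path, sr=sr)
        mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc,
                                    n_fft=400, hop_length=160)  # 25 ms / 10 ms at 16 kHz
        return mfcc.T  # shape: (num_frames, n_mfcc)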
The input vector is then fed into a decoder 13 which cooperates with both an acoustic model section 15 and a language model section 17. The acoustic model section 15 will generally operate using Hidden Markov Models. However, it is also possible to use acoustic models based on connectionist models and hybrid models.
The acoustic model unit 15 derives the likelihood or other confidence measure of a sequence of observations corresponding to a word or part thereof on the basis of the acoustic input alone.
The language model section 17 contains information concerning probabilities of a certain sequence of words or parts of words following each other in a given language.
Generally a static model is used. The most popular method is the N-gram model.
The decoder 13 then traditionally uses a dynamic programming (DP) approach to find the best transcription for a given speech utterance using the results from the acoustic model 15 and the language model 17.
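The dynamic programming search is typically a Viterbi pass over the model states. The toy Python implementation below illustrates the idea on dense matrices; real decoders operate over large networks with beam pruning, and the array layout here is an assumption made for illustration.

    import numpy as np

    def viterbi(log_trans, log_emit, log_init):
        """log_trans: (S, S), log_emit: (T, S), log_init: (S,). Return best state path."""
        T, S = log_emit.shape
        delta = log_init + log_emit[0]           # best score ending in each state
        back = np.zeros((T, S), dtype=int)       # backpointers
        for t in range(1, T):
            scores = delta[:, None] + log_trans  # (S, S): previous state -> current state
            back[t] = scores.argmax(axis=0)
            delta = scores.max(axis=0) + log_emit[t]
        path = [int(delta.argmax())]
        for t in range(T - 1, 0, -1):            # trace backpointers
            path.append(int(back[t][path[-1]]))
        return path[::-1]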
The output is then provided in terms of the n-best matches to an information retrieval system.
This description will be mainly concerned with the use of an acoustic model which is a Hidden Markov Model (HMM). However, other types of model could also be used.
In an HMM, once the model parameters have been determined, the model can be used to determine the likelihood of a sequence of observations corresponding to a sequence of units.
In an embodiment of the present invention, the acoustic model is a grapheme model which uses graphemes as fundamental units of recognition.
A grapheme based acoustic model is then used with grapheme-loop grammar. All words or phrases may be embedded within the grapheme loop, or the grapheme loop may form part of a larger grammar.
In speech recognition there is a grammar of allowable terms; this is what is searched to find the most likely text string given the speech input. For example, consider the phrase: Please play (Madonna | The Beatles | Queen), where "|" indicates different alternatives.
It is also possible to specify repetitions of phrases: Please play ((Madonna | The Beatles | Queen) and) <1->, where <1-> indicates the number of repetitions (in this case, 1 upwards). This would cover phrases such as "please play Madonna and The Beatles and Queen and The Beatles and...".
The term "loop" is used to identify this repetition.
The above example relates to a "word loop" grammar. In the embodiment of the present invention, the "words" are replaced by graphemes (i.e. letters). In a grapheme loop, the recognition grammar is all of these letters as alternatives, in a loop: (a | b | c | ...) <1->. Thus, the possible recognition space is any combination of graphemes.
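Concretely, a grapheme loop can be represented as a transition structure in which any grapheme may follow any other. The Python sketch below is a minimal illustration; the grapheme inventory and the "_" symbol (standing in for the between-word symbol discussed earlier) are assumptions.

    import string

    # "_" is an assumed between-word symbol, so a hypothesis may span several words.
    GRAPHEMES = list(string.ascii_lowercase) + ["_"]

    def loop_successors(symbol: str) -> list:
        """In a pure grapheme loop, any grapheme may follow any other."""
        return GRAPHEMES

    def in_recognition_space(hypothesis: str) -> bool:
        """Any non-empty string over the grapheme set is a valid loop output."""
        return bool(hypothesis) and all(ch in GRAPHEMES for ch in hypothesis)

    print(in_recognition_space("tide"), in_recognition_space("play_queen"))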
Figure 3 shows a schematic of a grapheme loop. The grapheme loop 51 uses grapheme loop grammar which, as explained above, is a recognition grammar which allows all of the letters to be checked as alternatives in a loop.
This grammar is used to find the parts of the acoustic model to score the input hypothesis. For context independent systems, this means all the letter/grapheme models. If the subword units are context dependent, then models are used with the relevant contexts.
Context dependency means increasing the number of subword units in the acoustic model by considering the units before and after the one under consideration. Thus, rather than treating all "g"s the same, a "g" with an "h" after it will be modelled differently to a "g" with an "e" afterwards.
Context dependency is important for phone-based systems as the type of sound changes dependent on the previous sound, since lips and vocal cords etc. take time to move between different types of sounds. In grapheme systems context dependency is even more important, as the context will indicate the type of sound. This helps to disambiguate sound differences (a "g" on its own will be pronounced very differently to a "g" forming part of a "gh" sound).
Context dependent models involve considering more units for each input frame.
Generally a left to right context will be used, taking into account the letters to the left and right of the grapheme. When the left and right context is used, these are referred to as trigrapheme units.
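As an illustration, a grapheme string can be expanded into trigrapheme units as follows. The "left-centre+right" notation is borrowed from HTK-style triphone naming and is an illustrative assumption, not a convention defined in this document.

    def to_trigraphemes(word: str, boundary: str = "#") -> list:
        """'tide' -> ['#-t+i', 't-i+d', 'i-d+e', 'd-e+#']"""
        padded = boundary + word + boundary  # pad so edge letters get a context
        return [f"{padded[i-1]}-{padded[i]}+{padded[i+1]}"
                for i in range(1, len(padded) - 1)]

    print(to_trigraphemes("tide"))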
Figure 3 is an example of a grapheme loop where the input speech is the word "tide".
The word tide comprises four graphemes t: i: d: e:. The grapheme loop shown in figure 3 will loop through the input speech, repeatedly scoring each grapheme to identify the possible graphemes which could contribute to the word uttered in the input signal.
The output of such a grapheme loop could be:
1) t:a:i:g:e:
2) t:i:a:d:
3) t:i:d:e:
The grapheme loop will have determined that each of the above outputs is possible (in reality many different outputs may be produced), and these three outputs will each have a different confidence score associated with them which indicates whether they are more or less likely than other words returned by the system. The confidence scores may be likelihood scores, posterior probability scores etc.
The recognizer may purely use this grapheme loop grammar and output a plurality of the most likely results to a database or other information store for searching. However, the grapheme loop acoustic model may also be used in conjunction with a grapheme language model. Typically, if a language model is used, it will be a grapheme based n-gram language model.
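A minimal sketch of how such confidence scores could be produced is shown below: per-hypothesis log-likelihoods are normalised into posterior-style confidences with a softmax. The hypothesis strings and score values are illustrative assumptions.

    import math

    def nbest_with_confidence(hyps: dict) -> list:
        """hyps maps hypothesis -> log-likelihood; return (hyp, posterior) pairs."""
        m = max(hyps.values())
        exp = {h: math.exp(s - m) for h, s in hyps.items()}  # numerically stable softmax
        z = sum(exp.values())
        return sorted(((h, e / z) for h, e in exp.items()),
                      key=lambda p: p[1], reverse=True)

    print(nbest_with_confidence({"t:i:d:e:": -10.2, "t:i:a:d:": -11.5,
                                 "t:a:i:g:e:": -12.1}))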
Using a word loop grammar or grapheme loop grammar means that any combination of words or letters can be considered. This means that very unlikely combinations can be considered. To reduce the number of unlikely combinations, a language model is applied in the recognition process. It can be thought of as an additional score which is applied to the system based on the current or hypothesised word given what has come before. For example, for the sequence "the ball is ...", the language model score for "red" will be higher than for "rude", so even if the acoustic model gets it wrong, the language model will fix it.
In a grapheme loop system there are no words, therefore the language model operates on graphemes. Thus, the language model scores are based on the sequence of letters that precedes the one under consideration. For example, if the sequence "oug" has been received, the next letter is much more likely to be "h" than "z" or "t".
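The following sketch estimates such next-letter probabilities with a trigrapheme (letter trigram) model trained on a toy corpus. The corpus, the add-alpha smoothing constant and the "#" padding convention are illustrative assumptions.

    from collections import Counter

    def train_trigram(corpus: list):
        ctx, full = Counter(), Counter()
        for word in corpus:
            w = f"##{word}"                       # pad so every letter has a 2-letter context
            for i in range(2, len(w)):
                ctx[w[i-2:i]] += 1                # count of the 2-letter history
                full[w[i-2:i+1]] += 1             # count of history + next letter
        return ctx, full

    def p_next(ctx, full, history: str, letter: str, alpha=0.01, v=27) -> float:
        h = history[-2:].rjust(2, "#")
        return (full[h + letter] + alpha) / (ctx[h] + alpha * v)  # add-alpha smoothing

    ctx, full = train_trigram(["though", "rough", "tough", "bought", "ought"])
    print(p_next(ctx, full, "oug", "h"), ">", p_next(ctx, full, "oug", "z"))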
Figure 4 shows the steps of a method in accordance with an embodiment of the present invention.
A search engine query is inputted via speech. For example, a search engine enquiry may be inputted into an Internet search engine. The input speech is then parameterised to produce an observation vector in acoustic space in step S103, as previously explained with reference to figure 2.
Next, an acoustic model is run using a grapheme loop in step S105. The grapheme loop has been described with reference to figure 3. The output is then subjected to a grapheme based language model.
The n-best results are then provided to the search engine in step S107. The search engine can directly search the n-best results/lattice for matching keys and sort the results according to a detailed internal knowledge of the system state etc. The results can also be searched using, for example, a minimum edit distance metric to handle any misspellings in the original data. The results are output in step S109. The system may feed the best results back to the language model to improve performance for future searches. For example, the application may provide information about correct matches to improve the recognition performance. However, this would not be desirable in a very large system where doing a comparison against all search terms would be exhaustive or impractical for the recogniser, and it would be unwieldy for the application to give all search terms to the ASR engine.
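A minimal sketch of the search-engine side follows: each n-best hypothesis is matched against the database keys with a standard minimum edit distance, so misspellings and decoding errors can still retrieve the intended entry. The key and hypothesis strings are illustrative assumptions.

    def edit_distance(a: str, b: str) -> int:
        """Levenshtein distance via a rolling one-row dynamic programme."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (ca != cb)))   # substitution
            prev = cur
        return prev[-1]

    def search(nbest: list, keys: list, top_k: int = 3):
        """Rank database keys by their best distance to any n-best hypothesis."""
        scored = [(min(edit_distance(h, k) for h in nbest), k) for k in keys]
        return sorted(scored)[:top_k]

    print(search(["tide", "tiad", "taige"], ["tide", "tidy", "time", "ride"]))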
The application which calls the speech recognizer (in this case, a search engine) does not need to provide any information to the recognition system, since the grapheme loop of the grapheme acoustic model in S105 is independent of the information store contents.
If the top recognition result does not match any item in the database, then the application can take a number of approaches to widen the search. For example, further n-best recognition results can be added as search terms. Alternatively, a metric such as minimum word edit distance can be used to find the best partial match in the database. This approach has various potential benefits, including a lower overhead in building an application, lower memory costs (since the recognition network is reduced in size) and a lower computational load (no recompilation is required when the database changes).
The method also does not have to be restricted to running in only a grapheme recognizer.
The grapheme loop can be combined with any other recognition network to produce a database search key: for example, a finite state grammar consisting of some or all of the words in a database, or a phone loop and/or statistical language model of some or all of the words in the database. For this arrangement, the search engine can provide some data back to the grapheme based language model in order to help identify better results.
When the grapheme loop is used in combination with a phonetic or word network built from the database contents, the application can choose what information is passed to the speech recognition system to create the phone/word network. In this way, the application can restrict the amount of information passed, for example only providing the most frequently accessed database entries. The method therefore reduces the amount of information that needs to be shared between the database, application and speech recognition system. This has potential benefits in terms of memory (reduced size of recognition networks) and computational load (controls timing and size of recompilation).
Figure 5 shows an example where there is feedback of some keywords to the acoustic or language models. Preferably, keywords which do not change over time may be fed back.
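A minimal sketch of such a feedback path, assuming a trigram-count grapheme language model of the kind sketched earlier, is shown below. The keyword list and the simple count-update scheme are illustrative assumptions.

    from collections import Counter

    def feed_back_keywords(ctx: Counter, full: Counter, keywords: list) -> None:
        """Add trigram counts for confirmed, time-stable keywords to the model."""
        for word in keywords:
            w = f"##{word}"                       # same padding as at training time
            for i in range(2, len(w)):
                ctx[w[i-2:i]] += 1
                full[w[i-2:i+1]] += 1

    # Usage: after a successful search, reinforce keys that do not change over time.
    ctx, full = Counter(), Counter()
    feed_back_keywords(ctx, full, ["madonna", "beatles", "queen"])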

Claims (12)

CLAIMS:
  1. A speech based method for retrieving data from an information store, the method comprising: receiving a speech input; providing said speech input to a speech processor, said speech processor comprising an acoustic model using a grapheme loop to produce a plurality of likely matches to said speech input; outputting a plurality of likely matches from said speech processor to an information store to be searched; searching said information store using said plurality of matches from said acoustic model in said information store and outputting the results from said search.
  2. A method according to any preceding claim, wherein the plurality of likely matches is provided to said information store with a confidence score related to each match.
  3. A method according to any preceding claim, wherein said grapheme loop comprises a between word symbol such that a likely match may comprise a plurality of separate words.
  4. A method according to any preceding claim, wherein the acoustic model has no information concerning the specific contents of the information store to be searched.
  5. A method according to any preceding claim, further comprising applying a grapheme language model to the output of said acoustic model to introduce language constraints when determining the plurality of likely matches.
  6. A method according to claim 5, wherein the language model is not pre-trained using information from said information store to be searched.
  7. A method according to claims 5 and 6, wherein feedback from the information store is provided to the language model on-line during searching.
  8. A method according to any preceding claim, wherein the acoustic model is a context dependent grapheme loop.
  9. A method according to any preceding claim, wherein the information store is the internet, a telephone switchboard database, a music database, a car navigation system database, a telephone directory, an electronic programme guide, a video game database, spoken word databases, an email database.
  10. A method according to any preceding claim, wherein the search results output from the information store take into account the best match to the data in the information store as well as the likelihood of a match determined by the speech processor.
  11. A system for retrieving information from an information store using a speech input, the system comprising: a speech input configured to receive speech; a speech processor configured to receive said speech input, said speech processor comprising an acoustic model using a grapheme loop to produce a plurality of likely matches to said speech input; said speech processor being further adapted to output a plurality of likely matches to an information store to be searched; the system further comprising a processor adapted to search said information store using said plurality of matches from said acoustic model in said information store and outputting the results from said search.
  12. A system according to claim 1, wherein the information store is the internet, a telephone switchboard database, a music database, a car navigation system database, a telephone directory, an electronic programme guide, a video game database, spoken word databases, an email database.
GB0820912A 2008-11-14 2008-11-14 A speech based method and system for retrieving data Expired - Fee Related GB2465384B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB0820912A GB2465384B (en) 2008-11-14 2008-11-14 A speech based method and system for retrieving data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB0820912A GB2465384B (en) 2008-11-14 2008-11-14 A speech based method and system for retrieving data

Publications (3)

Publication Number Publication Date
GB0820912D0 (en) 2008-12-24
GB2465384A (en) 2010-05-19
GB2465384B (en) 2011-03-23

Family

ID=40194677

Family Applications (1)

Application Number Title Priority Date Filing Date
GB0820912A Expired - Fee Related GB2465384B (en) 2008-11-14 2008-11-14 A speech based method and system for retrieving data

Country Status (1)

Country Link
GB (1) GB2465384B (en)

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
"Context-dependent acoustic modeling using graphemes for large vocabulary speech recognition", Kanthak S; Ney H, 2002, IEEE International Conference on Acoustics, Speech, and Signal Processing, pg 845-8, vol.1. *
"Joint decoding for phoneme-grapheme continuous speech recognition", Magimai-Doss M; Bengio S; Bourlard H, 2004, IEEE International Conference on Acoustics, Speech, and Signal Processing, pg177-80, vol.1. *
"Phoneme-grapheme based speech recognition system", Doss M M; Stephenson T A; Bourlard H; Bengio S, 2003, IEEE Workshop on Automatic Speech Recognition and Understanding, pg 94 - 98. *
"Phoneme-to-grapheme conversion for out-of-vocabulary words in large vocabulary speech recognition", Decadt B; Duchateau J; Daelemans W; Wambacq P, 2001, IEEE Workshop on Automatic Speech Recognition and Understanding. ASRU 2001. Conference Proceedings, pg 413 - 416. *

Also Published As

Publication number Publication date
GB2465384B (en) 2011-03-23
GB0820912D0 (en) 2008-12-24


Legal Events

Date Code Title Description
PCNP Patent ceased through non-payment of renewal fee

Effective date: 20221114