GB2465383A - A speech recognition system using a plurality of acoustic models which share probability distributions - Google Patents

A speech recognition system using a plurality of acoustic models which share probability distributions

Info

Publication number
GB2465383A
Authority
GB
United Kingdom
Prior art keywords
acoustic
model
models
observations
distributions
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
GB0820908A
Other versions
GB2465383B (en)
GB0820908D0 (en)
Inventor
Catherine Breslin
Matthew Stuttle
Kate Knill
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Toshiba Europe Ltd
Original Assignee
Toshiba Research Europe Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Toshiba Research Europe Ltd filed Critical Toshiba Research Europe Ltd
Priority to GB0820908A
Publication of GB0820908D0
Publication of GB2465383A
Application granted
Publication of GB2465383B
Expired - Fee Related
Anticipated expiration


Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 — Speech recognition
    • G10L 15/08 — Speech classification or search
    • G10L 15/14 — Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L 15/142 — Hidden Markov Models [HMMs]
    • G10L 15/18 — Speech classification or search using natural language modelling
    • G10L 15/183 — Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L 15/187 — Phonemic context, e.g. pronunciation rules, phonotactical constraints or phoneme n-grams
    • G10L 15/26 — Speech to text systems

Abstract

A speech processing method comprising: receiving a speech input comprising a sequence of observations; determining the likelihood of a sequence of observations corresponding to a word or part thereof using a first acoustic model set with a first dictionary; determining the likelihood of a sequence of observations corresponding to a word or part thereof using a second acoustic model set with a second dictionary; and outputting text determined from said first and second acoustic models; wherein each model uses a plurality of pre-calculated probability distributions to determine the said likelihood and wherein probability distributions are shared between the models. The probability distributions may be Gaussian probability density functions associated with an acoustic model which may be a Hidden Markov Model (HMM). The acoustic models may be a phoneme model and a grapheme model. The observations may be converted into an n-dimensional feature vector in an n-dimensional space which is then further converted into a plurality of sub-vectors each within a reduced dimension subspace of said n-dimensional space, and said shared probability distributions may have been pre-calculated for said sub-spaces.

Description

A Speech Recognition Method and System

The present invention is concerned with the general field of speech recognition. More specifically, the present invention is concerned with the field of speech recognition methods and apparatus which operate using multiple acoustic models.
Speech recognition systems incorporating two or more acoustic models, so-called hybrid systems, are used in a number of situations. Combinations of multiple models, including the combination of grapheme and phoneme models, have been shown to yield improvements in accuracy. Grapheme based systems are derived directly from the letters of the words, whereas phoneme based systems are derived from an expert assessment of the word pronunciation. When considering a language such as English where there are different dialects, for example native English and American English, two phoneme based systems may yield better results than a single acoustic model.
When using a hybrid system, multiple sets of models must be stored. Thus, the amount of memory needed for storage increases in proportion with the number of models being combined.
Multiple Hidden Markov Models may be decoded either synchronously, where the models are constrained to be in the same state for each observation, or asynchronously, where this constraint is not applied. However, synchronous decoding with two or more model sets does not work when the models have different architectures, such as a different dictionary or phone set. Asynchronous decoding is computationally expensive, and is not flexible with regard to the level at which combination can be performed. Thus, it is typical that a combination of two systems with different phone sets must be done using two separate decoding passes.
The 2-stage architecture for decoding the data twice with two model sets typically takes twice as long to run as a single decoding pass. This approach takes no advantage of possible caching between the multiple systems, which could lead to improvements in efficiency as the multiple model sets are trained on and decode the same data. If multiple model sets are built on the same training set, then the conventional approach does not take advantage of this, e.g. by sharing similar Gaussians between the two systems.
Handcrafted phoneme dictionaries also need to be stored, and thus increase the footprint of the system. Furthermore, it is not possible to create a dictionary containing every possible word, and so for open vocabulary recognition a grapheme to phoneme conversion module is necessary. Grapheme to phoneme conversion can introduce mispronunciations into the decoding, which are fixed from that point onwards. Such conversions do not easily handle misspellings and foreign words, and the conversion needs to be done for all new vocabulary words.
If multiple HMM sets are trained on the same data, then it is expected that many of the resulting Gaussians will be similar, even if the architecture and topology of the model sets differ. Thus, individual Gaussian components could be shared between the multiple HMM sets. This has two advantages. First, the memory needed for storing the multiple acoustic models is reduced as parameters are shared among the models.
Second, likelihoods can be cached between multiple systems, allowing for a more efficient decoding. Sharing Gaussians between the two systems imposes no restriction on the methods used both for decoding with and combining the two systems.
By concatenating the multiple systems and performing subspace compression, Gaussians can be shared. This differs from previous use of subspace compression, which is to compress a single model.
Sharing Gaussians between multiple systems overcomes the problem of increased memory needed to store the multiple model sets. By sharing the parameters between systems, the combined model size is reduced. Additionally, likelihoods can be cached and shared between the two systems, leading to gains in decoding time.
There are no noticeable operational differences when using each model set separately; data can be decoded with one or both model sets independently.
Practically, sharing Gaussians between models leads to a smaller model size as fewer parameters need to be stored. Fewer Gaussians mean less memory is needed during decoding for caching the likelihoods when using more than one system.
Thus, in a first aspect, the present invention provides a speech processing method comprising: receiving a speech input comprising a sequence of observations; determining the likelihood of a sequence of observations corresponding to a word or part thereof using a first acoustic model set with a first dictionary; determining the likelihood of a sequence of observations corresponding to a word or part thereof using a second acoustic model set with a second dictionary; and outputting text determined from said first and second acoustic models; wherein each model uses a plurality of pre-calculated probability distributions to determine the said likelihood and wherein probability distributions are shared between the models.
One acoustic model may be a phoneme model with a phoneme dictionary and the other may be a grapheme model with a grapheme dictionary, although the system may run with multiple phoneme and grapheme models, or with just multiple phoneme models or multiple grapheme models.
The sharing of the probability distributions or Gaussians may be complete or partial.
Only the means of the Gaussians may be shared and not their variances. However, in the preferred embodiments all parameters from the probability distribution functions are shared so that results from the first acoustic model are cached for use in the second acoustic model.
In a preferred embodiment, each observation from said sequence of observations is converted into an n-dimensional feature vector in an n-dimensional space which is then further converted into a plurality of sub-vectors, each within a reduced dimension subspace of said n-dimensional space, and said shared probability distributions have been pre-calculated for said sub-spaces.
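By way of illustration (the patent itself contains no code), the following Python sketch shows the kind of sub-vector conversion described above. The choice of a 5-dimensional vector split into a 2-dimensional and a 3-dimensional part follows the example given later in the description; the function name and values are assumptions.

```python
import numpy as np

def split_feature_vector(x, subspace_dims):
    """Split an n-dimensional feature vector into sub-vectors.

    subspace_dims: sizes of each subspace; they must sum to len(x).
    """
    assert sum(subspace_dims) == len(x)
    sub_vectors, start = [], 0
    for d in subspace_dims:
        sub_vectors.append(x[start:start + d])
        start += d
    return sub_vectors

# Example: a 5-dimensional observation split into 2-D and 3-D sub-vectors.
x = np.array([0.3, -1.2, 0.8, 2.1, -0.5])
print(split_feature_vector(x, [2, 3]))
# [array([ 0.3, -1.2]), array([ 0.8,  2.1, -0.5])]
```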
In a further preferred embodiment each acoustic model comprises an atom table containing the probability distributions to be used by said model and an index which links each state of the model to its probability distributions, and wherein the atom table is the same for both models.
In the above, the atom table is preferably formed from codewords formed by clustering probability distribution functions based on states using both the first and second dictionaries. The clustering may be performed using the technique of subspace compression. Here the Gaussians are clustered in the reduced dimension subspaces and Gaussians within a cluster are replaced by the codeword for the cluster. This may be further improved by sharing or tying Gaussians prior to subspace compression. For example, a two stage clustering process may be used, the first stage being performed in the n-dimensional acoustic space of the observations and the second clustering process being performed in subspaces of the acoustic space of the observations.
The output from the two models may be combined using a number of different methods; for example, the word with the highest score may be selected. Language model constraints may also be used in combining the output from the two acoustic models.
The above method is of particular use when applied to database retrieval for systems such as music retrieval by artist or song name, car navigation systems, switchboards, voice activated mobile telephone applications such as contacts or e-mail, electronic program guides etc. Here the method comprises: processing a speech input signal as above; and using the outputted text as a search term for said database.
In a second aspect, the present invention provides a method of speech processing comprising: pooling probability distribution functions relating the probability of an observation to a state, the states being defined by a plurality of dictionaries; combining said probability distribution functions to form representative distributions, where the representative distributions are formed from probability distribution functions associated with said plurality of dictionaries; providing a plurality of acoustic models associated with said dictionaries which share said representative distributions; receiving a speech input comprising a sequence of observations; determining the likelihood of a sequence of observations corresponding to a word or part thereof using a first acoustic model set from said plurality of acoustic models; determining the likelihood of a sequence of observations corresponding to a word or part thereof using a second acoustic model set from said plurality of acoustic models; and outputting text determined from said first and second acoustic models.
Combining said probability distribution functions to form representative distributions may comprise clustering said distributions to form codewords. Clustering may be used to form codewords using subspace compression. In a preferred embodiment, tying is first performed in the n-dimensional acoustic space of the observations and a second clustering process is then performed in subspaces of the acoustic space of the observations.
The training process may be performed independently of the recognition process. Thus, in a third aspect, the present invention provides a method of speech processing comprising: pooling probability distribution functions relating the probability of an observation to a state, the states being defined by a plurality of dictionaries; combining said probability distribution functions to form representative distributions, where the representative distributions are formed from probability distribution functions associated with said plurality of dictionaries; providing a plurality of acoustic models associated with said dictionaries which share said representative distributions; receiving a speech input comprising a sequence of observations; determining the likelihood of a sequence of observations corresponding to a word or part thereof using a first acoustic model set from said plurality of acoustic models.
In a fourth aspect, the present invention provides a speech processing method comprising: receiving a speech input comprising a sequence of observations; providing a first acoustic model set with a first dictionary; providing a second acoustic model set with a second dictionary; wherein each model uses a plurality of pre-calculated probability distributions to determine the said likelihood and wherein probability distributions are shared between the models, the method further comprising selecting between the use of the first model, second model or the use of both models to process speech and outputting text determined from said acoustic models.
Selection between the use of one model or both models may be automatic. For example, if the system is used to retrieve data from a database keyed with non-common words, then the system may decide to use just one type of model. Alternatively, the user could be made to specify the type of request and the system choose the appropriate model.
The present invention also extends to a computer program configured to cause a computer to perform any of the above methods.
In a fifth aspect, the present invention provides a speech processing system comprising: a speech input comprising a sequence of observations; a processor adapted to: determine the likelihood of a sequence of observations corresponding to a word or part thereof using a first acoustic model set with a first dictionary; determine the likelihood of a sequence of observations corresponding to a word or part thereof using a second acoustic model set with a second dictionary; and output text determined from said first and second acoustic models; wherein each model uses a plurality of pre-calculated probability distributions to determine the said likelihood and wherein probability distributions are shared between the models.
The speech input may be a microphone or other means to receive speech and convert it into a form for processing by a computer.
In a sixth aspect, the present invention provides a speech processing system comprising: a processor being adapted to: pool probability distribution functions relating the probability of an observation to a state, the states being defined by a plurality of dictionaries; combine said probability distribution functions to form representative distributions, where the representative distributions are formed from probability distribution functions associated with said plurality of dictionaries; and provide a plurality of acoustic models associated with said dictionaries which share said representative distributions.
In a seventh aspect, the present invention provides a speech processing system comprising: a processor being adapted to: pool probability distribution functions relating the probability of an observation to a state, the states being defined by a plurality of dictionaries; combine said probability distribution functions to form representative distributions, where the representative distributions are formed from probability distribution functions associated with said plurality of dictionaries; and provide a plurality of acoustic models associated with said dictionaries which share said representative distributions; the system further comprising a speech input comprising a sequence of observations; and a further processor adapted to: determine the likelihood of a sequence of observations corresponding to a word or part thereof using a first acoustic model set from said plurality of acoustic models; determine the likelihood of a sequence of observations corresponding to a word or part thereof using a second acoustic model set from said plurality of acoustic models; and output text determined from said first and second acoustic models.
The processor and the above further processor may be the same or different processors.
In an eighth aspect, the present invention provides a speech processing system comprising: a speech input comprising a sequence of observations; a processor adapted to provide a first acoustic model set with a first dictionary; and provide a second acoustic model set with a second dictionary; wherein each model uses a plurality of pre-calculated probability distributions to determine the said likelihood and wherein probability distributions are shared between the models, the processor being further adapted to allow selection between the use of the first model, second model or the use of both models to process speech and outputting text determined from said acoustic models.
In the above system, the acoustic models may share an atom table.
Although the above has been mainly concerned with decoding using two models, three or more models may be used all sharing probability distribution functions.
The present invention will now be described with reference to the following non-limiting embodiments in which: Figure 1 is a schematic of a known speech recognition system; Figure 2 is a schematic of the standard components of a speech processor; Figure 3 is a schematic of a Gaussian distribution; Figure 4 is a schematic plot of acoustic space wherein an observation is represented by an observation vector; Figure 5 is a schematic of a further known speech recognition system operating using Hidden Markov Models (HMM); Figure 6 is a schematic of a system in accordance with an embodiment of the present invention using subspace compression; Figure 7 is a schematic of a known system using both grapheme and phoneme based acoustic models; Figure 8 is a schematic of an embodiment of the present invention using both phoneme and grapheme based acoustic models; Figure 9 is a flowchart of the training process in accordance with an embodiment of the present invention; Figure 10 is a schematic flowchart showing the steps of an automatic speech recognition system in accordance with an embodiment of the present invention; and Figure 11 is a schematic of a switchable system in accordance with an embodiment of the present invention.
Figure 1 is a schematic of a very basic speech recognition system. A user (not shown) speaks into microphone 1 or other collection device for an audio system. The device 1 could be substituted by a memory which contains audio data previously recorded or the device 1 may be a network connection for receiving audio data from a remote location.
The speech signal is then directed into speech processor 3 which will be described in more detail with reference to figure 2.
Speech processor 3 takes the speech signal and turns it into text corresponding to the speech signal. Many different forms of output are available. For example, the output may be in the form of a display 5 which outputs to a screen. Alternatively, the output could be directed to a printer or the like. Also, the output could be in the form of an electronic signal which is provided to a further system 9. For example, further system 9 could be part of a speech translation system which takes the outputted text from processor 3, converts it into a different language, and outputs the result via a further text or speech system.
Alternatively, the text outputted by a processor 3 could be used to operate different types of equipment, for example, it could be part of a mobile phone, car etc. where the user controls various functions via speech.
Figure 2 is a block diagram of the standard components of a speech recognition processor 3 of the type shown in figure 1. The speech signal received from microphone, through a network or from a recording medium 1 is directed into front-end unit 11.
Front end unit 11 digitises the received speech signal and splits it into frames of equal length. The speech signals are then subjected to a spectral analysis to determine various parameters which are plotted in an "acoustic space". The parameters which are derived will be discussed in more detail later.
The front end unit also removes signals which are not believed to be speech signals and other irrelevant information. Popular front end units comprise apparatus which use filter bank (FBANK) parameters, Mel-Frequency Cepstral Coefficients (MFCCs) and Perceptual Linear Predictive (PLP) parameters. The output of the front end unit is in the form of an input vector which is in n-dimensional acoustic space.
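As a rough illustration of this front-end stage, the sketch below frames a waveform and computes per-frame power spectra with numpy. The 25 ms frame / 10 ms hop, Hamming window and FFT size are common defaults assumed here, not values taken from the patent; a full MFCC or PLP front end would additionally apply a mel filter bank, log compression and a DCT on top of this.

```python
import numpy as np

def frame_signal(signal, frame_len=400, hop=160):
    """Split a waveform into fixed-length, overlapping frames
    (e.g. 25 ms frames with a 10 ms hop at 16 kHz)."""
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    frames = np.stack([signal[i * hop:i * hop + frame_len]
                       for i in range(n_frames)])
    return frames * np.hamming(frame_len)   # taper each frame

def power_spectrum(frames, n_fft=512):
    """Per-frame power spectrum; MFCC or PLP analysis would follow."""
    return np.abs(np.fft.rfft(frames, n_fft)) ** 2

# Toy example on one second of noise at 16 kHz.
sig = np.random.randn(16000)
feats = power_spectrum(frame_signal(sig))
print(feats.shape)   # (98, 257): one spectral vector per frame
```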
The input vector is then fed into decoder 13 which cooperates with both an acoustic model section 15 and a language model section 17. The acoustic model section 15 will generally operate using Hidden Markov Models. However, it is also possible to use acoustic models based on connectionist models and hybrid models.
The acoustic model unit 15 derives the likelihood of a sequence of observations corresponding to a word or part thereof on the basis of the acoustic input alone.
The language model section 17 contains information concerning probabilities of a certain sequence of words or parts of words following each other in a given language.
Generally a static model is used. The most popular method is the N-gram model.
The decoder 13 then traditionally uses a dynamic programming (DP) approach to find the best transcription for a given speech utterance using the results from the acoustic model 15 and the language model 17.
This is then output via output device 19 which allows the text to be displayed, presented or converted for further use, e.g. in speech to speech translation or to control a voice activated device.
This description will be mainly concerned with the use of an acoustic model which is a Hidden Markov Model (HMM). However, it could also be used for other models.
The actual model used in this embodiment is a standard model, the details of which are outside the scope of this patent application. However, the model will require the provision of probability density functions (pdfs) which relate to the probability of an observation represented by an acoustic vector being related to a word or part thereof.
Generally, this probability distribution will be a Gaussian distribution in n-dimensional space.
A schematic example of a generic Gaussian distribution is shown in figure 3. Here, the horizontal axis corresponds to a parameter of the input vector in one dimension and the probability distribution is for a particular word or part thereof relating to the observation. For example, in figure 3, an observation corresponding to acoustic vector x has a probability p1 of corresponding to the word whose probability distribution is shown in figure 3. The shape and position of the Gaussian is defined by its mean and variance. These parameters are determined during training for the vocabulary which the acoustic model covers; they will be referred to as the "model parameters".
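The likelihood evaluation implied by figure 3 can be sketched as follows for a Gaussian with diagonal covariance (an assumption made for brevity; the patent does not restrict the covariance structure). All numbers are illustrative.

```python
import numpy as np

def diag_gaussian_loglik(x, mean, var):
    """Log-density of observation x under a Gaussian with diagonal
    covariance. The mean and var arrays are the 'model parameters'
    estimated during training."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

x    = np.array([0.3, -1.2, 0.8])
mean = np.array([0.0, -1.0, 1.0])
var  = np.array([1.0,  0.5, 2.0])
print(diag_gaussian_loglik(x, mean, var))
```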
In an HMM, once the model parameters have been determined, the model can be used to determine the likelihood of a sequence of observations corresponding to a sequence of words or parts of words. Figure 4 is a schematic plot of acoustic space where an observation is represented by an observation vector or feature vector x. The open circles g correspond to the means of Gaussians or other probability distribution functions plotted in acoustic space.
During decoding, the acoustic model will calculate a number of different likelihoods that the feature vector x1 corresponds to a word or part thereof represented by the Gaussians. These likelihoods are then used in the acoustic model and combined with probabilities from the language model to determine the text spoken.
Hybrid speech recognition systems use two acoustic models when decoding the speech data. Figure 5 shows schematically the decoding process using Hidden Markov Models for two acoustic models using different phone sets and dictionaries. Two 3-state models are shown in figure 5a, one corresponding to the grapheme /a/ 51 and the second corresponding to the phoneme /dh/ 53. Hidden Markov Models are used to calculate the likelihood of a word constructed from these graphemes or phonemes being related to the observations. Each state has a plurality of Gaussians of the kind described with reference to figures 3 and 4 corresponding to it. In a standard HMM model, there are two parts, an atom table and an index. The atom table represents the actual Gaussians (i.e. the means and variances of the Gaussians). The index is the part which links each state of the model 51, 53 to the relevant Gaussians.
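A minimal sketch of this atom-table/index split might look as follows in Python. The class names and the particular state-to-Gaussian mappings are illustrative assumptions; the key point is that, in the shared arrangement described below, both indices reference a single table.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class AtomTable:
    """The pool of Gaussians: one row per Gaussian."""
    means: np.ndarray       # shape (n_gaussians, dim)
    variances: np.ndarray   # shape (n_gaussians, dim)

@dataclass
class ModelIndex:
    """Links each HMM state to its Gaussians in the atom table."""
    state_to_gaussians: dict  # state id -> list of atom-table row indices

# Two model sets (e.g. phoneme and grapheme), each with its own index,
# both pointing into the SAME atom table.
atoms = AtomTable(means=np.zeros((6, 3)), variances=np.ones((6, 3)))
phoneme_index  = ModelIndex({0: [0, 1], 1: [2, 3], 2: [4, 5]})
grapheme_index = ModelIndex({0: [0, 2], 1: [1, 5], 2: [3, 4]})
```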
This information is then fed into the standard system as shown in figure 5. Figure 5 has a decoder 13 (of the type described with reference to figures 1 and 2) which corresponds with a lexicon or dictionary 55, and an acoustic model 57 which calculates likelihoods using HMMs 51 and 53 etc.; it may also correspond with other resources, for example a language model 58 as described with reference to figures 1 and 2.
A speech input 59 is then provided to the decoder and the words 61 are output as text.
Now, a system in accordance with an embodiment of the present invention will be described with reference to figure 6.
In figure 6, the decoder and system is similar to that described with reference to figure 5. Therefore, to avoid any unnecessary repetition, like reference numerals will be used to denote like features. However, the way in which these two acoustic models 51 and 53 interact is completely different.
In figure 6a, the first acoustic model 51 and the second acoustic model 53 share Gaussians. Therefore, there is a single atom table which serves both models 51 and 53.
Even though model 51 is a grapheme model whereas model 53 is a phoneme model, it is possible to share Gaussians between these two models. How this is achieved will be described in more detail with reference to figure 9. An example of the type of Gaussians which can be shared between a phoneme and a grapheme arises when the phoneme corresponds to a particular grapheme. For example, the letter B and the sound of the phoneme for the letter B should correspond to roughly the same distribution.
Therefore, if the likelihood has already been calculated for a word containing the phoneme B, the results for the state of the phoneme B can be cached and used again for the grapheme B when running the grapheme model.
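A hedged sketch of such caching: the cache is keyed on (Gaussian index, frame index), so a likelihood computed while decoding with one model set is looked up, not recomputed, by the other. The table sizes and the diagonal-Gaussian likelihood are assumptions for illustration.

```python
import numpy as np

# Shared atom table: one mean/variance vector per Gaussian.
MEANS = np.random.randn(6, 3)
VARS  = np.ones((6, 3))

_cache = {}  # (gaussian_id, frame_index) -> log-likelihood

def cached_loglik(gaussian_id, frame_index, x):
    """Compute a Gaussian log-likelihood at most once per frame.
    Because both model sets index the same atom table, a value
    computed while running the phoneme model is simply looked up
    again when the grapheme model asks for the same Gaussian."""
    key = (gaussian_id, frame_index)
    if key not in _cache:
        m, v = MEANS[gaussian_id], VARS[gaussian_id]
        _cache[key] = float(-0.5 * np.sum(np.log(2 * np.pi * v)
                                          + (x - m) ** 2 / v))
    return _cache[key]

x = np.array([0.3, -1.2, 0.8])
print(cached_loglik(2, 0, x))   # computed
print(cached_loglik(2, 0, x))   # served from the cache
```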
Although the above explanation is concerned with a phoneme model and a grapheme model, it will be appreciated by those skilled in the art that the system may comprise two phoneme models. For example, the phoneme model for American English is different to the phoneme model for native English.
The idea of using a system which contains both the ability to decode a sentence using a phoneme model and the same sentence using a grapheme model is also known; see for example figure 7.
In the system of figure 7, there is a first model set S1 and a second model set S2. The first model set, for example, is a phoneme set and is linked to the Gaussians defined by Gaussian set 71. The grapheme model is linked to the second set of Gaussians 73.
Although some of these Gaussians may be very similar for example where a phoneme corresponds to a grapheme, in the prior art, these systems are kept completely separate.
Then, decoding is performed separately for S1 using decoder 13, which uses dictionary or lexicon L1. Similarly, it is carried out separately for set S2 using decoder 13.
There is no correspondence between these two models for sharing data until the final step. During the final step, the output from using acoustic model S1 and the output from using acoustic model S2 will be compared. Possibly, the words formed from the phoneme model or the grapheme model with the highest likelihood score will be selected. However, it is possible to use more sophisticated selection techniques using language models. For example, if the output of text 1 is "the hat sat on the mat" and the output of text 2 is "the cat flat on the mat", then the language model can be designed to draw from both outputs and construct the sentence "the cat sat on the mat".
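The simplest combination rule mentioned above (pick the highest-scoring hypothesis) can be sketched in a few lines; the scores shown are invented for illustration, and a real system could instead rescore with a language model as in the "cat sat on the mat" example.

```python
def combine_by_score(hypotheses):
    """Pick the hypothesis with the highest likelihood score.
    hypotheses: list of (text, score) pairs, one per model set."""
    return max(hypotheses, key=lambda h: h[1])[0]

print(combine_by_score([("the hat sat on the mat", -1520.4),
                        ("the cat sat on the mat", -1498.7)]))
# -> "the cat sat on the mat"
```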
In the present invention, as shown in figure 8, a common set of Gaussians 81 is used for both the phoneme model S1 and the grapheme model S2. One of these models will usually be run first. Likelihoods calculated for the acoustic model S1 can then be cached and simply looked up by model S2 rather than recalculated.
How the Gaussians may be combined and how the system is trained will now be described with reference to figure 9.
Using extensive training data, the Gaussian means and variances are derived for both model sets in step S101.
The Gaussians from both sets are then pooled in step S103. At this point, in the preferred embodiment, these Gaussians are n-dimensional Gaussians to match the n-dimensional nature of the acoustic vector which is derived from the observations.
These Gaussians are then clustered over both sets to form codewords in step S105. The codewords are based on the means and variances of the Gaussians which are clustered to form the codewords. Clustering of Gaussians is well known in the art and algorithms for clustering will not be discussed here.
All Gaussians which are assigned to a codeword will then be replaced by the codeword, i.e. a distribution based on the means and variances of all the clustered Gaussians. Usually, the mean of the codeword will be the mean of all the Gaussians which are clustered to form that codeword. Since the clusters are formed from both sets, the codewords will represent probability distribution functions from both sets.
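A sketch of this first-stage clustering, assuming a plain Lloyd's k-means over the pooled Gaussian means; for brevity it clusters on means only, whereas the description bases codewords on means and variances, and all sizes are invented.

```python
import numpy as np

def cluster_gaussians(means, n_codewords, n_iter=20, seed=0):
    """First-stage tying: k-means over the pooled Gaussian means from
    BOTH model sets. Every Gaussian assigned to a cluster is replaced
    by that cluster's codeword (here, the centroid)."""
    rng = np.random.default_rng(seed)
    centroids = means[rng.choice(len(means), n_codewords, replace=False)]
    for _ in range(n_iter):
        # Assign each Gaussian to its nearest codeword.
        d = np.linalg.norm(means[:, None, :] - centroids[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        # Move each codeword to the mean of its cluster.
        for k in range(n_codewords):
            if np.any(assign == k):
                centroids[k] = means[assign == k].mean(axis=0)
    return centroids, assign

# Pool Gaussians from a phoneme set and a grapheme set, then tie them.
pooled = np.vstack([np.random.randn(40, 5), np.random.randn(40, 5) + 0.1])
codewords, assignment = cluster_gaussians(pooled, n_codewords=30)
print(codewords.shape, assignment[:10])
```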
After this first clustering stage which takes place in n-dimensional space, subspace compression is then performed in step S107.
As mentioned above, the Gaussians which have been clustered previously were n-dimensional Gaussians. However, in this next step, the n-dimensional feature vector is subdivided into a plurality of subspace feature vectors, each of reduced dimensionality compared with the original feature vector. For example, if the original feature vector was expressed in five dimensions, the two sub-vectors could be expressed in two dimensions and three dimensions.
During subspace compression, the original five dimensional Gaussians are then split into three dimensional and two dimensional Gaussians, or other reduced dimension Gaussians. These Gaussians are then further clustered to form codewords in these reduced dimensions, and a compressed shared set of pdfs is obtained.
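The subspace step can be sketched as follows: each Gaussian mean is cut into sub-vectors along the assumed 2-D/3-D split, and each subspace is quantised separately, so that a Gaussian is thereafter stored as one codeword index per subspace instead of a full vector. The helper k-means and all sizes are illustrative assumptions.

```python
import numpy as np

def kmeans(data, k, n_iter=15, seed=0):
    """Minimal Lloyd's k-means returning (centroids, assignments)."""
    rng = np.random.default_rng(seed)
    c = data[rng.choice(len(data), k, replace=False)]
    for _ in range(n_iter):
        a = np.linalg.norm(data[:, None] - c[None], axis=2).argmin(1)
        for j in range(k):
            if np.any(a == j):
                c[j] = data[a == j].mean(0)
    return c, a

def subspace_compress(means, subspace_dims, codewords_per_subspace):
    """Split each Gaussian mean into sub-vectors (e.g. 5-D -> 2-D + 3-D)
    and cluster each subspace separately; a Gaussian is then stored as
    one codeword index per subspace."""
    tables, indices, start = [], [], 0
    for d in subspace_dims:
        sub = means[:, start:start + d]
        c, a = kmeans(sub, codewords_per_subspace)
        tables.append(c)       # per-subspace codeword table
        indices.append(a)      # per-Gaussian codeword index
        start += d
    return tables, np.stack(indices, axis=1)

means = np.random.randn(100, 5)
tables, idx = subspace_compress(means, [2, 3], codewords_per_subspace=16)
print([t.shape for t in tables], idx.shape)  # [(16, 2), (16, 3)] (100, 2)
```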
Figure 9 shows a method in accordance with a particularly preferred embodiment. The clustering step does not need to be in two stages. However, beneficial performance has been found if the clustering step is taken in two stages.
As explained above, the HMMs consist of two parts: the atom table and the index. The atom table is the actual Gaussians (i.e. means and variances of all the subspace Gaussians). The atom table size is the same for both approaches, because the subspace compression algorithm controls how many Gaussians will be in the final atom table.
The index is the part which links each HMM state to the relevant Gaussian or subspace Gaussians. There are two parts to this: the link from each state to its Gaussian components, and the link from each Gaussian to its subspace Gaussians. The index changes depending on how the tying is done.
Suppose there are three states, each with 2 Gaussian components. For the first part of the index, 6 links are stored to show which Gaussians belong to which state, for example:

State 1 -> Gaussians #1 #2
State 2 -> Gaussians #3 #4
State 3 -> Gaussians #5 #6

If subspace compression as explained with reference to step S107 is performed, then 18 links need to be stored for the second part of the index to link each Gaussian to its sub-Gaussians:

Gauss #1 -> Sub-Gaussians #1 #2 #3
...
Gauss #6 -> Sub-Gaussians #16 #17 #18

Thus a total of 24 links is required for this simple basic model with subspace compression.
It is possible to replace the first part of the index with a link directly from the state to the sub-Gaussians.
However, if clustering as per step S105 is performed before subspace compression, then tying the original 6 Gaussians down to 3 still requires 6 links for the first part of the index:

State 1 -> Gaussians #1 #2
State 2 -> Gaussians #2 #3
State 3 -> Gaussians #1 #3

If subspace compression is then performed, it is only necessary to store 9 links for the second part of the index:

Gauss #1 -> Sub-Gaussians #1 #2 #3
...
Gauss #3 -> Sub-Gaussians #7 #8 #9

i.e. 15 links in total for the model which has Gaussian tying followed by subspace compression.
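The link counts in the two worked examples above can be checked mechanically; the formula below simply restates the two-part index structure described in the text.

```python
def index_links(n_states, gaussians_per_state, n_gaussians, n_subspaces):
    """Total index links: state->Gaussian links plus
    Gaussian->sub-Gaussian links (one per subspace)."""
    return n_states * gaussians_per_state + n_gaussians * n_subspaces

# Subspace compression alone: 6 distinct Gaussians, 3 subspaces each.
print(index_links(3, 2, 6, 3))   # 6 + 18 = 24 links
# Tying first (6 Gaussians -> 3), then subspace compression.
print(index_links(3, 2, 3, 3))   # 6 + 9 = 15 links
```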
A real system can have 650 states and 20832 Gaussians. Thus, tying the 20832 Gaussians to 15624 (75%) before doing subspace compression can give a significant saving in index size. The original model size of 5.5M is reduced to 1.3M by subspace compression alone, and is further reduced to 1M if subspace compression is preceded by Gaussian tying.
Figure 10 is a flowchart showing how a speech recognition system which has been trained as described with reference to figure 9 operates.
In step S121, in the standard manner, a speech input is received as a sequence of observations. In the same manner as before, each observation is converted into an input vector in step S123. In this particular example, each observation is converted into a plurality of input vectors in different subspaces. For example, if the input vector is a five dimensional vector, it can be subdivided into a two dimensional vector and a three dimensional vector.
A first acoustic model is then run using a first acoustic model set having a first dictionary. For example, the first acoustic model set and first dictionary may be a phoneme based model in step S125. This acoustic model will contain an atom table which is shared with the second acoustic model. The likelihoods which are calculated during the running of the first acoustic model will be cached in step S127.
The second acoustic model is then run in step S129; it may be a grapheme based model or another phoneme model. Because likelihoods for the shared Gaussians have already been cached during the first pass, the run time of the system can be significantly reduced.
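Putting the pieces together, here is a sketch of the two-pass decode with a shared cache. The state inventories, the best-component approximation (in place of a full weighted mixture sum) and all sizes are assumptions for illustration.

```python
import numpy as np

def run_model(frames, state_gaussians, atoms_mean, atoms_var, cache):
    """Score every (state, frame) pair for one model set, reusing any
    Gaussian likelihoods already in the cache from the other model."""
    scores = {}
    for t, x in enumerate(frames):
        for state, gids in state_gaussians.items():
            logps = []
            for g in gids:
                if (g, t) not in cache:
                    m, v = atoms_mean[g], atoms_var[g]
                    cache[(g, t)] = float(-0.5 * np.sum(
                        np.log(2 * np.pi * v) + (x - m) ** 2 / v))
                logps.append(cache[(g, t)])
            scores[(state, t)] = max(logps)   # best-component approximation
    return scores

atoms_mean, atoms_var = np.random.randn(6, 3), np.ones((6, 3))
frames = np.random.randn(4, 3)
cache = {}
phoneme_scores  = run_model(frames, {0: [0, 1], 1: [2, 3]},
                            atoms_mean, atoms_var, cache)
grapheme_scores = run_model(frames, {0: [1, 2], 1: [4, 5]},
                            atoms_mean, atoms_var, cache)
print(len(cache))   # 24, not 32: shared Gaussians 1 and 2 scored once
```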
For example, S1 is a phoneme system with 67 phones, and S2 is a grapheme system with 26 letters. Individually, they are 2.8MB, but can be compressed to 770K. By concatenating S1 and S2 to give SC, and then performing subspace compression with the same number of codewords, the model size can be reduced to 1.2MB. This is larger than the 770K due to the extra space needed to account for the extra models.
Experiments using these models with a) G2P dictionary and b) grapheme dictionary give the following results.
Original models:
S1: 82.60% accuracy (G2P dictionary)
S2: 83.46% accuracy (grapheme dictionary)

Original models after subspace compression:
S1: 81.83% accuracy (G2P dictionary)
S2: 82.18% accuracy (grapheme dictionary)

i.e. there is some loss in performance from the subspace compression.

Combined models after subspace compression:
SC: 82.09% accuracy (G2P dictionary)
SC: 81.75% accuracy (grapheme dictionary)

The performance of the compressed SC model is about the same as that of the individual models after subspace compression, although in total there are half as many Gaussians due to sharing between S1 and S2.
There are fairly specific situations where the above system is of particular use, for example in the retrieval of real names from a database, such as in the field of music retrieval.
Therefore, the system may sometimes operate just running the phoneme model, other times it may operate just running the grapheme model and sometimes it may run using both models and caching the results between both models.
A schematic example of such a system is shown in figure 11. Here, at step S151, the instruction to the system is characterised. For example, if the system is retrieving music using the name of a particular artist, then it may be desirable to use just the grapheme model, as shown in step S155. If, on the other hand, the instruction is spoken text which does not contain unusual or unnatural words, then it may simply be acceptable to use the phoneme model in step S153. For systems which handle a combination of both types of input, both models can be used, as shown in step S157. However, in all cases, the system shown in figure 11 will be using a common atom table between all models.
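A hypothetical dispatch mirroring figure 11 might look as follows; the request-type labels are invented for illustration, not taken from the patent.

```python
def choose_models(request_type):
    """Decide which model set(s) decode the utterance, per figure 11."""
    if request_type == "artist_name":      # unusual proper names
        return ["grapheme"]                # step S155
    if request_type == "natural_speech":   # ordinary vocabulary
        return ["phoneme"]                 # step S153
    return ["phoneme", "grapheme"]         # mixed input: use both (S157)

print(choose_models("artist_name"))   # ['grapheme']
```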

Claims (20)

CLAIMS:

1. A speech processing method comprising: receiving a speech input comprising a sequence of observations; determining the likelihood of a sequence of observations corresponding to a word or part thereof using a first acoustic model set with a first dictionary; determining the likelihood of a sequence of observations corresponding to a word or part thereof using a second acoustic model set with a second dictionary; and outputting text determined from said first and second acoustic models; wherein each model uses a plurality of pre-calculated probability distributions to determine the said likelihood and wherein probability distributions are shared between the models.
2. A method according to claim 1, wherein one acoustic model is a phoneme model with a phoneme dictionary and the other is a grapheme model with a grapheme dictionary.
3. A method according to any preceding claim, wherein likelihoods from the first acoustic model are cached for use in the second acoustic model.
4. A method according to any preceding claim, wherein each observation from said sequence of observations is converted into an n-dimensional feature vector in an n-dimensional space which is then further converted into a plurality of sub-vectors each within a reduced dimension subspace of said n-dimensional space and said shared probability distributions have been pre-calculated for said sub-spaces.
5. A method according to any preceding claim, wherein each acoustic model is a HMM model comprising an atom table containing the probability distributions to be used by said model and an index which links each state of the model to its probability distributions, and wherein the atom table is the same for both models.
6. A method according to claim 5, wherein the atom table comprises codewords formed by clustering probability distribution functions based on states using both the first and second dictionaries.
7. A method according to claim 6, wherein the atom table comprises codewords formed by clustering probability distribution functions in a reduced dimensionality subspace.
8. A method according to claim 7, wherein the atom table comprises codewords formed by clustering probability distribution functions based on states using both the first and second dictionaries using a two stage clustering process, the first stage being performed in the n-dimensional acoustic space of the observations and the second clustering process being performed in subspaces of the acoustic space of the observations.
9. A method according to any preceding claim, wherein the output of the first model is combined with the output from the second model.
10. A method of retrieving data from a database, the method comprising: processing a speech input signal according to claim 1; and using the outputted text as a search term for said database.
11. A method of speech processing comprising: pooling probability distribution functions relating the probability of an observation to a state, the states being defined by a plurality of dictionaries; combining said probability distribution functions to form representative distributions, where the representative distributions are formed from probability distribution functions associated with said plurality of dictionaries; providing a plurality of acoustic models associated with said dictionaries which share said representative distributions; receiving a speech input comprising a sequence of observations; determining the likelihood of a sequence of observations corresponding to a word or part thereof using a first acoustic model set from said plurality of acoustic models; determining the likelihood of a sequence of observations corresponding to a word or part thereof using a second acoustic model set from said plurality of acoustic models; and outputting text determined from said first and second acoustic models.
12. A method of speech processing comprising: pooling probability distribution functions relating the probability of an observation to a state, the states being defined by a plurality of dictionaries; combining said probability distribution functions to form representative distributions, where the representative distributions are formed from probability distribution functions associated with said plurality of dictionaries; providing a plurality of acoustic models associated with said dictionaries which share said representative distributions; receiving a speech input comprising a sequence of observations; determining the likelihood of a sequence of observations corresponding to a word or part thereof using a first acoustic model set from said plurality of acoustic models.
13. A method according to either of claims 11 or 12, wherein combining said probability distribution functions to form representative distributions comprises clustering said distributions to form codewords.
14. A method according to claim 13, wherein clustering is performed in a reduced dimensionality subspace.
15. A method according to claim 14, wherein a two stage clustering process is used, the first stage being performed in the n-dimensional acoustic space of the observations and the second clustering process being performed in subspaces of the acoustic space of the observations.
16. A speech processing method comprising: receiving a speech input comprising a sequence of observations; providing a first acoustic model set with a first dictionary; providing a second acoustic model set with a second dictionary; wherein each model uses a plurality of pre-calculated probability distributions to determine the said likelihood and wherein probability distributions are shared between the models, the method further comprising selecting between the use of the first model, second model or the use of both models to process speech and outputting text determined from said acoustic models.
17. A speech processing system comprising: a speech input comprising a sequence of observations; a processor adapted to: determine the likelihood of a sequence of observations corresponding to a word or part thereof using a first acoustic model set with a first dictionary; determine the likelihood of a sequence of observations corresponding to a word or part thereof using a second acoustic model set with a second dictionary; and output text determined from said first and second acoustic models; wherein each model uses a plurality of pre-calculated probability distributions to determine the said likelihood and wherein probability distributions are shared between the models.
18. A speech processing system comprising: a processor being adapted to: pool probability distribution functions relating the probability of an observation to a state, the states being defined by a plurality of dictionaries; combine said probability distribution functions to form representative distributions, where the representative distributions are formed from probability distribution functions associated with said plurality of dictionaries; and provide a plurality of acoustic models associated with said dictionaries which share said representative distributions.
19. A speech processing system comprising: a processor being adapted to: pool probability distribution functions relating the probability of an observation to a state, the states being defined by a plurality of dictionaries; combine said probability distribution functions to form representative distributions, where the representative distributions are formed from probability distribution functions associated with said plurality of dictionaries; and provide a plurality of acoustic models associated with said dictionaries which share said representative distributions; the system further comprising a speech input comprising a sequence of observations; and a further processor adapted to: determine the likelihood of a sequence of observations corresponding to a word or part thereof using a first acoustic model set from said plurality of acoustic models; determine the likelihood of a sequence of observations corresponding to a word or part thereof using a second acoustic model set from said plurality of acoustic models; and output text determined from said first and second acoustic models.
20. A speech processing system comprising: a speech input comprising a sequence of observations; a processor adapted to provide a first acoustic model set with a first dictionary; and provide a second acoustic model set with a second dictionary; wherein each model uses a plurality of pre-calculated probability distributions to determine the said likelihood and wherein probability distributions are shared between the models, the processor being further adapted to allow selection between the use of the first model, second model or the use of both models to process speech and outputting text determined from said acoustic models.
GB0820908A 2008-11-14 2008-11-14 A speech recognition method and system Expired - Fee Related GB2465383B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB0820908A GB2465383B (en) 2008-11-14 2008-11-14 A speech recognition method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB0820908A GB2465383B (en) 2008-11-14 2008-11-14 A speech recognition method and system

Publications (3)

Publication Number Publication Date
GB0820908D0 GB0820908D0 (en) 2008-12-24
GB2465383A true GB2465383A (en) 2010-05-19
GB2465383B GB2465383B (en) 2011-09-21

Family

ID=40194675

Family Applications (1)

Application Number Title Priority Date Filing Date
GB0820908A Expired - Fee Related GB2465383B (en) 2008-11-14 2008-11-14 A speech recognition method and system

Country Status (1)

Country Link
GB (1) GB2465383B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2473900A (en) * 2009-06-23 2011-03-30 Autonomy Corp Ltd Speech recognition using run time correction for sequences
US8190420B2 (en) 2009-08-04 2012-05-29 Autonomy Corporation Ltd. Automatic spoken language identification based on phoneme sequence patterns
CN103975326A (en) * 2011-12-06 2014-08-06 大陆汽车有限责任公司 Method and system for selecting at least one data record from a relational database
US9646603B2 (en) 2009-02-27 2017-05-09 Longsand Limited Various apparatus and methods for a speech recognition system

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102386854B1 (en) * 2015-08-20 2022-04-13 삼성전자주식회사 Apparatus and method for speech recognition based on unified model

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5865626A (en) * 1996-08-30 1999-02-02 Gte Internetworking Incorporated Multi-dialect speech recognition method and apparatus
WO1999021168A1 (en) * 1997-10-16 1999-04-29 Sony Electronics, Inc. Parameter sharing speech recognition system

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5865626A (en) * 1996-08-30 1999-02-02 Gte Internetworking Incorporated Multi-dialect speech recognition method and apparatus
WO1999021168A1 (en) * 1997-10-16 1999-04-29 Sony Electronics, Inc. Parameter sharing speech recognition system

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
"A semi-continuous stochastic trajectory model for phoneme-based continuous speech recognition", Siohan O; Yifan Gong, 1996, IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings, vol 1, pg 471-474. *
"Decision Tree Based Clustering", Dongsuk Yook, 2002, Intelligent Data Engineering and Automated Learning - IDEAL 2002. Third International Conference (Lecture Notes in Computer Science Vol.2412), pg 487-492. *
"Deleted interpolation and density sharing for continuous hidden Markov models" Huang X D; Mei-Yuh Hwang; Li Jiang; Mahajan M, 1996, IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings, vol. 2, pg 885-888. *
"Phoneme-grapheme based speech recognition system", Doss M M; Stephenson T A; Bourlard H; Bengio S, 2003, IEEE Workshop on Automatic Speech Recognition and Understanding, pg 94 - 98. *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9646603B2 (en) 2009-02-27 2017-05-09 Longsand Limited Various apparatus and methods for a speech recognition system
GB2473900A (en) * 2009-06-23 2011-03-30 Autonomy Corp Ltd Speech recognition using run time correction for sequences
US8229743B2 (en) 2009-06-23 2012-07-24 Autonomy Corporation Ltd. Speech recognition system
GB2473900B (en) * 2009-06-23 2013-02-20 Autonomy Corp Ltd Improvements for a speech recognition system
US8190420B2 (en) 2009-08-04 2012-05-29 Autonomy Corporation Ltd. Automatic spoken language identification based on phoneme sequence patterns
US8401840B2 (en) 2009-08-04 2013-03-19 Autonomy Corporation Ltd Automatic spoken language identification based on phoneme sequence patterns
US8781812B2 (en) 2009-08-04 2014-07-15 Longsand Limited Automatic spoken language identification based on phoneme sequence patterns
CN103975326A (en) * 2011-12-06 2014-08-06 大陆汽车有限责任公司 Method and system for selecting at least one data record from a relational database
US9715523B2 (en) 2011-12-06 2017-07-25 Continental Automotive Gmbh Method and system for selecting at least one data record from a relational database
CN103975326B (en) * 2011-12-06 2018-01-12 大陆汽车有限责任公司 For selecting the method and system of at least one data record from Relational database

Also Published As

Publication number Publication date
GB2465383B (en) 2011-09-21
GB0820908D0 (en) 2008-12-24

Similar Documents

Publication Publication Date Title
US9043213B2 (en) Speech recognition and synthesis utilizing context dependent acoustic models containing decision trees
CA2351988C (en) Method and system for preselection of suitable units for concatenative speech
EP2192575B1 (en) Speech recognition based on a multilingual acoustic model
US10176809B1 (en) Customized compression and decompression of audio data
US20200410981A1 (en) Text-to-speech (tts) processing
JPH0772840B2 (en) Speech model configuration method, speech recognition method, speech recognition device, and speech model training method
US20100057435A1 (en) System and method for speech-to-speech translation
CN113470662A (en) Generating and using text-to-speech data for keyword spotting systems and speaker adaptation in speech recognition systems
JPH08278794A (en) Speech recognition device and its method and phonetic translation device
KR20060050361A (en) Hidden conditional random field models for phonetic classification and speech recognition
KR20080018622A (en) Speech recognition system of mobile terminal
US11705116B2 (en) Language and grammar model adaptation using model weight data
GB2465383A (en) A speech recognition system using a plurality of acoustic models which share probability distributions
CN117043857A (en) Method, apparatus and computer program product for English pronunciation assessment
JP2006012179A (en) Natural language processor and natural language processing method
JP5184467B2 (en) Adaptive acoustic model generation apparatus and program
Shafran et al. Acoustic model clustering based on syllable structure
JP2001195087A (en) Voice recognition system
JPH10254473A (en) Method and device for voice conversion
Elshafei et al. Speaker-independent natural Arabic speech recognition system
JPH1097293A (en) Dictionary preparing device for voice recognizing words and continuous speech recognition system
Gulić et al. A digit and spelling speech recognition system for the croatian language
JPH05232989A (en) Method for adapting speaker to acoustic model
Hirose et al. Continuous speech recognition of Japanese using prosodic word boundaries detected by mora transition modeling of fundamental frequency contours
JP2862306B2 (en) Voice recognition device

Legal Events

Date Code Title Description
PCNP Patent ceased through non-payment of renewal fee

Effective date: 20221114