WO1993004468A1 - A pattern recognition device using an artificial neural network for context dependent modelling - Google Patents

A pattern recognition device using an artificial neural network for context dependent modelling

Info

Publication number
WO1993004468A1
WO1993004468A1 (PCT/BE1991/000058)
Authority
WO
WIPO (PCT)
Prior art keywords
neural network
sigmoid
recognition device
class
values
Prior art date
Application number
PCT/BE1991/000058
Other languages
French (fr)
Inventor
Hervé BOURLARD
Nelson Morgan
Original Assignee
Lernout & Hauspie Speechproducts
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lernout & Hauspie Speechproducts filed Critical Lernout & Hauspie Speechproducts
Priority to DE69126983T priority Critical patent/DE69126983T2/en
Priority to PCT/BE1991/000058 priority patent/WO1993004468A1/en
Priority to EP91914807A priority patent/EP0553101B1/en
Priority to JP51351991A priority patent/JP3168004B2/en
Publication of WO1993004468A1 publication Critical patent/WO1993004468A1/en

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 — Speech recognition
    • G10L15/08 — Speech classification or search
    • G10L15/14 — Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/142 — Hidden Markov Models [HMMs]
    • G10L15/144 — Training of HMMs
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06V — IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 — Arrangements for image or video recognition or understanding
    • G06V10/70 — Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/768 — Arrangements for image or video recognition or understanding using pattern recognition or machine learning using context analysis, e.g. recognition aided by known co-occurring patterns


Abstract

A pattern (for example speech) recognition device comprising an artificial neural network set-up having K X M output units and provided for calculating probabilities of observing a feature vector (xt) (12) on a class (qk) (1 ≤ k ≤ K) (11) conditioned on predetermined contextual models (cm) (1 ≤ m ≤ M) (17), each of said classes being represented by at least one model belonging to a finite set of models (M) governed by statistical laws. Said neural network set-up is split into a first neural network having K output units and being provided for calculating a posteriori probabilities of said class (qk) given said observed vector (xt), and at least one further neural network having M output units and provided for calculating a posteriori probabilities of said contextual models, conditioned on said class.

Description

"A pattern recognition device using an Artificial Neural Network for context dependent modelling"
The invention relates to a pattern recognition device comprising an artificial neural network set-up having K X M output units and provided for calculating probabilities of observing a feature vector (xt) on a class (qk ) (1≤k≤K) conditioned on predetermined contextual models (cm) (1≤m≤M), said device having an input for receiving a data stream and comprising sampling means provided for sequentially fetching data samples by sampling said data stream and for determining said feature vector (xt) from a data sample, each of said classes being represented by at least one model belonging to a finite set of models (M) governed by statistical laws.
Such a device is known from the article of H. Bourlard and C.J. Wellekens, entitled "Links between Markov Models and Multilayer Perceptrons" and published in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 12, N° 12, December 1990, p. 1167-1178. In the known device the patterns to be recognized are human speech patterns. For the recognition of the patterns use is made of hybrid ANN (Artificial Neural Network) / HMM (Hidden Markov Models) speech recognition. Hidden Markov Models have provided a good representation of speech characteristics. Artificial Neural Networks are used to solve difficult problems in speech recognition, and algorithms have been developed for calculating emission probabilities. According to HMM the speech is supposed to be produced by a particular finite state automaton built up from a set of classes Q = { q1, q2, ..., qK } governed by statistical laws. In order to recognize the inputted data stream, the inputted speech is sampled and transformed into a sequence of acoustic vectors X = { x1, x2, ..., xt, ..., xT }, where xt represents the acoustic vector at time t. In the known device use is made of Multilayer Perceptrons (MLP), which are a particular form of ANNs. The MLPs are trained to generate Bayes probabilities or a posteriori probabilities p(qk | xt), which can be transformed into emission probabilities by using Bayes' rule.
A drawback of the known device is that for more complex models, such as context-dependent models, many more parameters must be estimated with the same limited amount of data. Indeed, if there are K possible classes and M possible contexts, and if use has to be made of the information of the left and right contexts of the considered class, there are K X M X M possible combinations of (qk, cl, cr), where cl and cr respectively represent left and right contexts belonging to a set C = { c1, c2, ..., cM } of possible contextual models. Whether likelihoods are generated by MLPs or by standard training methods for HMMs, neither will be a good probabilistic estimate for a phonetic condition that is rarely or never observed. A simple use of the known device for calculating emission probabilities p(xt | qk, cl, cr) of observing a vector xt on a current class qk within predetermined contextual models would lead to an output layer with thousands of units and millions of parameters to train.
In order to solve this problem, interpolations are used in HMM systems.
This solution reflects the trade-off between detailed models, which are poorly estimated because not enough training material is available, and rough models, which are well estimated because of their limited number of parameters. However, this interpolation still causes errors in the recognition of the pattern, which renders the device not reliable enough.
It is an object of the present invention to mitigate the aforementioned drawbacks.
A device according to the invention is therefore characterized in that said neural network set-up is split into a first neural network having K output units and being provided for calculating a posteriori probabilities of said class (qk) given said observed vector (xt), and at least one further neural network having M output units and provided for calculating a posteriori probabilities of said contextual models, conditioned on said class. By splitting the network set-up into a first and at least a further neural network, where each network is provided for the determination of a particular a posteriori probability as set out herebefore, it is no longer necessary to make any assumption or simplification to obtain the emission probability p(xt | qk, cl, cr) of observing a vector xt on a class qk conditioned on predetermined contextual models.
In comparison to the known device, where a direct network implementation was used, this solution greatly reduces the number of parameters and thus the required memory capacity of the device.
A first preferred embodiment of a device according to the invention is characterized in that said further neural network is provided for determining, independently of each other, a first pre-sigmoid output value Zj(xt) and a second pre-sigmoid output value Yj(c), wherein said first and said second pre-sigmoid values are determined from an inputted feature vector and from inputted classes respectively, said further neural network comprising a set of upper units provided for determining p(c | qk, xt) values from said pre-sigmoid output values. The pre-sigmoid output values Zj and Yj being independent of each other implies that they can be determined independently from each other, which simplifies the neural network set-up even more. The determination of the output value is then easily realised by a set of upper units receiving the pre-sigmoid values, resulting in a more efficient set-up.
Preferably said further neural network comprises a first hidden layer provided for determining, upon a received feature vector xt, values

zh = f( Σi dih xti )

wherein dih is a weighting factor, f a sigmoid function and 1 ≤ h ≤ H, H being the total number of hidden units in said first hidden layer, said first hidden layer being connected with summing units provided for determining said first pre-sigmoid value by

Zj(xt) = Σh bhj zh
wherein bhj is a weighting factor. An efficient architecture for determining the first pre-sigmoid value is thus obtained. A second preferred embodiment of a device according to the invention is characterized in that said further neural network comprises a memory provided for storing said second pre-sigmoid output value Yj(c), said device further comprising an address generator provided for generating, upon a received class qk, an address wherein the second pre-sigmoid value Yj(c) assigned to said class qk is stored. The independence of Zj and Yj enables a pre-calculation of the contextual contribution to the output. This computation is for example realised at the end of the training phase and thus enables a storage of the second pre-sigmoid values for each model. Because those pre-sigmoid values are now stored in a memory, it is no longer necessary to calculate them each time, thus saving a lot of computation time. The pre-sigmoid value only has to be read from the memory once it has been stored.
Preferably it comprises a second hidden layer provided for determining, upon a received class qk, further values

yl = f( Σk wkl qk )

wherein wkl are trained weighting factors and f a sigmoid function, said second hidden layer being connected with a further summing unit provided for determining said second pre-sigmoid value

Yj(c) = Σl alj yl

wherein alj are trained weighting factors, 1 ≤ l ≤ L, L being the total number of hidden units in said second hidden layer. An efficient architecture for determining the second pre-sigmoid value is thus obtained.
A third preferred embodiment of a device according to the invention is characterised in that it comprises a memory provided for storing third pre-sigmoid output values Yj(qk, cm) determined on inputted classes (qk) and contextual models (cm), said pre-sigmoid values being storable according to a K X M X N matrix, said device further comprising an address generator provided for generating, upon a received (qk, cm) set, an address wherein the third pre-sigmoid values assigned to said set are stored. Because the pre-sigmoid value Yj(qk, cm) is also independent of the feature vector, a pre-calculation and storage thereof is possible, which thus reduces the calculation amount.
Preferably said class and said contextual models together form a triphone (cl, qk, cr), said first network being provided for calculating p(qk | xt), said further networks comprising a second, respectively a third, a fourth and a fifth network provided for calculating p(cr | qk, xt), respectively p(cl | cr, qk, xt), p(cl | cr, qk) and p(cr | qk). Triphone recognition is thus easily realised.
The invention will now be described in detail in connection with the drawings in which :
Figure 1 shows a schematic view of a device according to the invention ;
Figure 2 shows a flow diagram illustrating the operation of a device according to the invention ;
Figures 3 and 4 schematically illustrate neural networks belonging to a device according to the invention.
Patterns to be recognized can be of various kinds, such as for example pictures or speech. The present invention will be described using speech recognition as an example. This is however only done for the purpose of clarity, and it will be clear that the described device can also be used for the recognition of patterns other than speech.
Speech is built up of phonemes. For example the word "cat" is composed of three phonemes: the "k" sound, the short "a" and the final "t". Speech recognition signifies the determination of the sequence of elements at least as large as phonemes in order to determine the linguistic content.
An example of a pattern, in particular speech, recognition device is schematically shown in figure 1. Data, in particular speech, is supplied via a line 1 to sampling means 2. After being sampled by the sampling means, the data samples are supplied to a processing unit 3 comprising an Artificial Neural Network set-up, abbreviated ANN, provided for determining emission probabilities. Those emission probabilities are then supplied to a further processing device 4 provided for the recognition of the inputted data, for example an inputted sentence in the case of speech.
Automatic speech recognition (ASR) performed by the device illustrated in figure 1 comprises several steps such as illustrated in the flow diagram of figure 2. In a first step 5 the inputted data is collected, for example by means of a microphone in the case of speech. The electrical signal outputted by the microphone is thereafter preprocessed 6, which comprises for example a filter operation in order to flatten the spectral slope using a time constant which is much longer than a speech frame.
After the pre-processing step 6, a feature extraction 7 is performed, which comprises the determination of representations of the speech signal which are independent of acoustic variations but sensitive to linguistic content. Typically, speech analysis is realised over a fixed-length "frame", or analysis window. For instance, suppose speech is sampled at 16 kHz after being filtered at 6.4 kHz to prevent spectral "aliasing". A window of 32 msec (512 points) is for example used as the input to a spectral analysis module, with one analysis performed at regular intervals, for example every 10 msec (160 points). In this way, the speech signal is transformed into a sequence of feature vectors X = { x1, x2, ..., xt, ..., xT }, where xt represents the feature vector at time t. In the case of speech such a feature vector is an acoustic vector.
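The framing scheme described above (a 32 msec window advanced every 10 msec over 16 kHz samples) can be sketched as follows. This is an illustrative sketch only; the function name and the plain-list signal are not part of the patent.

```python
def frame_signal(samples, window=512, hop=160):
    """Slice a sampled signal into overlapping fixed-length analysis frames.

    Defaults follow the figures in the text: a 32 msec window (512 points)
    advanced every 10 msec (160 points) at a 16 kHz sampling rate.
    """
    frames = []
    start = 0
    while start + window <= len(samples):
        frames.append(samples[start:start + window])
        start += hop
    return frames

# One second of 16 kHz audio yields (16000 - 512) // 160 + 1 = 97 frames,
# each of which would then be passed to the spectral analysis module.
frames = frame_signal([0.0] * 16000)
```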
Once the feature extraction has been realised, a hypothesis generation 8 starts where use is made of the neural networks. The hypothesis generation step comprises inter alia a classification for producing a label for a speech segment, for example a word, or some measure of similarity between a speech frame and a "reference" speech fragment. Alternatively the input can be fitted with statistical models yielding probabilistic measures of the uncertainty of the fit.
After the hypothesis generation step 8, a cost estimation step 9 starts where, for determining the minimum cost match, use is for example made of the dynamic programming algorithm of Bellman, as described for example by R. Bellman and S. Dreyfus in "Applied Dynamic Programming", Princeton University Press, 1962. The recognition 10 itself is then realised once the cost estimation is achieved. Before starting a detailed description of the invention, some general knowledge of speech recognition will be given in order to have a clear definition of the used terminology.
Most state-of-the-art speech recognizers are based on Hidden Markov Models (HMM), which is a statistical approach. In this formalism, the speech is supposed to be produced by a particular finite state automaton built up from a set of classes Q = { q1, q2, ..., qK } governed by statistical laws. In that case, each speech unit (e.g. each vocabulary word or each phoneme) is associated with a particular HMM made up of L classes ql ∈ Q, with l = 1, ..., L, according to a predefined topology. In the HMM approach, one has to estimate the probability of an observed spectrum for each hypothetical speech sound, as well as the probability of each permissible transition. The negative logs of these probabilities can be used as distances in the dynamic programming algorithm [Bellman & Dreyfus, 1962] to determine the minimum cost path (defined as the match with the minimum sum of local distances plus any cost for permitted transitions). This path represents the best warping of the models to match the data.
In a model for a speech sound (phoneme), the sound has a beginning, middle, and end, each with its own properties. The speech is assumed to remain entirely in one of these "classes" for each frame (e.g., 10 msec), after which it can proceed to the next permissible class.
Associated with each transition is a probability p(xt, ql | qk) of emitting a speech feature vector xt when moving from a present class qk to a new class ql. A distinction is made between an emission probability p(xt | qk) (for each class qk) and a transition probability p(ql | qk) (for each transition qk → ql). For any particular utterance, the observed features have a probability (for any hypothesized path through the possible classes) that is the product of the emission probabilities for each class and the corresponding transitions. This is true because of an assumed independence between the local probabilities. For instance, suppose a path of q1 → q1 → q2, and input features x1, x2, x3. The probability of the assumed path would then be p(x1 | q1).p(q1 | q1).p(x2 | q1).p(q2 | q1).p(x3 | q2).
Taking negative logs to get costs, addition takes the place of multiplication, and dynamic programming can be used to determine the least-cost path.
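The least-cost path computation just described can be sketched as follows. This is a minimal dynamic-programming sketch, not the patent's implementation: only the best cost is returned, without backtracking, and all probability tables are hypothetical inputs.

```python
import math

def min_cost_path(emission, transition, initial):
    """Dynamic programming over negative-log costs.

    emission[t][k]  : p(x_t | q_k) for frame t and class k
    transition[k][l]: p(q_l | q_k)
    initial[k]      : probability of starting in class k
    """
    K = len(initial)
    # Cost of ending in each class after the first frame.
    cost = [-math.log(initial[k]) - math.log(emission[0][k]) for k in range(K)]
    for t in range(1, len(emission)):
        # Addition of costs takes the place of multiplication of probabilities.
        cost = [min(cost[k] - math.log(transition[k][l]) for k in range(K))
                - math.log(emission[t][l])
                for l in range(K)]
    return min(cost)
```

The inner `min` selects the cheapest predecessor class for each current class, which is exactly the "minimum sum of local distances plus transition costs" criterion of the text.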
For continuous speech recognition, phonemic HMMs can be concatenated to represent words, which in turn can be concatenated to represent complete utterances. Model variations can also be introduced to represent common effects of coarticulation between neighboring phonemes or words. In particular, context-dependent phone models such as for example triphones can be used to represent allophonic variations caused by coarticulation from neighboring phones. In this case, sequences of three phonemes are considered to capture the coarticulation effects. Each phoneme has several models associated with it depending on its left and right phonemic contexts. Of course, the drawback of this approach is the drastically increased number of models and, consequently, the number of parameters to determine. If there are M phonemes and K possible classes, there is a maximum of K X M X M possible phonemic contexts for each class; even if all of them are not permissible (because of phonological rules or the clustering of similar contexts), the number of possible triphone models remains very large.
The article entitled "Continuous speech recognition using Multilayer Perceptrons (MLP) with Hidden Markov Models", written by the present inventors and published in IEEE 90 CH 2847-2, p. 413-416, describes how MLPs, a particular form of ANNs, are used to compute the emission probabilities used in HMM systems. In these studies it is shown that if each output unit of an MLP is associated with a particular class qk of the set of classes Q = { q1, q2, ..., qK } on which the Markov models are defined, it is possible to train the MLP to generate probabilities like p(qk | xt) when xt is provided to its input. Probabilities like p(qk | xt) are generally referred to as Bayes probabilities or a posteriori probabilities and can be transformed into likelihoods for use as emission probabilities in HMMs by Bayes' rule:

p(xt | qk) = p(qk | xt) . p(xt) / p(qk)

As shown in the referred article, the advantage of such an approach is the possibility to estimate the emission probabilities needed for the HMMs with better discriminant properties and without any hypotheses about the statistical distribution of the data. Since the result holds with a modified input field to the MLP which takes the context or other information into account, it is also shown how this approach allows other major drawbacks of HMMs to be overcome.
As described above, MLPs are provided for estimating emission probabilities for HMMs. It has also been shown that these estimates have led to improved performance over counting estimation or Gaussian estimation techniques in the case where a fairly simple HMM was used. However, current state-of-the-art continuous speech recognizers require HMMs with greater complexity, e.g. multiple densities per phone and/or context-dependent models. State-of-the-art HMM-based speech recognizers now model context-dependent phonetic units such as triphones instead of phonemes to improve their performance. For instance, returning to the example already given, the English word "cat" is composed of three phonemes: the "k" sound, the short "a", and the final "t". In the standard phonetic approach, the Markov model of the word "cat" is then obtained by concatenating the models of its constituting phonemes, i.e. "k-a-t". In the triphone approach, the model of a phoneme depends on its left and right phonetic contexts, and the sequence of models constituting the isolated word "cat" is now "#ka-kat-at#", where "#" is the "nothing" or "silence" symbol. In this example, "#ka" represents the model of phoneme "k" with a phoneme "#" on its left and a phoneme "a" on its right. This approach takes phonetic coarticulation into account. In this case, the emission probabilities p(xt | qk) that have to be estimated for use in HMMs (or hybrid ANN/HMMs) are replaced by p(xt | qk, cl, cr), i.e. the probability of observing the acoustic feature vector xt on the current phonemic class qk with phonemic contexts cl on its left and cr on its right, the contextual models cl and cr belonging to a set C = { c1, ..., cm, ..., cM }. Each class qk is represented by at least one model. The models of said set C are governed by statistical laws.
However, the difficulty with these more complex models is that many more parameters must be estimated with the same limited amount of data. Indeed, if there are K possible classes and M possible phonemic contextual models, we have K X M X M possible combinations of (qk, cl, cr). With neural networks as well, this is a significant problem. Whether likelihoods are generated by MLPs having K X M X M output units or by standard training methods for HMMs, neither will be a good probabilistic estimate for a phonetic condition that is rarely or never observed. Furthermore, a simple application of the known techniques to triphones, for instance, would result in an output layer with thousands of units, and many millions of connections (i.e. parameters) to train. This is rather cumbersome for the present data sets, which have on the order of 100,000 to 1,000,000 training tokens. In HMM systems, these problems have been handled by interpolating between levels of context-dependence, i.e. phones, biphones and triphones, depending on the frequency of occurrence of each level. In this case, p(xt | qk, cl, cr) is expressed in terms of estimates of less detailed models such as p(xt | qk, cl), p(xt | qk, cr) and p(xt | qk). In fact, this solution reflects the trade-off between good (i.e. detailed) models that are poorly estimated because of not enough training material and rough models that are very well estimated thanks to their limited number of parameters. As in the earlier discussion, each of these probabilities is typically estimated using restrictive assumptions, such as the form of distribution or statistical independence between multiple features.
The main problem in this contextual modelling lies in the estimation of emission probabilities like:

p(xt | qk, cl, cr)     (1)

In order to estimate those probabilities, use is made of an artificial neural network set-up having K X M X M output units. Based on statistical mathematical rules the following relations are given:

p(cl, cr, qk, xt) = p(cl | cr, qk, xt) . p(cr | qk, xt) . p(qk, xt)     (2)

and

p(cl, cr, qk) = p(cl | cr, qk) . p(cr | qk) . p(qk)     (3)

Application of Bayes' law to the emission probability (1) now gives

p(xt | qk, cl, cr) = p(cl, cr, qk, xt) / p(cl, cr, qk)     (4)

Substitution of (2) and (3) in (4), with p(qk, xt) = p(qk | xt) . p(xt), now gives:

p(xt | qk, cl, cr) = [ p(cl | cr, qk, xt) . p(cr | qk, xt) . p(qk | xt) . p(xt) ] / [ p(cl | cr, qk) . p(cr | qk) . p(qk) ]     (5)
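The decomposition of the emission probability into a product of conditionals is an exact identity of probability theory, which can be checked numerically on a small synthetic joint distribution. The variable sizes and the random joint below are arbitrary illustrations, not data from the patent.

```python
import itertools
import random

random.seed(1)

# A tiny synthetic joint distribution p(c_l, c_r, q_k, x_t) over
# 2 left contexts, 2 right contexts, 2 classes and 3 input symbols.
keys = list(itertools.product(range(2), range(2), range(2), range(3)))
raw = {k: random.random() for k in keys}
total = sum(raw.values())
joint = {k: v / total for k, v in raw.items()}

NAMES = ("cl", "cr", "q", "x")

def marg(**fixed):
    """Marginal probability with the named variables held fixed."""
    return sum(p for k, p in joint.items()
               if all(k[NAMES.index(n)] == v for n, v in fixed.items()))

cl, cr, q, x = 0, 1, 1, 2
# Direct conditional: p(x | q, cl, cr) = p(cl, cr, q, x) / p(cl, cr, q).
direct = marg(cl=cl, cr=cr, q=q, x=x) / marg(cl=cl, cr=cr, q=q)
# Factored form: the conditionals that the separate networks would estimate.
numer = (marg(cl=cl, cr=cr, q=q, x=x) / marg(cr=cr, q=q, x=x)) \
      * (marg(cr=cr, q=q, x=x) / marg(q=q, x=x)) \
      * (marg(q=q, x=x) / marg(x=x)) * marg(x=x)
denom = (marg(cl=cl, cr=cr, q=q) / marg(cr=cr, q=q)) \
      * (marg(cr=cr, q=q) / marg(q=q)) * marg(q=q)
assert abs(direct - numer / denom) < 1e-12
```

The numerator and denominator telescope back to the joint probabilities of (4), which is why the equality holds without any independence assumption.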
As will be described hereunder, this transformation, based on a well defined mathematical transformation of the emission probabilities to be calculated, will enable a precise calculation of the latter without making any assumption. The gist of the present invention is to have made a precise choice among the different mathematical possibilities to transform the emission probability (1) expression to be calculated. This choice enables a substantial simplification of the neural networks to be used for calculating the latter emission probability.
As can be deduced from expression (5), the neural network having K X M X M output units can now be split into networks having K + M + M or K + M output units. The probability can now be determined without any particular simplifying assumption. Based on the theory of hybrid ANN/HMM for phoneme models, as briefly discussed herebefore, i.e. in classification mode where the output values of the ANN are estimates of the a posteriori probabilities of the output classes conditioned on the input, all probabilities present in expression (5) can be estimated by a respective neural network.
* p(qk | xt) is estimated by a first neural network provided for modeling phonemes, where the input field contains the current feature vector xt only and the output units are associated with the current class qk. Such a neural network is described in detail in the article of H. Bourlard and C. Wellekens, entitled "Links Between Markov Models and Multilayer Perceptrons", published in IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 12, n° 12, December 1990, p. 1167-1178.
* p(cr | qk, xt) is estimated by a second neural network (as illustrated in figure 3), in which the output units (17) are associated with the right phonemes cr of the triphone and in which the input field is constituted by the elements xti (1 ≤ i ≤ I) of the current acoustic vector xt and the current class qk associated with xt.
* p(cl | cr, qk, xt) is estimated by a third neural network, as illustrated in figure 4, in which the output units are associated with the left phonemes cl of the triphones and in which the input field is constituted by the current acoustic vector xt, the current class qk and the right phonetic contexts cr in the triphones.
* p(cl | cr, qk) is estimated by a fourth neural network in which the output units are associated with the left phonemes cl of the triphones and where the input field represents the current class qk and the right phonemes cr. This provides the a priori probability of observing a particular phoneme in the left part of a triphone given a particular current class and right phonetic context.
* p(cr | qk) is estimated by a fifth neural network in which the output units are associated with the right phonemes cr of the triphones and where the input field represents the current class qk. This provides the a priori probability of observing a particular phoneme on the right side of a particular class. Given the limited number of parameters in this model (i.e. K X M), this probability can also be estimated by counting (i.e., this does not require a neural network).
* p(qk) is the a priori probability of a phoneme as also used in the standard hybrid ANN/HMM phonetic approach, and is simply estimated by counting on the training set. No neural network is required for determining this probability.
* p(xt) is a constant value independent of the classes and is therefore not important for classification purposes. No neural network is required for determining this probability.
As set out herebefore, the calculation of the emission probability is thus done by the first neural network and by further neural networks, each of which is provided for calculating a posteriori probabilities of each of said contextual models, i.e. of cl and cr, conditioned on the current class qk. For limited training sets, these estimates may still need to be smoothed with monophone models, as is done in conventional HMM systems. Additionally, if cl and cr represent broad phonetic classes or clusters rather than phonemes, the above results apply to the estimation of "generalized triphones". Finally, when only a left or a right context is used, this technique requires only 2 networks: the monophone network and one that estimates p(c | qk, xt).
The input field containing the acoustic data (e.g., xt) may also be supplied with contextual information. In this case, the probabilities conditioned on xt have to be replaced by probabilities conditioned on xt together with its surrounding acoustic vectors. This leads then to the estimation of triphone probabilities given acoustic contextual information, which is even more important in the case of triphone models.
As set out herebefore, the emission probabilities for triphone models can now be calculated without making any assumption. However, the amount of calculation to be performed by each neural network remains rather large. For example, in the case of the second neural network (figure 3), a computation has to be performed K X M times. If sufficiently powerful neural networks were available, this wouldn't be a major problem.
The amount of calculation to be done can however be reduced by making a simple restriction on the network topology. As is shown in figure 3, the network comprises two separate sections which are joined only at the final layer. Calculations applied to the inputted feature vector xt are, at the lower layers, separated from those applied to the classes qk. This restriction is possible since the classes have binary values and belong to a finite set of states. It permits the pre-calculation of the contextual contributions to the outputs. This computation is done at the end of the training phase, prior to the recognition of any pattern.
Considering the second neural network shown in figure 3, which is provided for determining p(cr | qk, xt), the feature vectors xt inputted on unit 12 are supplied to the hidden units of layer 14. Each hidden unit h (1≤h≤H) provides a weighted sum value

zh = f( ∑i dih xit )

wherein f is a standard sigmoid function and dih is a weighting factor.
The weighted sum values zh are then supplied to M summing units (1≤j≤M) which are provided for determining the first pre-sigmoid value

Zj(xt) = ∑h bhj zh

bhj being a weighting factor.
A comparable set-up is realized for the class qk (1≤k≤K) supplied to input 11. A hidden layer 13 is provided for determining a further weighted sum value

yl = f( ∑k wkl qk )   (1≤l≤L)

wherein wkl are trained weighting factors. The latter weighted sum values yl are supplied to M summing units (15) which are provided for determining the pre-sigmoid value

Yj(qk) = ∑l alj yl

wherein alj are also trained weighting factors.
The probability p(cr | qk, xt) is determined by the upper layer 17 provided for calculating f(Yj + Zj).
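As a rough sketch (not the patented implementation itself), the partitioned forward pass of figure 3 can be written with NumPy; all weight matrices are illustrative stand-ins for the trained factors dih, bhj, wkl and alj:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def factored_forward(x_t, q_onehot, d, b, w, a):
    """Forward pass of the partitioned network of figure 3: the
    acoustic branch and the class branch share no hidden units and
    meet only at the output layer, which computes f(Yj + Zj)."""
    z = sigmoid(x_t @ d)       # hidden units zh = f(sum_i dih * xit)
    Z = z @ b                  # pre-sigmoid values Zj(xt) = sum_h bhj * zh
    y = sigmoid(q_onehot @ w)  # hidden units yl = f(sum_k wkl * qk)
    Y = y @ a                  # pre-sigmoid values Yj(qk) = sum_l alj * yl
    return sigmoid(Y + Z)      # output units f(Yj + Zj)
```

Because no hidden unit sees both inputs, the term Y depends only on the one-of-K class vector, which is what makes the pre-calculation described next possible.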
By partitioning the net so that no hidden unit receives input from both the context (c) and the input feature vector (xt), a simplification is obtained. Further, since for each class the pre-sigmoid value Yj is independent of the inputted feature vector, this pre-sigmoid value can be pre-calculated for all possible classes. Those pre-sigmoid values Yj(qk) are then stored into a memory so that it is no longer necessary to recalculate them for each probability p(cr | qk, xt) to be determined. In order to provide a suitable addressing for the predetermined values Yj(qk), a 2-dimensional matrix set-up is chosen wherein the K values associated with each output unit j (1≤j≤M) are stored according to a K X M matrix. A simple addressing by means of the inputted qk value will provide the corresponding Yj values, which are then used for calculating f(Yj + Zj). The major new computation (in comparison with the monophone case) is then simply the cost of some lookups for the contextual contribution and of the final sigmoidal nonlinearity, which must now be re-computed for each hypothesized triphone.
The set-up as described hereinabove gives the maximum pre-calculation possibilities together with the storage of the pre-calculated values. It will however be clear that alternative embodiments with less pre-calculation are also possible. For example, it would be possible to pre-calculate only the yl values and store them into a memory addressable by the inputted qk values.
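A hedged sketch of this pre-calculation step: at the end of training, the class-branch contribution Yj(qk) is evaluated once for every one-of-K class input and stored as the K X M table described above (the weight matrices `w` and `a` are illustrative):

```python
import numpy as np

def precompute_Y_table(w, a):
    """Pre-calculate, once at the end of training, the class-branch
    pre-sigmoid values Yj(qk) for every one-of-K class input; the
    result is the K x M lookup table described in the text.
    `w` (K x L) and `a` (L x M) stand in for the trained factors."""
    K = w.shape[0]
    one_hot = np.eye(K)                       # every possible class input qk
    y = 1.0 / (1.0 + np.exp(-(one_hot @ w)))  # hidden activations yl per class
    return y @ a                              # row k holds Yj(qk), 1 <= j <= M
```

At recognition time, `table[k]` then replaces the whole class-branch computation: only the lookup and the final sigmoid remain.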
A comparable set-up to the one illustrated in figure 3 is also applied for the third neural network, which is provided for determining p(cl | qk, cr, xt). For each set of inputted values (qk, crn), a pre-sigmoid value Yj(qk, crn) can be pre-calculated and stored in a memory. The calculation for the feature vector xt in order to obtain the pre-sigmoid value

Zj(xt) = ∑h bhj zh

is analogous to the one described with respect to figure 3.
Since there is now an input to hidden layer 22 both from unit 20, to which the qk values are presented, and from unit 21, to which the crn values are presented, the calculation of the pre-sigmoid value Yj(qk, crn) will be described in detail. Hidden layer 22 is provided for determining the values

yl = f( ∑k skl qk + ∑n rnl crn )   (6)

where f is again a standard sigmoid function and skl and rnl are trained weighting factors. The pre-sigmoid value is then determined by the adders 23:

Yj(qk, crn) = ∑l alj yl   (7)
As can be seen from the expressions (6) and (7), the pre-sigmoid value Yj(qk, crn) is dependent on both input values qk and crn, which thus provides K X M values for each Yj. In order to provide a suitable addressing for the predetermined values Yj(qk, crn) stored into a memory, a 3-dimensional matrix set-up is chosen wherein the K X M X M values associated with each output unit j (1≤j≤M) are stored according to a K X M X M matrix. Given a particular j, this provides a K X M matrix wherein at the kth row and nth column there is stored the pre-sigmoid value Yj(qk, crn). The stored values are thus easily addressed by the inputted qk and crn values, together forming an address indicating the matrix position wherein the pre-sigmoid value is stored.
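The pre-calculation for the third network can be sketched the same way, under the assumption that eqs. (6) and (7) hold with one-of-K and one-of-M binary inputs (the weight matrices `s`, `r` and `a` are illustrative stand-ins for skl, rnl and alj):

```python
import numpy as np

def precompute_Y3_table(s, r, a):
    """Pre-calculate Yj(qk, crn) for every (class, contextual model)
    pair of the third network, stored as the K x M x M matrix
    described in the text."""
    K, M = s.shape[0], r.shape[0]
    table = np.empty((K, M, a.shape[1]))
    for k in range(K):
        for n in range(M):
            # one-of-K / one-of-M inputs reduce eq. (6) to yl = f(skl + rnl)
            y = 1.0 / (1.0 + np.exp(-(s[k] + r[n])))
            table[k, n] = y @ a   # eq. (7): Yj(qk, crn) = sum_l alj * yl
    return table

# At recognition time, table[k, n] is the address (kth row, nth column)
# holding the M pre-sigmoid values for the pair (qk, crn).
```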
The set-up described hereinbefore is not only applicable in the case of triphones but also in the more general context of calculating probabilities of observing a feature vector (xt) on a class q conditioned on predetermined contextual models c. For the estimation of the probability of observing a current class q with a particular neighbouring contextual model c, the expression

p(q, c | xt) = p(q | xt) . p(c | q, xt)

is used.
The probability is thus decomposed into the product of a posteriori probabilities. This reduces the training of a single network with K X M outputs to the training of two networks with K and M outputs respectively, thus providing a potentially huge saving in time and in parameters. By assuming that no hidden units are shared between the inputs for q and xt, the contribution to the output vector (pre-sigmoid) originating from q can be pre-computed for all values of q and c.
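A small illustration of the decomposition and of the claimed saving (the sizes K and M are hypothetical):

```python
# A single joint network over all (q, c) pairs needs K * M output
# units, while the decomposed pair of networks needs only K + M.
K, M = 50, 50
joint_outputs = K * M
decomposed_outputs = K + M

def joint_posterior(p_q_given_x, p_c_given_qx):
    """p(q, c | xt) = p(q | xt) * p(c | q, xt), per the expression above."""
    return p_q_given_x * p_c_given_qx
```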

Claims

1. A pattern recognition device comprising an artificial neural network set-up having K X M output units and provided for calculating probabilities of observing a feature vector (xt) on a class (qk ) (1≤k≤K) conditioned on predetermined contextual models (cm)
(1≤m≤M), said device having an input for receiving a data stream and comprising sampling means provided for sequentially fetching data samples by sampling said data stream and for determining said feature vector (xt) from a data sample, each of said classes being represented by at least one model belonging to a finite set of models (M) governed by statistical laws, characterized in that said neural network set-up is split into a first neural network having K output units and being provided for calculating a posteriori probabilities of said class (qk ) given said observed vector (xt), and at least one further neural network having M output units and provided for calculating a posteriori probabilities of said contextual models, conditioned on said class.
2. A pattern recognition device as claimed in claim
1, characterized in that said further neural network is provided for determining, independently of each other, a first Zj(xt), respectively a second Yj(c), pre-sigmoid output value, said first, respectively said second, pre-sigmoid value being determined from an inputted feature vector, respectively from inputted classes, and in that said further neural network comprises a set of upper units provided for determining p(c | qk, xt) values from said pre-sigmoid output values.
3. A pattern recognition device as claimed in claim
2, characterized in that said further neural network comprises a first hidden layer provided for determining, upon a received feature vector xt, values

zh = f( ∑i dih xit )

wherein dih is a weighting factor, f a sigmoid function and 1≤h≤H, H being the total number of hidden units in said first hidden layer, said first hidden layer being connected with summing units provided for determining said first pre-sigmoid value

Zj(xt) = ∑h bhj zh

wherein bhj is a weighting factor.
4. A pattern recognition device as claimed in claim 2 or 3, characterized in that said further neural network comprises a memory provided for storing said second pre-sigmoid output value Yj(c), said device further comprises an address generator provided for generating upon a received class qk an address wherein the second pre-sigmoid value Yj(c) assigned to said class qk is stored.
5. A pattern recognition device as claimed in claim 2 or 3, characterized in that it comprises a second hidden layer provided for determining, upon a received class qk, further values

yl = f( ∑k wkl qk )

wherein wkl are trained weighting factors and f a sigmoid function, said second hidden layer being connected with a further summing unit provided for determining said second pre-sigmoid value

Yj(c) = ∑l alj yl

wherein alj are trained weighting factors, 1≤l≤L, L being the total number of hidden units in said second hidden layer.
6. A pattern recognition device as claimed in claim 2 or 3, characterized in that it comprises a memory provided for storing third pre-sigmoid output values Yj(qk, cm) determined on inputted classes (qk) and contextual models (cm), said pre-sigmoid values being storable according to a K X M X M matrix, said device further comprising an address generator provided for generating, upon a received (qk, cm) set, an address wherein the third pre-sigmoid values assigned to said set are stored.
7. A pattern recognition device, in particular a speech recognition device, as claimed in anyone of the claims 1-6, characterized in that said class and said contextual models together form a triphone, said first network being provided for calculating

p(qk | xt)

, said further networks comprising a second, respectively a third, a fourth and a fifth network provided for calculating

p(cr | qk, xt), respectively p(cl | qk, cr, xt), p(cr | qk) and p(cl | qk, cr).
8. A pattern recognition device as claimed in claim 7, characterized in that said network set-up is provided for outputting

p(qk | xt) . p(cr | qk, xt) . p(cl | qk, cr, xt) / ( p(qk) . p(cr | qk) . p(cl | qk, cr) )
9. A memory intended to be used in a pattern recognition device as claimed in claim 4 or 6, characterized in that said pre-sigmoid values are stored into said memory.
PCT/BE1991/000058 1991-08-19 1991-08-19 A pattern recognition device using an artificial neural network for context dependent modelling WO1993004468A1 (en)

Priority Applications (4)

Application Number Priority Date Filing Date Title
DE69126983T DE69126983T2 (en) 1991-08-19 1991-08-19 DEVICE FOR PATTERN RECOGNITION WITH AN ARTIFICIAL NEURAL NETWORK FOR CONTEXT-RELATED MODELING
PCT/BE1991/000058 WO1993004468A1 (en) 1991-08-19 1991-08-19 A pattern recognition device using an artificial neural network for context dependent modelling
EP91914807A EP0553101B1 (en) 1991-08-19 1991-08-19 A pattern recognition device using an artificial neural network for context dependent modelling
JP51351991A JP3168004B2 (en) 1991-08-19 1991-08-19 Pattern recognition device using artificial neural network for context-dependent modeling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/BE1991/000058 WO1993004468A1 (en) 1991-08-19 1991-08-19 A pattern recognition device using an artificial neural network for context dependent modelling

Publications (1)

Publication Number Publication Date
WO1993004468A1 true WO1993004468A1 (en) 1993-03-04

Family

ID=3885294

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/BE1991/000058 WO1993004468A1 (en) 1991-08-19 1991-08-19 A pattern recognition device using an artificial neural network for context dependent modelling

Country Status (4)

Country Link
EP (1) EP0553101B1 (en)
JP (1) JP3168004B2 (en)
DE (1) DE69126983T2 (en)
WO (1) WO1993004468A1 (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9728184B2 (en) 2013-06-18 2017-08-08 Microsoft Technology Licensing, Llc Restructuring deep neural network acoustic models
US9589565B2 (en) 2013-06-21 2017-03-07 Microsoft Technology Licensing, Llc Environmentally aware dialog policies and response generation
US9311298B2 (en) 2013-06-21 2016-04-12 Microsoft Technology Licensing, Llc Building conversational understanding systems using a toolset
US9324321B2 (en) 2014-03-07 2016-04-26 Microsoft Technology Licensing, Llc Low-footprint adaptation and personalization for a deep neural network
US9529794B2 (en) 2014-03-27 2016-12-27 Microsoft Technology Licensing, Llc Flexible schema for language model customization
US9520127B2 (en) 2014-04-29 2016-12-13 Microsoft Technology Licensing, Llc Shared hidden layer combination for speech recognition systems

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
COMPUTER SPEECH AND LANGUAGE. vol. 3, no. 1, January 1990, LONDON GB pages 1 - 19; BOURLARD, WELLEKENS: 'Speech pattern discrimination and multilayer perceptrons' *
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE vol. 12, no. 12, December 1990, NEW YORK US pages 1167 - 1178; BOURLARD, WELLEKENS: 'Links between Markov models and multilayer perceptrons' cited in the application *
INTERNATIONAL CONFERENCE ON ACOUSTICS SPEECH AND SIGNAL PROCESSING vol. 1, 14 May 1991, TORONTO CANADA pages 109 - 112; HOCHBERG ET AL: 'Hidden Markov Model/Neural Network training techniques for connected alphadigit speech recognition' *
INTERNATIONAL CONFERENCE ON ACOUSTICS SPEECH AND SIGNAL PROCESSING vol. 1, 3 April 1990, ALBUQUERQUE NEW MEXICO USA pages 413 - 416; MORGAN, BOURLARD: 'Continuous speech recognition using multilayer perceptrons with hidden Markov models' cited in the application *
INTERNATIONAL CONFERENCE ON ACOUSTICS SPEECH AND SIGNAL PROCESSING vol. 1, 3 April 1990, ALBUQUERQUE NEW MEXICO USA pages 417 - 420; NILES, SILVERMAN: 'Combining Hidden Markov Models and Neural Network classifiers' *
INTERNATIONAL CONFERENCE ON ACOUSTICS SPEECH AND SIGNAL PROCESSING vol. 1, 3 April 1990, ALBUQUERQUE NEW MEXICO USA pages 421 - 423; MA, VAN COMPERNOLLE: 'TDNN Labelling for a HMM Recognizer' *
SPEECH TECHNOLOGY, vol. 5, no. 3, February 1991, NEW YORK US pages 102 - 107; LEVIN: 'Connected word recognition using hidden control neural architecture' *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU667405B2 (en) * 1992-10-30 1996-03-21 Alcatel N.V. A method of word string segmentation in the training phase of a connected word recognizer
US9614724B2 (en) 2014-04-21 2017-04-04 Microsoft Technology Licensing, Llc Session-based device configuration
US9384334B2 (en) 2014-05-12 2016-07-05 Microsoft Technology Licensing, Llc Content discovery in managed wireless distribution networks
US9384335B2 (en) 2014-05-12 2016-07-05 Microsoft Technology Licensing, Llc Content delivery prioritization in managed wireless distribution networks
US9430667B2 (en) 2014-05-12 2016-08-30 Microsoft Technology Licensing, Llc Managed wireless distribution network
US10111099B2 (en) 2014-05-12 2018-10-23 Microsoft Technology Licensing, Llc Distributing content in managed wireless distribution networks
US9874914B2 (en) 2014-05-19 2018-01-23 Microsoft Technology Licensing, Llc Power management contracts for accessory devices
US10691445B2 (en) 2014-06-03 2020-06-23 Microsoft Technology Licensing, Llc Isolating a portion of an online computing service for testing
US9367490B2 (en) 2014-06-13 2016-06-14 Microsoft Technology Licensing, Llc Reversible connector for accessory devices
US9477625B2 (en) 2014-06-13 2016-10-25 Microsoft Technology Licensing, Llc Reversible connector for accessory devices

Also Published As

Publication number Publication date
EP0553101B1 (en) 1997-07-23
EP0553101A1 (en) 1993-08-04
JP3168004B2 (en) 2001-05-21
DE69126983D1 (en) 1997-09-04
DE69126983T2 (en) 1998-03-05
JPH06502927A (en) 1994-03-31

Similar Documents

Publication Publication Date Title
Prabhavalkar et al. A Comparison of sequence-to-sequence models for speech recognition.
Ostendorf et al. From HMM's to segment models: A unified view of stochastic modeling for speech recognition
Morgan et al. Continuous speech recognition using multilayer perceptrons with hidden Markov models
Neto et al. Speaker-adaptation for hybrid HMM-ANN continuous speech recognition system
EP0553101B1 (en) A pattern recognition device using an artificial neural network for context dependent modelling
US5839105A (en) Speaker-independent model generation apparatus and speech recognition apparatus each equipped with means for splitting state having maximum increase in likelihood
Bourlard et al. CDNN: A context dependent neural network for continuous speech recognition
Bengio A connectionist approach to speech recognition
US5129001A (en) Method and apparatus for modeling words with multi-arc markov models
US5924066A (en) System and method for classifying a speech signal
EP0762383B1 (en) Pattern adapting apparatus for speech or pattern recognition
Konig et al. GDNN: a gender-dependent neural network for continuous speech recognition
Mohamed et al. HMM/ANN hybrid model for continuous Malayalam speech recognition
Frankel et al. Speech recognition using linear dynamic models
EP0725383B1 (en) Pattern adaptation system using tree scheme
JP3589044B2 (en) Speaker adaptation device
JPH064097A (en) Speaker recognizing method
CN114333768A (en) Voice detection method, device, equipment and storage medium
Jang et al. A new parameter smoothing method in the hybrid TDNN/HMM architecture for speech recognition
Verhasselt et al. Context modeling in hybrid segment-based/neural network recognition systems
Juang et al. Dynamic programming prediction errors of recurrent neural fuzzy networks for speech recognition
Ma et al. TDNN fuzzy vector quantizer for a hybrid TDNN/HMM system
Schussler et al. A fast algorithm for unsupervised incremental speaker adaptation
Schuster et al. Neural networks for speech processing
Rigoll Information theory-based supervised learning methods for self-organizing maps in combination with hidden Markov modeling

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AU CA JP US

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH DE DK ES FR GB GR IT LU NL SE

WWE Wipo information: entry into national phase

Ref document number: 1991914807

Country of ref document: EP

WWP Wipo information: published in national office

Ref document number: 1991914807

Country of ref document: EP

NENP Non-entry into the national phase

Ref country code: CA

WWG Wipo information: grant in national office

Ref document number: 1991914807

Country of ref document: EP