CA1236578A - Feneme-based markov models for words - Google Patents

Feneme-based markov models for words

Info

Publication number
CA1236578A
Authority
CA
Canada
Prior art keywords
label
word
standard
labels
transition
Prior art date
Legal status
Expired
Application number
CA000496161A
Other languages
French (fr)
Inventor
Lalit R. Bahl
Peter V. Desouza
Robert L. Mercer
Michael A. Picheny
Current Assignee
International Business Machines Corp
Original Assignee
International Business Machines Corp
Priority date
Filing date
Publication date
Application filed by International Business Machines Corp
Application granted
Publication of CA1236578A
Expired

Landscapes

  • Complex Calculations (AREA)

Abstract

FENEME-BASED MARKOV MODELS FOR WORDS

Abstract In a speech recognition system, apparatus and method for modelling words with label-based Markov models is disclosed. The modelling includes:
entering a first speech input, corresponding to words in a vocabulary, into an acoustic processor which converts each spoken word into a sequence of standard labels, where each standard label corresponds to a sound type assignable to an interval of time; representing each standard label as a probabilistic model which has a plurality of states, at least one transition from a state to a state, and at least one settable output probability at some transitions; entering selected acoustic inputs into an acoustic processor which converts the selected acoustic inputs into personalized labels, each personalized label corresponding to a sound type assigned to an interval of time; and setting each output probability as the probability of the standard label represented by a given model producing a particular personalized label at a given transition in the given model. The present invention addresses the problem of generating models of words simply and automatically in a speech recognition system.

Description

FENEME-BASED MARKOV MODELS FOR WORDS

FIELD OF INVENTION

The present invention is related to speech recognition, and more particularly to speech recognition systems using statistical Markov models for the words of a given vocabulary.

BACKGROUND OF THE INVENTION

In current speech recognition systems, there are two commonly used techniques for acoustic modeling of words. The first technique uses word templates, and the matching process for word recognition is based on Dynamic Programming (DP) procedures. Samples for this technique are given in an article by F. Itakura, "Minimum Prediction Residual Principle Applied to Speech Recognition,"
IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-23, 1975, pp. 67-72, and in U.S. Patent 4,181,821 to F. C. Pirz and L. R. Rabiner entitled "Multiple Template Speech Recognition System."

The other technique uses phone-based Markov models which are suited for probabilistic training and decoding algorithms. A description of this technique and related procedures is given in an article by F. Jelinek, "Continuous Speech Recognition by Statistical Methods," Proceedings of the IEEE, Vol. 64, 1976, pp. 532-556.

Three aspects of these models are of particular interest:

(1) Word Specificity - word templates are better for recognition because they are constructed from an actual sample of the word. Phonetics-based models are derived from man-made phonetic baseforms and represent an idealized version of the word which actually may not occur;
(2) Trainability - Markov models are superior to templates because they can be trained, e.g. by the Forward-Backward algorithm (described in the Jelinek article). Word templates use distance measures such as the Itakura distance (described in the Itakura article), spectral distance, etc., which are not trained. One exception is a method used by Bakis which allows training of word templates (R. Bakis, "Continuous Speech Recognition Via Centisecond Acoustic States," IBM Research Report RC 5971, April 1976).
(3) Computational Speed - Markov models which use discrete acoustic processor output alphabets are substantially faster in computational speed than Dynamic Programming matching (as used by Itakura) or continuous parameter word templates (as used by Bakis).

OBJECTS OF THE INVENTION

It is an object of the present invention to devise a method of acoustic modeling which has word specificity as with word templates but also offers the trainability that is available in discrete alphabet Markov models.

It is a further object to provide acoustic word models for speech recognition which are uncomplicated but allow high speed operation during the recognition processes.

DISCLOSURE OF THE INVENTION

According to the invention, for the generation of a word model, first an acoustic signal representing the word is converted to a string of standard labels from a discrete alphabet. Each label represents a respective time interval of the word. Each standard label is then replaced by a probabilistic (e.g. Markov) model to form a baseform model comprised of a number of successive models -- one model for each standard label -- without the probabilities yet entered. Such baseform models are then trained by sample utterances to generate the statistics or probabilities to be applied to the models. Thereafter, the models are used for actual speech recognition.

Advantages of this method are that the generated word models are much more detailed than phone-based models while also being trainable; that the number of parameters depends on the size of the standard label alphabet and not on vocabulary size; and that probabilistic matching with these label-based models is computationally much faster than DP matching with word templates.

These advantages are notable with reference to a technique set forth in an IBM Research Report by Raimo Bakis entitled "Continuous Speech Recognition via Centisecond Acoustic States," dated April 5, 1976. In the Bakis article, a word is defined as a sequence of states with each state being considered different. Hence, if each word typically extended for sixty states and if the vocabulary were 5000 words, the technique disclosed by Bakis would have to consider 300,000 different states. In accordance with the present invention, each state is identified as corresponding to one of on the order of 200 labels. The invention requires memory for the 200 labels which make up the words, which can be stored simply as a sequence of numbers (each number representing a label) rather than for 300,000 states. In addition, less training data is required for the label-based model approach of the present invention. In the Bakis article approach, each word must be spoken to train for each speaker; with the invention, the speaker need utter only enough words to set values associated with 200 standard label models. It is also to be noted that the technique set forth in the Bakis article would treat two words -- such as "boy" and "boys" -- independently. According to the invention, much of the training relating to the standard labels for the word "boy" would also be applied for the word "boys".
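The storage argument can be made concrete with a little arithmetic. The Python sketch below tallies rough parameter counts using the figures quoted above; the per-model parameter layout (three arc probabilities plus two 200-entry output distributions) anticipates the elementary label model described later in this disclosure and is an illustration, not a figure taken from the Bakis report or from this disclosure.

# Rough storage comparison implied by the passage (illustrative assumptions).
VOCAB_SIZE = 5000          # words in the vocabulary
LABELS_PER_WORD = 60       # typical word length in states / labels
ALPHABET_SIZE = 200        # size of the standard (and personalized) label alphabets

# Assumed parameters per elementary label model: three arc probabilities plus
# one 200-entry output distribution for each of the two emitting transitions.
PARAMS_PER_MODEL = 3 + 2 * ALPHABET_SIZE

# Word-specific states (as in the Bakis report): every state of every word
# would carry its own parameter set.
word_specific_states = VOCAB_SIZE * LABELS_PER_WORD            # 300,000 states
word_specific_params = word_specific_states * PARAMS_PER_MODEL

# Label-based models (this invention): only the 200 shared label models carry
# parameters; each word is stored merely as a sequence of label numbers.
shared_params = ALPHABET_SIZE * PARAMS_PER_MODEL
baseform_entries = VOCAB_SIZE * LABELS_PER_WORD                # one label id per position

print(f"word-specific: {word_specific_states:,} states, ~{word_specific_params:,} parameters")
print(f"label-based:   ~{shared_params:,} parameters + {baseform_entries:,} stored label ids")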

The advantages of the invention will become even more apparent by the following presentation of an embodiment which is described with reference to drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic representation of a label string obtained for a word from an acoustic processor.

FIG. 2 shows an elementary Markov model for one label.


FIG. 3 is a baseform for a word, generated by replacing each standard label of the string shown in FIG. 1 by an elementary Markov model.

FIG. 4 is a block diagram of the model generation and recognition procedure according to the invention.

FIG. 5 is a block representation of the process for initial generation of a standard label alphabet.

FIG. 6 is a block representation of the operation of the acoustic processor for deriving a personalized label string for a spoken word.

DETAILED DESCRIPTION
PRINCIPLE OF INVENTION
LABELING OF SPEECH INPUT SIGNAL

A preliminary function for speech recognition and model generation in this system is the conversion of the speech input signal into a coded representation. This is done in a procedure that was described, for example, in "Continuous Speech Recognition with Automatically Selected Acoustic Prototypes Obtained by Either Bootstrapping or Clustering" by A. Nadas et al, Proceedings ICASSP 1981, pp. 1153-1155. For this conversion procedure, fixed-length centisecond intervals of the acoustic input signal are subjected to a spectral analysis, and the resulting information is used to assign to the respective interval a "label", or "feneme", from a finite set (alphabet) of labels each representing a sound type, or more specifically, a spectral pattern for a characteristic 10-millisecond speech interval. The initial selection of the characteristic spectral patterns, i.e., the generation of the label set, is also described in the above mentioned article.

In accordance with the invention, there are different sets of labels. First, there is a finite alphabet of "standard" labels. The standard labels are generated when a first speaker utters into a conventional acoustic processor -- the processor applying conventional methods to perform clustering and labelling therefrom. The standard labels correspond to the first speaker's speech. When the first speaker utters each word in the vocabulary with the standard labels established, the acoustic processor converts each word into a sequence of standard labels. (See for example, TABLE 2 and FIG. 1.) The sequence of standard labels for each word is entered into storage. Second, there are sets of "personalized" labels. These "personalized" labels are generated by a subsequent speaker (the first speaker again or another) who provides speech input to the acoustic processor after the standard labels and sequences of standard labels are set.

The alphabet of standard labels and each set of personalized labels each preferably includes 200 labels, although this number may differ. The standard labels and personalized labels are interrelated by means of probabilistic models associated with each standard label.
Specifically, each standard label is represented by a model having (a) a plurality of states and transitions extending from a state to a state, (b) a probability for each transition in a model, and (c) a plurality of label output probabilities, each output probability at a given transition corresponding to the likelihood of the standard label model producing a particular personalized label at the given transition based on acoustic inputs from a subsequent "training" speaker.


The transition probabilities and label output probabilities are set during a training period during which known utterances are made by the training speaker. Techniques for training Markov models are known and are briefly discussed hereinafter.

An important feature of this labeling technique is that it can be done automatically on the basis of the acoustic signal and thus needs no phonetic interpretation.

Some more details of the labeling technique will be presented in part in the "DETAILS OF THE EMBODIMENT" of this description with reference to FIGS. 5 and 6.

GENERATION OF FENEME-BASED WORD MODELS

The present invention suggests a novel method of generating models of words which is simple and automatic and which leads to a more exact representation than that using phone-based models. For generating the model of a word, the word is first spoken once and a string of standard labels is obtained by the acoustic processor. (See FIG. 1 for the principle and TABLE 2 for samples.) Then each of the standard labels is replaced by an elementary Markov model (FIG. 2) which represents an initial and a final state and some possible transitions between states. The result of concatenating the Markov models for the labels is the model for the whole word.

This model can then be trained and used for speech recognition as is known for other Markov models from the literature, e.g., the above mentioned Jelinek article. The statistical model of each standard label is kept in storage as are, preferably, the word models formed thereof -- the models being stored preferably in the form of tables such as TABLE 3.

In TABLE 3, it is noted that each standard label model M1 through MN has three possible transitions -- each transition having an arc probability associated therewith. In addition, each model M1 through MN has 200 output probabilities -- one for each personalized label -- at each transition. Each output probability indicates the likelihood that the standard label corresponding to model M1 produces a respective personalized label at a certain transition. The transition probabilities and output probabilities may vary from speaker to speaker and are set during the training period. If desired, the transition probabilities and output probabilities may be combined to form composite probabilities, each indicating the likelihood of producing a prescribed output with a prescribed transition occurring.

For a given word, the utterance thereof by a first speaker determines the order of labels and, thus, the order of models corresponding thereto. The first speaker utters all words in the vocabulary to establish the respective order of standard labels (and models) for each word. Thereafter, during training, a subsequent speaker utters known acoustic inputs -- preferably words in the vocabulary. From these utterances of known acoustic inputs, the transition probabilities and output probabilities for a given speaker are determined and stored to thereby "train" the model. In this regard, it should be realized that subsequent speakers need utter only so many acoustic inputs as is required to set the probabilities of the 200 models. That is, subsequent speakers need not utter all words in the vocabulary but, instead, need utter only so many acoustic inputs as necessary to train the 200 models.
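Because all trainable probabilities reside in the 200 shared label models, it is easy to check how much of the alphabet a proposed set of training utterances would exercise. The Python sketch below is purely illustrative; the words and label ids are hypothetical and are not taken from the tables of this disclosure.

# Which of the 200 standard label models would a given set of training words exercise?
ALPHABET_SIZE = 200

word_baseforms = {            # assumed example store: word -> stored standard label ids
    "boy":  [12, 12, 47, 47, 101, 180],
    "boys": [12, 12, 47, 47, 101, 180, 127, 127],
    "for":  [5, 33, 33, 150, 150, 76],
}

def label_coverage(training_words, baseforms, alphabet_size=ALPHABET_SIZE):
    """Return the set of label models touched and the fraction of the alphabet covered."""
    touched = set()
    for word in training_words:
        touched.update(baseforms[word])
    return touched, len(touched) / alphabet_size

touched, fraction = label_coverage(["boy", "for"], word_baseforms)
print(f"{len(touched)} label models exercised ({fraction:.0%} of the alphabet)")
# Note how "boys" would add only those labels not already trained from "boy".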

Employing the label-based models is of further significance with regard to adding words to the vocabulary. To add a word to the vocabulary, all that is required is that the sequence of labels be determined -- as by the utterance of the new word. In that the label-based models (including probabilities for a given speaker) have previously been entered in storage, the only data required for the new word is the order of labels.
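In code, adding a word therefore amounts to storing one more sequence of label identifiers and reusing the shared, already-trained label models. The sketch below is illustrative; the function names and label ids are hypothetical.

# Adding a word: only its standard label sequence is stored; the probabilities
# come from the shared label models, which are left untouched.
vocabulary = {}                  # word -> list of standard label ids
shared_label_models = {}         # label id -> trained elementary model (200 entries when populated)

def add_word(word, label_sequence):
    """Register a new word given the standard label string from one utterance of it."""
    vocabulary[word] = list(label_sequence)

def word_model(word):
    """Assemble the word's Markov model from the shared, already-trained label models."""
    return [shared_label_models[label] for label in vocabulary[word]]

add_word("thanks", [175, 144, 95, 95, 61, 61, 128, 176])   # hypothetical label ids
# word_model("thanks") would now reuse whatever training the 200 shared models already carry.
print(vocabulary)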

The advantage of this model is that it is particularly simple and can easily be generated. Some more details will be given in the following portion of this description.

DETAILS OF AN EMBODIMENT

MODEL GENERATION AND RECOGNITION PROCESS OVERVIEW

FIG. 4 is an overview of the speech recognition system, including model generation according to the invention and actual recognition.

Input speech is converted to label strings in the acoustic processor (1), using a previously generated standard label alphabet (2). In an initial step (3), a Markov model is generated for each word, using the string of standard labels caused by the initial single utterance of each word and previously defined elementary feneme Markov models (4). The label-based Markov models are intermediately stored (5).
Thereafter, in a training step (6), several utterances of words (or other acoustic input) by a subsequent speaker are matched against the label-based models to generate the statistics relating to the probability values for transitions and personalized label outputs for each model. In an actual recognition operation, strings of personalized labels resulting from utterances to be recognized are matched against the statistical label-based models of words, and identifiers for the word or words having the highest probability of producing the string of personalized labels are furnished at the output.

LABEL VOCABULARY GENERATION AND CONVERSION OF SPEECH TO LABEL STRINGS

The procedure for generating an alphabet of labels and the actual conversion of speech into label strings will now be described with reference to FIGS. 5 and 6 (though descriptions are also available in the literature, e.g. the above mentioned Nadas et al article).

For generating the standard labels, which typically represent prototype vectors of sound types (or more specifically spectral parameters) of speech, a speaker talks for about five minutes to obtain a speech sample (box 11 in FIG. 5). In an acoustic processor (of which some details are discussed in connection with FIG. 6) 30,000 vectors of speech parameters are obtained (box 12), each for one 10-millisecond interval of the sample speech. These vectors are then processed in an analysis or vector quantization operation to group them in ca. 200 clusters, each cluster containing closely similar vectors (box 13). Such procedures are already disclosed in the literature, e.g., in an article by R.M. Gray, "Vector Quantization", IEEE ASSP Magazine, April 1984, pp. 4-29.

For each of the clusters, one prototype vector is selected, and the resulting 200 prototype vectors are then stored for later reference (box 14). Each such vector represents one acoustic element or label. A typical label alphabet is shown in TABLE 1.
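The clustering and prototype selection of boxes 12 through 14 can be sketched with an ordinary k-means pass over the centisecond spectral vectors. The Python sketch below is an illustration only; it is not the specific bootstrapping or clustering procedure of the Nadas et al article.

import numpy as np

def build_label_alphabet(vectors, n_labels=200, n_iter=20, seed=0):
    """Cluster 10-ms spectral vectors into n_labels groups and return one
    prototype vector per label (a k-means sketch of boxes 12-14 in FIG. 5)."""
    rng = np.random.default_rng(seed)
    vectors = np.asarray(vectors, dtype=float)            # shape (n_frames, n_bands)
    prototypes = vectors[rng.choice(len(vectors), n_labels, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every vector to every prototype.
        d2 = ((vectors ** 2).sum(1)[:, None]
              - 2.0 * vectors @ prototypes.T
              + (prototypes ** 2).sum(1)[None, :])
        assignment = d2.argmin(axis=1)
        # Recompute each prototype as the mean of its cluster.
        for k in range(n_labels):
            members = vectors[assignment == k]
            if len(members):
                prototypes[k] = members.mean(axis=0)
    return prototypes                                      # one prototype per label id

# Example with synthetic data: ~30,000 frames of 20 spectral-band energies.
frames = np.random.default_rng(1).random((30_000, 20))
label_prototypes = build_label_alphabet(frames, n_labels=200, n_iter=5)
print(label_prototypes.shape)                              # (200, 20)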

FIG. 6 is a block diagram of the procedure for acoustic processing of speech to obtain label strings for utterances. The speech from a microphone (21) is converted to digital representation by an A/D converter (22). Windows of 20 msec duration are then extracted from the digital representation (23), and the window is moved in steps of 10 msec (to have some overlap). For each of the windows, a spectral analysis is made in a Fast Fourier Transform (FFT) to obtain, for each interval representing 10 msec of speech, a vector whose parameters are the energy values for a number of spectral bands (box 24 in FIG. 5).
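The windowing and band-energy computation of boxes 23 and 24 can be sketched as follows; the sampling rate, window function, and equal-width band layout are assumptions made for illustration, not details taken from this disclosure.

import numpy as np

def band_energy_vectors(samples, sample_rate=16_000, n_bands=20):
    """Cut the signal into 20 ms windows every 10 ms, FFT each window, and
    return one vector of spectral-band energies per 10 ms interval."""
    win = int(0.020 * sample_rate)          # 20 ms analysis window
    hop = int(0.010 * sample_rate)          # moved in 10 ms steps (overlap)
    vectors = []
    for start in range(0, len(samples) - win + 1, hop):
        frame = samples[start:start + win] * np.hanning(win)
        spectrum = np.abs(np.fft.rfft(frame)) ** 2         # power spectrum
        bands = np.array_split(spectrum, n_bands)          # crude equal-width bands
        vectors.append([band.sum() for band in bands])
    return np.array(vectors)                # shape (n_intervals, n_bands)

# Example: one second of synthetic audio -> roughly 99 centisecond vectors.
signal = np.random.default_rng(0).standard_normal(16_000)
print(band_energy_vectors(signal).shape)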

Each current vector thus obtained is compared to the set of prototype vectors (14) that were generated in a preliminary procedure described above. In this comparison step (25), the prototype vector which is closest to the current vector is determined, and the label or identifier for this prototype is then issued at the output (26).
Thus, there will appear at the output one label every 10 msec, and the speech signal is available in coded form, the coding alphabet being the 200 feneme labels. In this regard, it should be noted that the invention need not be limited to labels generated at periodic intervals, it being contemplated only that each label corresponds to a respective time interval.
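The comparison and labeling of boxes 25 and 26 then reduce to a nearest-prototype lookup for every centisecond vector, as in the following sketch; the synthetic data at the bottom merely stands in for the outputs of the preceding steps.

import numpy as np

def label_string(vectors, prototypes):
    """Map each centisecond spectral vector to the id of its closest prototype,
    yielding the feneme label string for the utterance."""
    vectors = np.asarray(vectors, dtype=float)
    prototypes = np.asarray(prototypes, dtype=float)
    d2 = ((vectors ** 2).sum(1)[:, None]
          - 2.0 * vectors @ prototypes.T
          + (prototypes ** 2).sum(1)[None, :])        # squared distances, (frames, labels)
    return d2.argmin(axis=1).tolist()                 # one label id per 10 ms interval

# Tiny synthetic example; in practice `vectors` would come from the band-energy
# front end and `prototypes` from the clustering step sketched above.
rng = np.random.default_rng(2)
protos = rng.random((200, 20))
frames = rng.random((50, 20))
print(label_string(frames, protos)[:10])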

GENERATION OF WORD MODELS USING LABEL STRINGS

For the generation of the word models, each word that is required in the vocabulary is uttered once and converted to a standard label string as explained above. A schematic representation of a label string of standard labels for one word is shown in FIG. 1, consisting of the sequence v1, v2, ... vm which appeared at the output of the acoustic processor when the word is spoken. This string is now taken as the standard label baseform of the respective word.


To produce a basic model of the word which takes into account the variations in pronunciation of the word, each of the fenemes vi of the baseform string is replaced by an elementary Markov model M(vi) for that feneme.

The elementary Markov model can be of extremely simple form as shown in FIG. 2. It consists of an initial state Si, a final state Sf, a transition T1 leading from state Si to state Sf and representing one personalized label output, a transition T2 leaving, and returning to, the initial state Si, also representing one personalized label output, and a null transition T0 from the initial to the final state, to which no personalized label output is assigned. This elementary model accounts for (a) a single appearance of the personalized label by taking only transition T1; (b) several appearances of a personalized label by taking transition T2 several times; and (c) the representation of a missing personalized label by taking the null transition T0.
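In code, one such elementary model is nothing more than three arc probabilities and two output distributions over the personalized label alphabet. The Python dataclass below is a minimal sketch of that structure; the field names are illustrative, and the initial arc values simply mirror the example figures given later in this description.

from dataclasses import dataclass, field
from typing import List

ALPHABET_SIZE = 200                     # size of the personalized label alphabet

@dataclass
class FenemeModel:
    """Elementary Markov model of FIG. 2 for one standard label (illustrative)."""
    a1: float = 0.5                     # T1: Si -> Sf, produces one personalized label
    a2: float = 0.4                     # T2: Si -> Si self-loop, produces one personalized label
    a0: float = 0.1                     # T0: Si -> Sf null transition, no label output
    b1: List[float] = field(default_factory=lambda: [1.0 / ALPHABET_SIZE] * ALPHABET_SIZE)
    b2: List[float] = field(default_factory=lambda: [1.0 / ALPHABET_SIZE] * ALPHABET_SIZE)
    # b1[k] / b2[k]: probability that personalized label k is produced on T1 / T2.

# One such model per standard label; the flat output distributions stand in for
# the untrained state and are sharpened during the training period.
standard_label_models = {label_id: FenemeModel() for label_id in range(ALPHABET_SIZE)}
print(len(standard_label_models), standard_label_models[0].a1)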

The same elementary model as shown in FIG. 2 can be selected for all 200 different standard labels.
Of course, more complicated elementary models can be used, and different models can be assigned to the various standard labels, but for the present embodiment the model of FIG. 2 is used for all standard labels.

The complete baseform Markov model of the whole word whose label-based baseform is presented in FIG. 1 is illustrated in FIG. 3. It consists of a simple concatenation of the elementary models M(vi) for all standard labels of the word, the final state of each elementary model being joined with the initial state of the following elementary model. Thus, the complete baseform Markov model of a word that caused a string of m standard labels will comprise m elementary Markov models.
The typical number of standard labels per word (and thus the number of states per word model) is about 30 to 80. Labels for four words are illustrated in TABLE 2. The word "thanks", for example, begins with the label PX5 and ends with PX2.
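Concatenation itself is trivial once the elementary models are shared: the word baseform is the list of models selected by the word's standard label string, the joining of final and initial states being implicit in the list order. A minimal sketch with hypothetical label ids:

def word_baseform(standard_label_string, label_models):
    """Build the baseform Markov model of FIG. 3 as the concatenation of the
    elementary models for the word's standard labels; joining the final state
    of model i to the initial state of model i+1 is implicit in list order."""
    return [label_models[label] for label in standard_label_string]

# Stand-in for the shared pool of 200 elementary models (e.g. the FenemeModel
# instances from the previous sketch).
label_models = {label_id: f"M({label_id})" for label_id in range(200)}

# Hypothetical label string from one utterance of a word: m labels -> m models.
baseform = word_baseform([17, 17, 42, 42, 42, 105, 9], label_models)
print(len(baseform), baseform[:3])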

To generate a baseform Markov model for a word means to define the different states and transitions and their interrelation for the respective word. To be useful for speech recognition, the baseform model for the word must be made a statistical model by training it with several utterances, i.e. by accumulating statistics for each transition in the model.

Since the same elementary model appears in several different words, it is not necessary to have several utterances of each word in order to train the models.

Such training can be made by the so-called "Forward-Backward algorithm", which has been described in the literature, e.g., in the already mentioned paper by Jelinek.

As a result of the training, a probability value is assigned to each transition in the model. For example, for one specific state there may be a probability value of 0.5 for T1, 0.4 for T2, and 0.1 for T0. Furthermore, for each of the non-null transitions T1 and T2, there is given a list of probabilities indicating, for each of the 200 personalized labels, what the probability of its appearance is when the respective transition is taken. The whole statistical model of a word takes the form of a list or table as shown in TABLE 3. Each elementary model or standard label is represented by one section in the table, and each transition corresponds to a line or vector whose elements are the probabilities of the individual personalized labels (and in addition the overall probability that the respective transition is taken).

""',i;
.~. ..~.

For actual storage of all word models, the following is sufficient: one vector is stored for each word, the vector components being the identifiers of the elementary Markov models of which the word consists; and one statistical Markov model including the probability values stored for each of the 200 standard labels of the alphabet. Thus, the statistical word model shown in TABLE 3 need not actually be stored as such in memory; it can be stored in a distributed form with data being combined as needed.
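That distributed storage scheme is easy to picture in code: one vector of label identifiers per word plus one statistics record per standard label, combined into the TABLE 3 view only when needed. The layout and values below are illustrative, not the disclosure's actual data format.

# Distributed storage: per-word label-id vectors plus one shared statistics
# record per standard label; the TABLE 3 view is assembled only on demand.
word_baseforms = {                       # word -> vector of standard label ids
    "thanks": [175, 144, 95, 61, 61, 128],        # hypothetical ids
    "letter": [23, 23, 88, 88, 61, 140, 140],
}

label_statistics = {                     # label id -> trained probabilities
    label_id: {
        "arc": {"T1": 0.5, "T2": 0.4, "T0": 0.1},             # example values
        "out": {"T1": [1.0 / 200] * 200, "T2": [1.0 / 200] * 200},
    }
    for label_id in range(200)
}

def statistical_word_model(word):
    """Combine the stored pieces into the TABLE 3 style model for one word."""
    return [label_statistics[label_id] for label_id in word_baseforms[word]]

model = statistical_word_model("thanks")
print(len(model), model[0]["arc"])       # one section per standard label of the word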

RECOGNITION PROCESS

For actual speech recognition, the utterances are converted to strings of personalized labels as was explained in the initial section. These personalized label strings are then matched against each word model to obtain the probability that a string of personalized labels was caused by utterance of the word represented by that model.
Specifically, matching is performed based on match scores for respective words, wherein each match score represents a "forward probability" as discussed in the above-identified Jelinek article.
The recognition process employing label-based models is analogous to known processes which employ phone-based models. The word or words with the highest probability is (are) selected as output.

CONCLUSION

CHANGING OF LABEL ALPHABET

It should be noted that the standard label alphabet used for initial word model generation and the personalized label sets used for training and for recognition may be, but typically are not, all identical. However, although the probability values associated with labels generated during recognition may differ somewhat from personalized labels generated during training, the actual recognition result will generally be correct.
This is possible because of the adaptive nature of the Markov models.

It should, however, also be noted that inordinately large changes in respective training and recognition alphabets may affect accuracy. Also, a subsequent speaker whose speech is vastly different from the first speaker will train the models with probabilities that may result in limits on accuracy.

The advantages of the label-based Markov models introduced by the present invention can be summarized as follows:

(a) these models are an improvement on phone-based models because they represent the word at a much more detailed level;

(b) unlike word templates using DP matching, label-based word models are trainable using the Forward-Backward algorithm;

(c) the number of parameters depends on the size of the feneme alphabet and not on vocabulary size -- storage requirements growing slowly with an increase in vocabulary size;

(d) the recognition procedure using label-based models is computationally much faster than Dynamic Programming matching and the usage of continuous parameter word templates; and

(e) the modelling of words is done automatically.

In addition to the variations and modifications to applicant's disclosed apparatus which have been suggested, many other variations and modifications will be apparent to those skilled in the art, and accordingly, the scope of applicant's invention is not to be construed to be limited to the particular embodiments shown or suggested.



For example, although shown preferably as a single acoustic processor, the processing functions may be distributed over a plurality of processors including, for example: a first processor initially determining the alphabet of standard labels; a second processor for producing a sequence of standard labels for each word initially spoken once; a third processor for selecting an alphabet of personalized labels; and a fourth processor for converting training inputs or words into a string of personalized labels.


TABLE 1

THE TWO LETTERS ROUGHLY REPRESENT THE SOUND OF THE ELEMENT.
TWO DIGITS ARE ASSOCIATED WITH VOWELS:
FIRST: STRESS OF SOUND
SECOND: CURRENT IDENTIFICATION NUMBER
ONE DIGIT ONLY IS ASSOCIATED WITH CONSONANTS:
SINGLE DIGIT: CURRENT IDENTIFICATION NUMBER
001 AA11 029 BX2- 057 EH02 148 TY5- 176 ".X11 002 AA12 030 BV3- 058 E'rr11 149 TX6- 177 XX12 003 AA13 031 BX4- 059 EH12 150 UH01 178 Y.Y13 004 AA14 n32 BX5- 06n EH13 151 U~0? 179 YV14 005 AA15 033 BX6- 061 E~14 152 UH11 180 ~Y15 006 AE11 034 BX7- 062 E~.r15 153 U~12 181 ~X16 007 A7~12 035 ~X8- 126 RX1- 154 U~13 182 X';17 008 AE13 036 RX9- 127 S~1- 155 UH14 183 XX18 009 AE14 037 DH1- 128 S~r~2- 156 UU11 184 ~X19 010 A~15 038 D~2- 12~ SYl- 157 ~IT~12 185 X,Y?,-011 AT~T11 Q39 DQ1- 130 SX2- 158 Tl',~1 186 X,'?,0 01? A~l12 nço D~2- 131 SX3- 159 ITXr,? 18' XX?1 013 A~113 041 DQ3- 13? sxa- 16~ UX11 188 ~`'?2 014 AX11 042 DQ4- 133 S'.C5- 161 UX12 189 XY23 015 AY.12 043 DX1- 134 SX6- 162 'l''13 190 V"24 n16 AX13 n44 ~X?- 135 SX7- lh3 VX1- 191 y~v3_ 017 AX'14 045 EE01 136 TH1- 164 t7X2- 192 .~X4-018 AX15 046 EE02 137 T~?- 165 ~7',~3- i93 XV5-019 ~,Y16 047 EE11 138 T~3- 166 '~'Xi'- 194 ';X6-020 AX17 048 EE12 139 TH4- 167 ~ 195 YY7-0?1 BQ1- 049 EE13 140 T~5- l68 ~7Y2- 196 ~v~_ 022 BQ2- 050 EE14 141 TQ1- 169 ~X3- 197 ~Y9-023 ~3- 051 EE15 142 T02- 170 ~7X~- 198 ZX~1-024 ~Q4- 052 EE16 143 TX3- 171 r.?Y.;- 199 Z,X2-075 RX1- 053 EE17 144 T."1- 172 1-7X6- 200 z,v,3_ 026 BX10 054 EE18 145 TX?- 173 r,~7y7_ 028 ~X12 n56 EH0i 147 TX4- 175 XY10 ~'~983-135 ~23~5~3 .SAMPLE L~REL STRI2~7C.S E7~ OR FOUR 1~70P~DS
~THAN'~S - FOR - ~OUP - LETTER) ' th~nks ' PX5 TX4 KX5 ~2_ T~Y4 K~5 KX2 7.~ 2 THl D03KQl UX13 AX17 EH12 EHl2 EH12 E?H12 EH12 E~12 EHl2 EH12 EH12 E~'I12 EH15 EH15 EH15 E?.~15 E.H15 AE12 AE12 AE12 ~,XGl EI12 EI12 ETl2 EI12 B~2 ~TVl MX6 HVl NG2 G`.'.5 NG2 GX5 DXl 13X10 E~'3 GX4 BX10 pv;
PX4 Tl'12 PX4 HX.l SX4 S~5 SX5 SX5 SX5_ SX~
SY.2 SV5 SX2 S"~5 SY~5 S~5 SY.5 SX4 SY~4_ SX2 ~Hl TJl2 DQ3 R.'4_ TY6 X,~12 PX2 ' for ' p~r,5 TX4_ BX8 r~'2_ D~)3 rnY2_ XX-J_ TY2_ TH?_ TH2_ FX2_ TH2_ HX1 ,Y~ TXr9_ ~,~X4 _ T~?X1_ I~;7~4_ WXd_ 7,.X1_ I~l"4_ LXl_ LXl_ ~' 1 7,',~;1 T~X~l ~Xl_ TXl_ L"l_ 7.X1_ L~l_ TJX1 LX1 TJXr1 _ r.xl - TX1 - TJX1 Al~713 Ar`713 Al713 A~713 AW13 Al^713 Al~.713 AW13 UY.12 A~711 A~?ll AIJll AI~Jll A~.711 AWll A~Jll ER14 'r'P14 ER14 RXl_ RXl RXl_ R~l RXl MX4 M~4 MX4_ MX4_ D(~4_ BX7_ Dpl_ F~X7_ BX7_ TX5 XX14 BX9 X~C3 X`'7 _ "~our ' GX4 GXl G,Xl ~7Xl EEl') .JXl_ E'-Ell ,E01 I~0 rXPl IXOl G5~1_ U~S12 IJX 7 UX12 Url14 [;lI14 UH12 A~!ll Al~713 A~13 AW13 A1~l13 AI~J13 rW13 Al~713 Ah713 AI~J13 Ah7i3 A~.711 Ar~lll AI~Jll AI~Jll Al.7l 1 AI~Jll A1111 Al'll Arl711 ER14 ERl d ER14 ER14 ER14 ER14 EP14 ERl 4 :RVl RXl RXl_ .`-IX4_ MX4_ r~x4_ MX4 r~`'4 Dnl_ ~IX2 ~; 7 DQl B`'.7 ~'X5_ 30 X~l 4 X~l ' letter ' B"~3 BX3 BX8 BX2 ~.7'.'.5 17X2 ~ ' r,JX'6 1i7X6 r~ 5 l~i7X6 1~,7X5 ~1X4 1^,7X3 ~JX3 OUll UHi4 AX15 AX15 AX14 AX14 EH12 E)~12 EH12 EH12 EH,12 EH12 EH12 EHl') EH12 A.;'14 EH02 UH02 EEl'02 5H2 E_n2 TQl _ IX05 TQl _ IXGl P~)l ER02 EROl r.R13 F:R13 r'P13 EP~13 ERIl ERll EP13 ER13 ER13 EsR13 ER13 RXl _ R.~'l _ '?Xl ERn2 RX'l EPQ2 rA.x4 MX4 DQl DQl_ DQl_ ~)01_ DC)l_ r)Ql DQl_ DQl_ D01 D~l_ D~l_ BX7 TX5_ TX5 TX5 TX5 XX2. XX14 TX4_ v~8~

6~

TARI.E 3 STATISTICAJ. MAP.KOV MOnE~ OF 1~7ORD
(FF.NEME-7~ASED ~OI~ErJ) FENEME ST~TE TRANS AP~C OUTPUT PROBA~ILITIES
~soDEL ITION PROB.I,001 ~002 ...... L200 .
Ml Sl Tll 0.50.03 o.on 0.13 T12 0.40.02 0.01 0.00 T10 0.1 ~1?. S2 T21 0.73 T22 0.2?
T20 0.05 - _ _ M13 S3 T31 Q.5 T32 0.45 T3n o . ns .
MN SN TNl 0.6 TM2 0.3 TN0 0.1 - - _ a~

Claims (16)

The embodiments of the invention in which an exclusive property or privilege is claimed are defined as follows:
1. A speech recognition system comprising:

first processor means for generating an alphabet of standard labels in response to a first speech input, each standard label representing a sound type assignable to an interval of time;

second processor means for producing a respective sequence of standard labels from the alphabet in response to the uttering of each word in a vocabulary of words;

third processor means for selecting a set of personalized labels in response to a second speech input, each personalized label representing a sound type assigned to an interval of time; and means for forming a respective probabilistic model for each standard label, said model forming means for each probabilistic model including means for associating with each model (a) a plurality of states, (b) at least one transition extending from a state to a state, (c) a transition probability for each transition, and (d), for at least one transition, a plurality of output probabilities wherein each output probability at a given transition in the model of a given standard label represents the likelihood of a respective personalized label being produced at the given transition.
2. A system as in claim 1 further comprising:

means for setting the output probabilities and transition probabilities of each standard label model in response to a training speaker speaking training utterances.
3. A system as in claim 2 wherein all the transition probabilities and the output probabilities for each transition in each respective probabilistic model are set in response to the spoken input of the training speaker, the system further comprising:

fourth processor means for converting spoken input into a string of personalized labels from the set thereof, wherein each personalized label has a corresponding output probability at each of at least some of the transitions in each probabilistic model; and means for determining which words represented by concatenated probabilistic models have the greatest probability of having produced the string of personalized labels.
4. A system as in claim 3 wherein said first processor means, said second processor means, said third processor means, and said fourth processor means comprise a single acoustic processor.
5. A system as in claim 3 including means for storing, for each word in the vocabulary, the sequence of standard labels corresponding thereto.
6. A system as in claim 5 further comprising means for storing, for each standard label, the respective transition probabilities and personalized label output probabilities associated therewith; and means for combining, for each word in the vocabulary, the stored model probabilities associated with each successive standard label in the respective concatenation which forms said each word, to determine the likelihood that said given word produced the string of personalized labels.
7. A method for modelling words in a speech recognition system comprising the steps of:

producing, for each word in a vocabulary, a sequence of concatenated standard labels in response to a speaker uttering said each word into an acoustic processor wherein each standard label corresponds to a time interval of speech and each label in the sequence is from a fixed alphabet of standard labels;

storing, for each word, the produced sequence corresponding thereto;

generating a set of personalized labels in response to a training speaker uttering preselected acoustic inputs into an acoustic processor, the personalized labels of the set representing the sound types corresponding to the training speaker;

representing each standard label by a probabilistic model, the model for each standard label having associated therewith (a) a plurality of states, (b) at least one transition, (c) a transition probability for each transition, and (d) for each transition, an output probability for each of at least some of the personalized labels, each output probability at a given transition corresponding to the likelihood of the standard label model producing a respective personalized label at said given transition.
8. A method as in claim 7 comprising the further step of:

setting the label output probabilities and transition probabilities of each given label model in response to the training speaker speaking training utterances.
9. A method as in claim 7 comprising the further step of:

adding a word to the vocabulary including the steps of:
determining the sequence of standard labels for the word to be added by speaking the word once to an acoustic processor and storing the sequence therefor;

replacing each standard label in the added word with the probabilistic model corresponding thereto thereby forming a word model for the added word comprised of the concatenated models for the standard labels, the probabilities of each label-based standard label model being applicable to the word model for the added word.
10. A method of modelling words in a speech recognition system comprising the steps of:

entering a first speech input, corresponding to words in a vocabulary, into an acoustic processor which converts each spoken word into a sequence of standard labels, where each standard label corresponds to a sound type assignable to an interval of time;

representing each standard label as a probabilistic model which has a plurality of states, at least one transition from a state to a state, and at least one settable output probability at some transitions;

entering selected acoustic inputs into an acoustic processor which converts the selected acoustic inputs into personalized labels, each personalized label corresponding to a sound type assigned to an interval of time; and setting each output probability as the probability of the standard label represented by a given model producing a particular personalized label at a given transition in the given model.
11. A method as in claim 10 wherein each standard label corresponds to a sound type assignable to a fixed period of time.
12. In a speech recognition system having an acoustic processor which produces a string of periodically generated labels in response to acoustic input where the produced labels are from a set determined by the acoustic input, apparatus for modelling words comprising:

means for storing each word as a sequence of periodic standard labels produced in response to each word being initially uttered and processed by the acoustic processor; and means for representing each standard label as a Markov model having (a) a plurality of states, (b) at least one transition extending from a state to a state, (c) a transition probability for each transition, and (d) output probabilities at the transitions, each output probability at a given transition corresponding to the likelihood of a standard label model producing a personalized label at the given transition, the personalized labels being from a set thereof determined by the acoustic processor in response to acoustic inputs entered after the initial utterance of each word.
13. Apparatus as in claim 12 further comprising:

means for concatenating models of standard labels to form a label-based Markov word model for each sequence of standard labels representing a word.
14. Apparatus as in claim 13 further comprising:

means for matching vocabulary words formed as Markov word models with an input string of personalized labels generated by the acoustic processor in response to speech inputs to be recognized, said matching means determining the probability of each of at least some of the words in the vocabulary producing the input string.
15. Apparatus as in claim 12 wherein each standard label model has (a) two states, (b) a first transition between the two states whereat each personalized label has an output probability associated therewith, (c) a second transition extending from the first state back to itself whereat each personalized label model has an output probability associated therewith, and (d) a null transition between the two states whereat no output occurs.
16. A speech recognition system having a speech input processor and a recognition processor, comprising:

(a) means to control the input processor to generate label strings representative of speech input words;

(b) means to replace each label in each of a plurality of sample word label strings by an elementary Markov model;

(c) means to store a vocabulary of label-based Markov models for words;

(d) means to train the Markov models during a training period with training utterances of words; and (e) recognition processing means to compare label strings generated by the input processor for actual speech input with label-based Markov models of the stored vocabulary for recognition;

wherein recognition is carried out on the basis of label-based Markov models, wherein recognition storage requirements are related to the number of labels, and wherein storage requirements for actual words are limited to label-based Markov models of vocabulary words.
CA000496161A 1985-02-01 1985-11-26 Feneme-based markov models for words Expired CA1236578A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US69717485A 1985-02-01 1985-02-01
US697,174 1985-02-01

Publications (1)

Publication Number Publication Date
CA1236578A true CA1236578A (en) 1988-05-10

Family

ID=24800109

Family Applications (1)

Application Number Title Priority Date Filing Date
CA000496161A Expired CA1236578A (en) 1985-02-01 1985-11-26 Feneme-based markov models for words

Country Status (1)

Country Link
CA (1) CA1236578A (en)

Similar Documents

Publication Publication Date Title
US5165007A (en) Feneme-based Markov models for words
Bahl et al. Acoustic Markov models used in the Tangora speech recognition system
Fry Theoretical aspects of mechanical speech recognition
Ghai et al. Literature review on automatic speech recognition
US4994983A (en) Automatic speech recognition system using seed templates
CN111210807B (en) Speech recognition model training method, system, mobile terminal and storage medium
JPH06110493A (en) Method for constituting speech model and speech recognition device
US20070294082A1 (en) Voice Recognition Method and System Adapted to the Characteristics of Non-Native Speakers
Chadha et al. Current Challenges and Application of Speech Recognition Process using Natural Language Processing: A Survey
EP1398758B1 (en) Method and apparatus for generating decision tree questions for speech processing
EP0562138A1 (en) Method and apparatus for the automatic generation of Markov models of new words to be added to a speech recognition vocabulary
EP0305215B1 (en) Speech recognition
US20140142925A1 (en) Self-organizing unit recognition for speech and other data series
Haraty et al. CASRA+: A colloquial Arabic speech recognition application
Lee et al. Cantonese syllable recognition using neural networks
Manjutha et al. Automated speech recognition system—A literature review
CA1236578A (en) Feneme-based markov models for words
EP0508225A2 (en) Computer system for speech recognition
EP0238693B1 (en) Speech recognition system and method using statistical models for words
Chiba et al. A speaker-independent word-recognition system using multiple classification functions
Ananthakrishna et al. Effect of time-domain windowing on isolated speech recognition system performance
JPH0372991B2 (en)
Thalengala et al. Performance Analysis of Isolated Speech Recognition System Using Kannada Speech Database.
Aibar et al. Multiple template modeling of sublexical units
Neuburg Dynamic frequency warping, the dual of dynamic time warping

Legal Events

Date Code Title Description
MKEX Expiry