CA2023424C - Speech-recognition circuitry employing nonlinear processing, speech element modeling and phoneme estimation - Google Patents
Speech-recognition circuitry employing nonlinear processing, speech element modeling and phoneme estimation Download PDFInfo
- Publication number
- CA2023424C CA2023424A CA002023424A
- Authority
- CA
- Canada
- Prior art keywords
- speech
- vector
- elements
- data
- signal
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Lifetime
Links
- 238000012545 processing Methods 0.000 title claims abstract description 40
- 230000002829 reductive effect Effects 0.000 claims abstract description 20
- 238000000034 method Methods 0.000 claims abstract description 17
- 239000013598 vectors Substances 0.000 claims description 154
- 238000012544 monitoring process Methods 0.000 claims 6
- 230000008569 process Effects 0.000 abstract description 8
- 239000011159 matrix material Substances 0.000 description 32
- 238000011161 development Methods 0.000 description 22
- 230000003044 adaptive effect Effects 0.000 description 18
- 238000010586 diagram Methods 0.000 description 16
- 238000004364 calculation method Methods 0.000 description 14
- 238000012935 Averaging Methods 0.000 description 10
- 230000015654 memory Effects 0.000 description 7
- 238000010606 normalization Methods 0.000 description 7
- 230000008707 rearrangement Effects 0.000 description 6
- 238000001228 spectrum Methods 0.000 description 6
- 230000009467 reduction Effects 0.000 description 5
- 239000000470 constituent Substances 0.000 description 4
- 238000013507 mapping Methods 0.000 description 4
- 230000036961 partial effect Effects 0.000 description 4
- 230000000717 retained effect Effects 0.000 description 4
- 239000000872 buffer Substances 0.000 description 3
- 230000000694 effects Effects 0.000 description 2
- 230000006870 function Effects 0.000 description 2
- 230000006872 improvement Effects 0.000 description 2
- 238000002372 labelling Methods 0.000 description 2
- 230000000670 limiting effect Effects 0.000 description 2
- 238000012360 testing method Methods 0.000 description 2
- 238000013459 approach Methods 0.000 description 1
- 230000008901 benefit Effects 0.000 description 1
- 238000004422 calculation algorithm Methods 0.000 description 1
- 230000008859 change Effects 0.000 description 1
- 238000007596 consolidation process Methods 0.000 description 1
- 230000003247 decreasing effect Effects 0.000 description 1
- 238000013461 design Methods 0.000 description 1
- 238000009826 distribution Methods 0.000 description 1
- 230000003993 interaction Effects 0.000 description 1
- 230000007246 mechanism Effects 0.000 description 1
- 238000003909 pattern recognition Methods 0.000 description 1
- 238000005070 sampling Methods 0.000 description 1
- 238000005204 segregation Methods 0.000 description 1
- 230000035945 sensitivity Effects 0.000 description 1
- 238000012549 training Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L2015/025—Phonemes, fenemes or fenones being the recognition units
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Multimedia (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Acoustics & Sound (AREA)
- Computational Linguistics (AREA)
- Complex Calculations (AREA)
- Testing, Inspecting, Measuring Of Stereoscopic Televisions And Televisions (AREA)
- Image Processing (AREA)
- Compression, Expansion, Code Conversion, And Decoders (AREA)
- Machine Translation (AREA)
- Measurement Of Velocity Or Position Using Acoustic Or Ultrasonic Waves (AREA)
Abstract
A phoneme estimator in a speech-recognition system includes energy detect circuitry for detecting the segments of a speech signal that should be analyzed for phoneme content.
Speech-element processors then process the speech signal segments, calculating nonlinear representations of the segments. The nonlinear representation data is applied to speech-element modeling circuitry which reduces the data through speech element specific modeling. The reduced data are then subjected to further nonlinear processing. The results of the further nonlinear processing are again applied to speech-element modeling circuitry, producing phoneme isotype estimates. The phoneme isotype estimates are rearranged and consolidated, that is, the estimates are uniformly labeled and duplicated estimates are consolidated, forming estimates of words or phrases containing minimal numbers of phonemes. The estimates may then be compared with stored words or phrases to determine what was spoken.
Description
SPEECH-RECOGNITION CIRCUITRY EMPLOYING NONLINEAR
PROCESSING, SPEECH ELEMENT MODELING AND PHONEME ESTIMATION
FIELD OF INVENTION
The invention is directed to speech recognition, and more particularly to those parts of speech recognition systems used in recognizing patterns in data-reduced versions of the speech. It is an improvement to the circuit disclosed in U.S.
Patent 5,027,408 entitled "Speech-Recognition Circuitry Employing Phoneme Estimation" issued 25 June 1991 in the names of the same inventors.
BACKGROUND OF THE INVENTION
Most systems for recognizing speech employ some means of reducing the data in raw speech. Thus the speech is reduced to representations that include less than all of the data that would be included in a straight digitization of the speech signal. However, such representations must contain most if not all of the data needed to identify the meaning intended by the speaker.
In development, or "training", of the speech-recognition system, the task is to identify the patterns in the reduced-data representations that are characteristic of speech elements such as words or phrases. The sounds made by different speakers uttering the same words or phrases are different, and thus the speech-recognition system must assign the same words or phrases to patterns derived from these different sounds. There are other sources of ambiguity in the patterns, such as noise and the inaccuracy of the modeling process, which may also alter the speech signal representations. Accordingly, routines are used to assign likelihoods to various mathematical combinations of the reduced-data representations of the speech, and various hypotheses are tested, to determine which one of a number of possible speech elements is most likely the one currently being spoken, and thus represented by a particular data pattern.
The processes for performing these operations tend to be computation-intensive. The likelihoods must be determined for various data combinations and large numbers of speech elements. Thus the limitations on computation imposed by requirements of, for instance, real-time operation of the system restrict the sensitivity of the pattern-recognition algorithms that can be employed.
It is accordingly an object of the present invention to increase the computational time that can be dedicated to recognition of a given pattern but to do so without increasing the time required for the total speech-recognition process.
It is a further object of the invention to process together signal segments corresponding to a longer time period, that is, use a larger signal "window" without substantially increasing the computational burden and without decreasing the resolution of the signal data.
SUMMARY OF THE INVENTION
The invention provides a method of identifying a speech element of interest in a speech signal, said method comprising the steps of: A. generating a first vector each of whose components represents a component of said speech element;
B. comparing said first vector with a first set of model vectors representing known speech elements and for each comparison deriving a value representing the degree of correlation with one of said model vectors, thereby generating a second vector each of whose components is one of said values;
C. calculating quantities proportional to products of certain of the components of said second vector and powers of certain of the components of said second vector to produce a third vector with components that are quantities proportional to the products and powers; and D. comparing said third vector with a second set of model vectors representing known speech elements and producing respective speech-element estimate signals that represent the likelihoods that the speech contains respective speech elements.
The invention also provides a speech-recognition device for identifying a speech element of interest in a speech signal, said device comprising: A. means for generating a first vector each of whose components represents a component of said speech element; B. first modeling means for comparing said first vector with a first set of model vectors representing known speech elements, for each comparison deriving a value representing the degree of correlation with one of said model vectors, and generating a second vector each of whose components is one of said values; C. means for producing a third vector, said means calculating quantities proportional to products of certain of the components of said second vector and powers of certain of the components of said second vector to produce a third vector with components that are quantities proportional to the products and powers; and D. second modeling means for comparing said third vector with a second set of model vectors representing known speech elements, the comparing means producing respective speech-element estimate signals that represent the likelihoods that the speech contains respective speech elements.
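Steps A through D amount to two modeling passes separated by a nonlinear expansion of the intermediate correlations. The following sketch illustrates that data flow only; the dimensions, the random "model vectors", the use of dot products as the correlation measure, and the choice of which products and powers to form are all assumptions introduced here for illustration, not the patented circuitry.

```python
import numpy as np

rng = np.random.default_rng(0)

# Step A: a first vector whose components represent the speech element.
first = rng.standard_normal(8)

# Step B: compare against a first set of model vectors; each correlation
# value becomes one component of a second vector.
model_set_1 = rng.standard_normal((5, 8))
second = model_set_1 @ first           # dot products as the correlation measure

# Step C: quantities proportional to products and powers of selected
# components of the second vector form a third vector.
third = np.concatenate([
    second,                            # linear terms
    second[:3] * second[1:4],          # selected pairwise products (illustrative)
    second ** 2,                       # second powers
])

# Step D: compare the third vector with a second set of model vectors to
# produce one likelihood-like estimate per known speech element.
model_set_2 = rng.standard_normal((4, third.size))
estimates = model_set_2 @ third
print(estimates)
```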
The foregoing and related objects are achieved in a speech-recognition system that includes a phoneme estimator which selectively reduces speech data by a first modeling operation, performs nonlinear data manipulations on the reduced data and selectively reduces the data by a second modeling operation. The reduced data are then further manipulated and re-arranged to produce phoneme estimates which are used to identify the words or phrases spoken.
In brief summary, the phoneme estimator monitors the energy of a data-reduced version of the input speech signal and selects for further processing all speech segments whose energy exceeds a certain threshold. Such signal segments typically represent voicing or unvoiced expiration within speech, and thus phonemes. The phoneme estimator then manipulates a further data-reduced representation of the signal segments through a series of speech modeling operations, nonlinear operations and further speech modeling operations to calculate which phoneme patterns the data most closely resemble.

The speech modeling is used to reduce the speech signal data by ignoring data which, through experience, are found to be relatively insignificant or redundant in terms of phoneme-pattern estimation. The more significant data are then manipulated using computation-intensive nonlinear operations, resulting in data patterns which are used to determine the likelihood of the intended phonemes more accurately. The time required for such computations is minimized by so reducing the data. The phoneme estimator also looks at the time between signal energy, or phoneme, detections in selecting the most likely phonemes. Considering the time between the phoneme detections, the estimator may concatenate what would otherwise be considered a series of distinct phonemes into a group of multi-phoneme patterns, for example, diphones. These multi-phoneme patterns often convey the intended meaning of the speech more clearly than the individual phonemes.
BRIEF DESCRIPTION OF THE DRAWINGS

For a more complete understanding of the features, advantages, and objects of the invention, reference should be made to the following detailed description and the accompanying drawings, in which:
Fig. 1 is a block diagram of a speech-recognition system employing the teachings of the present invention;
Fig. 2 is a block diagram depicting a phoneme estimator shown in Fig. 1;
Fig. 3 is a block diagram depicting calculation of an estimation of the signal power spectrum, shown as block 18 in Fig. 2;
Fig. 4 is a block diagram depicting calculation of a reduction of the power spectrum estimation, shown as block 20 in Fig. 2;
Fig. 5 is a block diagram of an energy detect processor, shown as block 22 in Fig. 2;
Fig. 6 is a block diagram depicting a receptive field processor, shown as block 24 in Fig. 2;
Fig. 7 is a block diagram depicting an adaptive normalizer, shown as block 26 in Fig. 2;
Figs. 8 and 9 taken together illustrate a receptive field nonlinear processor, shown as block 28 in Fig. 2;
Fig. 10 is a block diagram illustrating nonlinear processor-2, shown as block 30 in Fig. 2;
Fig. 11 is a block diagram illustrating a normalization processor and speech-element model-1 processor, shown as blocks 32 and 34 in Fig. 2;
Fig. 12 is a block diagram illustrating a concatenation of vectors into triples, shown as block 36 in Fig. 2;
Fig. 13 is a block diagram depicting nonlinear processor-3, shown as block 38 in Fig. 2;
Fig. 14 is a block diagram illustrating speech-element model-2 and calculation of the logarithm of the likelihood ratio, shown as blocks 40 and 42 in Fig. 2;
Fig. 15 illustrates phoneme isotype estimate rearrangement, shown as block 44 in Fig. 2;
Figs. 16, 17 and 18 together are a block diagram illustrating an estimator integrator, shown as block 46 in Fig. 2;
Fig. 19 illustrates the calculation of parameters used in the adaptive normalizer of Fig. 7;
Fig. 20 illustrates the calculation of a covariance matrix R for calculating parameters used, for example, in the nonlinear processor-2 of Fig. 10;
Fig. 21 illustrates the calculation of an eigenmatrix using covariance matrix R of Fig. 20;
Fig. 22 illustrates the calculation of an eigenmatrix E2b used in the nonlinear processor-2 of Fig. 10;
Fig. 23 illustrates the calculation of further parameters used in the nonlinear processor-2 of Fig. 10;
Fig. 24 illustrates the calculation of parameters used in the normalization processor of Fig. 11;
Fig. 25 illustrates marking of the speech signal;
Fig. 26 illustrates the determination of speech label vectors used in formulating a kernel;
Fig. 27 illustrates calculation of eigenmatrix and kernel parameters for further calculating of parameters used in the speech-element model-1 processor of Fig. 11;
Fig. 28 illustrates the formulation of a combined kernel K1 using the parameters of Fig. 27; the combined kernel is then used in the speech-element model-1 processor of Fig. 11;
Fig. 29 illustrates the calculation of eigenmatrix E3a used in nonlinear processor-3 shown in Fig. 13;
Fig. 30 illustrates the determination of speech label vectors used in formulating another kernel;
Fig. 31 illustrates calculation of eigenmatrix and kernel parameters for further calculating of parameters used in the speech-element model-2 processor of Fig. 14;
Fig. 32 illustrates the formulation of a combined kernel K2 using the parameters of Fig. 31; the combined kernel is then used in the speech-element model-2 processor of Fig. 14;
Figs. 33 and 34 illustrate the calculation of mean and standard deviation parameters which are used in calculating the logarithm of the likelihood ratio as illustrated in Fig. 14;
Fig. 35 illustrates the generation of tables for diphone and phoneme maps which are used in the phoneme estimate rearrangement depicted in Fig. 15;
Fig. 36 is a table of labels used in marking the speech as illustrated in Fig. 25;
Fig. 37 is a table of diphone and constituent phoneme labels to be used in the parameter calculations of Figs. 26, 30 and 35;
Fig. 38 is a table of isolated forms of phonemes used in the parameter calculations depicted in Figs. 26 and 30;
Fig. 39 is a table of diphones and constituent phonemes used in the parameter calculations depicted in Figs. 30 and 35;
Figs. 40 and 41 are tables of diphones and constituent phonemes to be used in determining the parameters depicted in Fig. 35;
Fig. 42 is a table of speech element labels used in speech-element model-1;
Fig. 43 is a table of phoneme isotype labels used in speech-element model-2;
Fig. 44 is a table of phoneme labels used in phoneme rearrangement processor 44 of Fig. 2;
Fig. 45 is a block diagram of a hardware configuration of the speech-recognition system of Figs. 1-2;
Fig. 46 is a block diagram of a second hardware configuration of the speech-recognition system of Figs. 1-2;
Fig. 47 is a block diagram of a third hardware configuration of the speech-recognition system of Figs. 1-2;
Fig. 48 is a block diagram of a fourth hardware configuration of the speech-recognition system of Figs. 1-2; and
Fig. 49 is a table explaining the relationship between the processing system figures, Figs. 2-18, and the parameter development figures, Figs. 19-35.
DETAILED DESCRIPTION OF AN ILLUSTRATIVE EMBODIMENT

OVERVIEW

This specification describes, with reference to Figs. 1-18, a processing system for recognizing speech. The parameters used in calculations performed by processors in the processing system, and their development, are described with reference to Figs. 19-35 and the various tables shown in Figs. 36-44. Hardware configurations of the processing system are described with reference to Figs. 45-48.
With reference to Fig. 1, a speech recognition system 10 includes a phoneme estimator 12, a word/phrase determiner 14 and a word/phrase library 16. The phoneme estimator 12 receives a SPEECH input signal from, for example, a microphone or a telephone line. The phoneme estimator 12 senses the energy of the SPEECH input signal and determines whether the energy exceeds a predetermined threshold. If it does, it indicates that speech, and thus phonemes, are present in the SPEECH signal. The phoneme estimator 12 then calculates corresponding phoneme estimates, that is, a group of output signals, each of which is an estimate of how likely it is that the SPEECH signal constitutes the phoneme associated with that output. The estimator also calculates the time between phoneme detections, that is, delta time.

The delta time values and the estimates are applied to the word/phrase determiner 14. The word/phrase determiner 14, using the time and estimate values, consults the word/phrase library 16, which contains words and phrases listed in terms of constituent phonemes. The word/phrase determiner 14 then assigns a word or phrase to the SPEECH signal and transcribes the speech. The output of the word/phrase determiner 14 may take other forms, for example, an indication of which of a group of possible expected answers has been spoken.

The details of the word/phrase determiner 14 will not be set forth here because the specific way in which the phoneme estimates are further processed is not part of the present invention. However, it is of interest that the word/phrase determiner 14 determines the meaning of the SPEECH input signal based strictly on the phoneme estimate values and the delta time values produced by the phoneme estimator 12, rather than on a more primitive form of data, for example, the raw speech or its frequency spectrum.
Fig. 2 is an overview of the phoneme estimator 12 shown in Fig. 1. It should be noted at this point that the drawings represent the various processes as being performed by separate processors, or blocks, as they could be in an appropriate hard-wired system. This segregation into separate processors facilitates the description, but those skilled in the art will recognize that most of these functions will typically be performed by a relatively small number of common hardware elements. Specifically, most of the steps would ordinarily be carried out by one or a very small number of microprocessors.

Referring again to Fig. 2, the phoneme estimator 12 receives a raw SPEECH signal and processes it, reducing the data through power-spectrum estimation in block 18 and power-spectrum reduction in block 20, as described in more detail below with reference to Figs. 3-4. The data-reduced signal is applied to both an energy detect processor 22 and a receptive field processor 24.

If the energy in the data-reduced signal is above a predetermined threshold, indicating the presence of speech, the energy detect processor 22 asserts a DETECT signal on line 22A. The asserted DETECT signal energizes the receptive field processor 24, which then further processes the data, assembling a receptive field. If the signal energy is below the threshold, the DETECT signal is not asserted and the receptive field processor 24 is not energized, prohibiting further processing of the SPEECH signal. The energy detect processor 22 and the receptive field processor 24 are described in more detail below with reference to Figs. 5-6.
Detecting the presence of phonemes in the received speech using the energy detect processor 22 replaces the two-path processing performed by the speech-recognition system described in the co-pending application entitled "Speech-Recognition Circuitry Employing Phoneme Estimation", of which this is an improvement. The earlier system, which is hereinafter referred to as speech-recognition system-I, examines the speech signal and detects the presence of either initial consonants or vowels in one processing path, and the presence of final consonants in the other processing path. Depending on which path produces the detect signal, the speech signal is further processed by a vowel, initial consonant, or final consonant processor. Thus the speech-recognition system-I requires three receptive field processors, each processing the speech signal to match it to a subset of phonemes, instead of the one used in the present system. The present system, through enhanced modeling and data reduction, is able to compare the signal representation with the full set of possible phonemes.

Referring again to Fig. 2, when a DETECT signal is asserted on line 22A, the energy detect processor 22 also produces, on line 22B, a signal proportional to the integrated energy of the speech signal, as described in more detail below with reference to Fig. 5.
The integrated energy signal is applied to an adaptive normalizer 26 which also receives the output of the receptive field processor 24. The integrated energy signal is used by the adaptive normalizer 26 to impose a second, higher energy threshold.

The adaptive normalizer 26 removes an estimated mean from the data, that is, from the output of the receptive field processor 24. The estimated mean is incrementally updated only if the integrated energy level of the data is above the higher predetermined energy threshold, signifying a speech signal with a relatively large signal-to-noise ratio. Thus the adaptive normalizer 26 does not update the estimated mean if the integrated energy level of the data is below the threshold, since the estimates may not be accurate in such cases. The effect of the operations of the adaptive normalizer 26 on data with a high integrated energy signal is to apply to the data an adapted mean which decays exponentially over a long "time constant".

The time constant is different for different situations. Specifically, the time constant in this case is measured, not in time itself, but in the number of instances input vectors are applied to the adaptive normalizer. A large number signifies that a particular speaker is continuing to speak. Thus, the characteristics of the speech and the associated audio channel should not change drastically for this speech. Accordingly, a long time constant may be used, and the mean of the data associated with this speech can be reduced to near zero.

Conversely, a small number of instances in which the input vectors are applied to the adaptive normalizer indicates that a new speaker may be starting to speak. Thus, the characteristics of the speech and/or the audio channel are not yet known. Accordingly, a relatively short time constant is used, and the adaptive average is quickly adjusted to reduce the mean of the data as close to zero as possible. The adaptive average is adjusted to accommodate the new speaker's pronunciation of the various sounds, for example, and also to accommodate the differences in the sounds due to the quality of the audio channel. The operation of the adaptive normalizer is discussed in more detail below with reference to Fig. 7.
The normalized data are next applied to a receptive field nonlinear processor 28 and thereafter to another nonlinear processor-2 30. The nonlinear processors 28 and 30, described in more detail below with reference to Figs. 8-9 and 10, respectively, manipulate the data, and each passes linear first order data terms and nonlinear second, third and/or fourth order terms. These terms are then passed to normalization processor 32. The normalization processor 32 normalizes the data and applies them to a first of two speech-element models. The normalization processor 32 is described in more detail below with reference to Fig. 11.

The speech-element model-1 processor 34 reduces the data applied to it using parameters, that is, selected speech labels, formulated from the development data. The speech-element model-1 processor 34 thus selects for further processing only the most significant data. The reduced data, which represent figures of merit related to the likelihood that the speech contains respective speech elements associated with the components, are then concatenated into triple vectors in block 36. Each input vector applied to processor 36 induces an output, which is formed, usually, from the input vector, the previous input vector and the subsequent input vector. The output may in the alternative be formed with zero-filler vectors, the choice depending upon the delta time signal 22C from the energy detect processor 22. The use of the subsequent input vector induces a delay in processor 36, which is described in more detail below with reference to Fig. 12.
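As a rough illustration of the triple-vector formation in block 36, the sketch below pads with zero-filler vectors when the delta time between detections is large; the specific gap rule, the cutoff value, and all names are assumptions introduced here, since the text says only that the choice depends upon the delta time signal.

```python
import numpy as np

def make_triples(vectors, delta_times, max_gap=3):
    """Concatenate each vector with its predecessor and successor.

    A neighbor is replaced by a zero-filler vector when the delta time
    separating it from the current detection exceeds max_gap time units
    (the gap rule is an assumption for this sketch).
    """
    zero = np.zeros_like(vectors[0])
    triples = []
    for i, v in enumerate(vectors):
        prev = vectors[i - 1] if i > 0 and delta_times[i] <= max_gap else zero
        nxt = vectors[i + 1] if i + 1 < len(vectors) and delta_times[i + 1] <= max_gap else zero
        triples.append(np.concatenate([prev, v, nxt]))  # look-ahead implies a one-step delay
    return triples

fields = [np.full(4, k, dtype=float) for k in range(3)]
print(make_triples(fields, delta_times=[0, 1, 9])[1])  # successor replaced by zeros
```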
The triple vectors are then applied to a third nonlinear processor-3 38. The nonlinear processor-3 38 manipulates the data by computation-intensive, nonlinear operations and then applies the data to a second speech-element model-2 processor 40, which produces estimates that the speech contains the speech elements (which we later refer to as phoneme isotypes) listed in Table 8 (Fig. 43). The speech-element model processors 34 and 40 are described in more detail below with reference to Figs. 11 and 14, respectively. The nonlinear processor-3 38 is described in more detail below with reference to Fig. 13.

Thereafter, the estimates are applied to a logarithm processor 42 which calculates the logarithm of the likelihood ratio for each. The estimates are then further simplified, that is, re-arranged and integrated, in processors 44 and 46 to ready the data for the word/phrase determiner 14 (Fig. 1). The simplified estimates and the delta time signal 22C from the energy detect processor 22 (Fig. 2) are then applied to the word/phrase determiner 14, which assigns words or phrases to the speech. The various processors 42, 44 and 46 are described in more detail below with reference to Figs. 14-18.
PHONEME PROCESSING

Referring now to Fig. 3, the power spectrum estimation processor 18 calculates a power spectrum estimate of the SPEECH signal by first converting the analog SPEECH signal to a digital representation in an Analog-to-Digital (A/D) converter. The A/D converter, which is of conventional design, samples the SPEECH signal at a rate of 8 kHz and produces 16-bit digital data symbols, a_m, representing the amplitude of the signal. The 8 kHz sampling rate is consistent with the current telephone industry standards.

The digital data samples, a_m, are then segmented into sequences of 128 data samples, as illustrated in block 102. Each of these sequences corresponds to a 12-millisecond segment of the SPEECH signal. The sequences can be thought of as vectors b_m 104, with each vector having elements b_k,m. The b_m vectors overlap by 32 data samples, and thus each b_m vector contains 96 new elements and 32 elements from the previous vector. Next the mean, or D.C., value of the signal segment represented by the b_m vector is removed in block 106, producing a vector c_m 108. The mean value conveys information which is of little or no value in phoneme estimation.

Referring again to Fig. 3, the vector c_m 108 is applied to a 128-point discrete Fourier Transform (DFT) circuit 110. Up to this point, the power spectrum estimation process is similar to the speech-element preprocessor of the speech-recognition system-I. However, in order to increase the resolution of the results of the DFT, the current system performs the DFT using 128 data elements, as opposed to the system-I, which uses 64 data elements and 64 zeroes. The 128 distinct elements applied to the DFT circuit are real, and thus only sixty-five of the 128, mostly complex, output values of the DFT, d_k,m, represent non-redundant data. The power spectrum is thus calculated by multiplying the DFT values d_k,m by their respective complex conjugates d*_k,m to produce corresponding real values, e_k,m. The sixty-five non-redundant values are retained in a vector e_m 114. The data are thus reduced by one-half while the information believed to be the most important to phoneme estimation is retained.
The power spectrum values e_k,m are applied simultaneously to a "von Hann window" circuit 116 and a band-limited energy circuit 118 (Fig. 4). The von Hann window circuit "smooths" the spectrum in a conventional manner, reducing the sidelobes that result from truncation in the time domain.
The smoothed vector f_m is applied to block 120, where various elements, f_k,m, of vector f_m are combined, producing a strategically reduced vector g_m 122. The reduced vector includes terms from a frequency range of 218.75 Hz - 3531.25 Hz. This range corresponds to signals received using telephone line communication.

The band-limited energy h_m from circuit 118 includes the energy within the same frequency range as that used for the vector g_m 122. The previous speech-recognition system-I used an energy term that was not band-limited in this fashion, but instead was the average power of the entire spectrum. Using the average power introduced some noise into the energy which was not derived from the speech itself.

The band-limited energy value, h_m, is concatenated with the vector g_m 122 in circuit 124 to form a vector p_m 126. Thus vector p_m contains a data-reduced version of frequency and energy information representing, for the most part, the center band frequencies of the SPEECH signal. Reducing the data in this way retains information of particular value for further computations while reducing the data to a manageable size.

The phoneme-identification information probably resides in the relative, rather than in the absolute, sizes of variations of the individual elements p_k,m of vector p_m 126. Accordingly, as in the speech-recognition system-I, the elements p_k,m, which are all positive or zero, are incremented by one, and the logarithms of the results are computed, as indicated in block 128. Incrementing the vector p_m elements by one ensures that the resulting logarithm values are zero or positive (log 1 = 0). The resulting values q_k,m are then applied to the energy detect processor 22 (Fig. 5) and the receptive field processor 24 (Fig. 6).
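The front end of Figs. 3 and 4 can be summarized numerically as follows. This is a minimal sketch of the stated steps; the band-edge bin indices and the rule for combining smoothed spectrum elements into the reduced vector g_m are placeholders, since the exact grouping is defined by the patent's figures rather than the text.

```python
import numpy as np

FS = 8000            # 8 kHz telephone-rate sampling
FRAME = 128          # 128-sample analysis vectors b_m
OVERLAP = 32         # each vector keeps 32 samples of its predecessor

def frames(signal):
    step = FRAME - OVERLAP                        # 96 new samples = 12 ms
    for start in range(0, len(signal) - FRAME + 1, step):
        yield signal[start:start + FRAME]

def log_power_vector(frame, lo_bin=3, hi_bin=57):  # band edges are assumptions
    c = frame - frame.mean()                      # remove the D.C. value (block 106)
    d = np.fft.rfft(c)                            # 128-point DFT: 65 unique bins
    e = (d * d.conj()).real                       # power: DFT values times conjugates
    f = np.convolve(e, [0.25, 0.5, 0.25], 'same') # von Hann smoothing of the spectrum
    g = f[lo_bin:hi_bin].reshape(-1, 2).mean(1)   # combine elements (placeholder rule)
    h = g.sum()                                   # band-limited energy term
    p = np.concatenate([[h], g])                  # energy concatenated with spectrum
    return np.log1p(p)                            # q = log(1 + p); zeros map to zero

speech = np.random.default_rng(1).standard_normal(8000)
q = [log_power_vector(fr) for fr in frames(speech)]
print(len(q), q[0].shape)
```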
~ H F~ fj F~
over 36 milliseconds. T~ the integrated energy, rm, ~xceeds a predetermined threshold a dataator 134 asserts a DETECT signal 22A, sm, iridiaating the pxasence oP speech. The DETECT signal, sm, may bs asserted at most ona~:a~v~ry three time units, as shown in block 134, sincr~ the subscript, m, of the energy parameter rm must be zero in modulo three arithmetic.
Each time the DETECT signal 22A is asserted, block 3.36 produces ~a delta time signal (gym) aorraaponding to the time betty~an this DETECT signal and the previous one. The delta time signal is applied to x~n .interval e~ctractie~n cirouit 13a, which produoas a time signal. ~" 2ZC. An associated energy axtraation circuit ~,4~ produces an integrated energy signal, t~
228. Hoth the ~~ and the tn signals correspond to th~ SPEECH
signal ~iv~ time units aarll~ar, as diec~usaad below with raf~ranca to F'ig. 6. . The parameter index has changed from ~,m~~
to "nn t.c~ ~amphasi~~ 'that tkl~ extracted delta time and integrat~d ~nargy signz~ls era produaad for only.cartain s~gmants of the ~pEECFi signal, that is, segments for which a DETB~T signal is asserted. .
The DETECT signal 22A is appliod, slang With the vaster corn 130, to th~ receptive field p~.'orie~lsor 24 shown in Figs 6. The intagrat~d energy signal 22B is applied to the adaptive normalizar 26 sh~awn in Fig. 7. The delta tim~ signal 2~C is applied both to the fox-matian of ~,riple vectors in processor 36 as shown in gig. 12, and to the eavimator int~c~ra~or 46 as discussed below with r~farsna~ to F~c~s. a.~ and 3,~.
R~sfarr~.ng now to ~'ig. 6, the nETECl' s~.gn~~. 22A energizes a r~~~ptiva fla~.d axtr~c~tion circuit 2a0 which aas~mbla~c a racap~tiwa field 202, that ia, a group of vectors containing frequency inf~rmation covering a signal segmon'~ 12 'time units long. The DETECT signal cGrrst~pands to a signal segment in the middle of the receptive field, that is, i~t~ a s~.gnal segment a time units earlier, or to the mp5 column ~.n the raaapt~.va field o ~ , i 5 . s o i 2 : 1 o p xvx ~ r~ a ~ T ~ x5 a~a a c L E rr rr F rJ, F x :~, ~ g r~
matrix 202. The delay is xamaes~sary to synchronize the delta time and int~agrated energy aignalo producsed by the onergy d~teot proc~,~ssor 22 (Fig, a) w~.°~t1 th~a racept~,ve field, centering as aloaely as possible thra signal segment for which t.h,~ DETECT signal is assert~ad. The receptive fields are xe~.atively large, 12 time units, and thus no information is Xost in limiting the D~T1~CT s~igraal to at most one every three time units.
~1n averaging circuit 204 averages pairs of adjacent veotorc~ of the reoeptive field matrix 202, that is. elements axed c~9~m.10 are averaged, elements t~osm~g and c,~~m.8 are av~raged, ~ta. This op~ration r~ducas data by ons~-halt, producing matrix U~ 206. The parameter index is aga~,n changed from "m°' to "nn to emphasize that ~r~captive fields a>id ~.ntegrated energy signa~,s are produced for only certain segments of 'the ~pEECH sigatal.
The sspesch-recognition system-I diaaussed above reduces the data by two-thirds by averaging the data over thxee time units. The reduced data are thereafter aubjeated to nanlin~ar processing. Using the currant cyst~m, however, better raaolut~.on ~.s achieved by avexaginr~ the matrix elements over only two time unite and r~taining more data. The ~oextra" data may be retaitaed at this point in the proc~sa because of enhane~ed data reduction within the reaeptivs field nonlinear px°ocessor 29, discussed below in connection with Figs. 8 and ~.
Ths matrix Ltn a0s, is neart ag~liad to the adaptive normaliz~r 26 Shawn in Fig. 7. Th~ adaptive normalizsr 2~
produces a matrix vn 210 by aubtraa~ting a fiat~d~-~oarameter mean ~a~9 sand then dividing by a fixed~~aramo~tar standard deviation, ~r',. Th~ fined-parameter mean and staa~dard d~,viation values are calaula'ted f~COm the development database as described below with r~afersnas to Fig. 1.9.
If the statistics of th~ inac~ming sF~EECFi signal are O 8 . 1 5 . 9 O 1 2 : 1 O P ZvI m N'U T T E R ~ c C; L ~ h71w7 E ~7, F j y' H
F' 1 _17_ sufficiently close to those of da~Ca in the development database than th~x "nc~rmali~~ad~~ matrix V~ 2~.0 has a mean close td zero and a standard deviation close '~o one. However, it is likely that the statistics of the incoming SPEECI~ signal era somewhat diff~ren~. from those of data in the development database.
end~ed, individual voic~ samples from the development database may have mtatistics that are different from those in the aggr~gate. Hence, for an individual SPEECH signal, we expect that matrix Vn will have a mean different from zero and a standard deviatirrn different from one. Accordingly, further as~aptive norma~.i~at~.on is applied in the circuitry of Fig. 7 in cxrder to allow at least the mean 'Go decay toward zero.
xf the matriu V~ 210 data correspond to a SPEECH signal B~gment for which th~ integrated energy, tn z2H (Fig. ~), is abave a predetermined value, indicating a high signal-to-noise ratio and thus voicing, the data are further proaesaed by calculating Chair adaptive average in blocks 212-z~,s and then subtracting tha~e~verage in block 220. .First, the data are averaged over time, that is, over th$ matrix rows, in averaging circuit ~2~.2 producing vector w" 21~. v~actor w~ thus contains only the signal frequency information. This information adequately aharaaterizes they speaker's voice and the. audio sshannel. These characteristics should not vary significantJ.y ~ver time, particularly over the time coxresponding to the matrix data. Averaging' the data over tim~ thus reduces them fr~m ld~ parametors, that is, 5.05 elements of matrix V~, to twenty-~ne parameters, that is, twenty-one elemenfis s~~ vector w~.
The elements of veotar wh Z1~ axe applied to exponential averaging circuit 216. Adaptive averaging circuit 27.6 thus Gampareg 'the int8grated energy, tn 22~, calculated in energy detect processor 22 cfig. a) with a predetermined threshold value which is higher than the det~ct threshold used in the O 8'. 1 5 . 9 CJ 1 c : 1 U F' IVI he T d U T T F R. ~ c C L, E Iv7 I 7 ~ N, F
I S' H F' ~, -ls-energy datavt processor a2. Thus averaging circuit 216 detects which signal aegm~nts hava~ h~.gh signal-to-noise ratios, that is, which segments have significant voice components.
Tf ths~ iritagratad energy do~g not exceed the "voias'r threshold-value, the adaptive average vaCtor x'~ 218 rema~,ns what it was at the previoum instande, x'~~~. In this case the sxpon~ntial av~rage is subtracted in k~~lock 22o as b~fore, however, the average itself is not changed. Signal segments with energy values below ~.he voice-threshold may correspond, on the one hand, to unvoiced fricative or nasal phonemes, but may also correspond, on ~th~a athsw.hand, to breathing by the speaker or other cyuiet no~,s~s, pax°ti.~sular7.y at the and of breath groups. Such low-energy signal segments may not be re~.~.~ab~.a in ohmracterixing the mean of V~c~tor W~ 214 ~Or purposes of xeaogniaing phonemes.
the exponential averaging is performed using a time period which is relatively long with raspaa~t an individual phoneme but shoat wham compaxed with a serieas cf words or phrases. The averaging thus does not. greatly affect th~ data r~alating to a single phon~me but it do~s redue~ to near zero the mean of the data xalat~.ng to words ox phrases.
mhe time period us~ad varies depending on the length of t~.me the system has bean proeessing the speech. Spaai~iaa7.ly.
exponential av~raging is performed over sith~r a short time period corresponding to, for example, loo receptive fields with sufficient energy m approximately 3.6 seconds - ar a long~r time period corresponding to, for axamplg, 30~ receptive fields with sui~ficient, energy - approximate7.y 10 seconds - the length c~f' time depends an the number of times t~ha integrated energy s~,gnal 2z8 has e~tcsadad the Voice-threshold, that. is, the number of times that t~ ~ 25. ~ha ~ahoxtor t.ima period is used when the system ~naountars a new spr~ak~zr. It thereby adapt~a ~uia%ly 'Go the speaker's eh~craat~ristics and the 08. 1 ra. 90 1 2 : 1 O PIVL ~NUTTEPt/$c CLENI~7E2~7, y I ,H y j, ,v -.~
.~x.9_ raharaataristias of th~ audio ahanne~l. ~hsrea~'ter, the system uses the longer lima period to proaass that s~pr~aker ~ s spaaoh because his or her voit~a tsh~lrar~teri~stias and those of the audio ohannel are assumed to remain r~lativa~,y oonstant.
once the csalaulatians Ear the adaptive av~raga vector x~~
21~ are completed, the adaptive avar2~ge vector is subtracted from the matrix ~h 21.o elem~nta in blank 22o to proc9uca matrix x~ 222. the mean of the data in matrix X~, GdrrBE~ponding tp a long time period and representing speech signals that include voicing, is now o~.oao to Sara. Matrix Xn im ttaat~. t~pplied to raaeptive field nonlinear processor 2~ shown in black diagram ~arm in Figs. ~ and 9.
In comparison with the oorroa~p~nding nonlinear processing described in our previous application, the nonlinear processing of Figs. 8 and 9 calaulat~s Eew~r nonlinear al~ments. ~s a conseequenae, w~ have b~en ab~.e to retain more data in ~arlie~r ~procassin~ arid thus supply high~r~r~solution data to the t~alaulation of the more-important nonlinear produots that w~a do caloulat~.
with r~egerariae to Figs. a and g, the alemants of matri.x Xn 222 are combined as linear terms and also as specific, partial outer products in blpoks 224~23~. ~ssentia~.ly, the J.in~ar terms and the partial c~ut~ar produc~t~ ar~ added over the time dimenaian of th~ r~ceptive field. xhess specific products arse designed to canv~Y o~rtain intorrnation about the speech signal wt~i~,~ significantly reducing the data from wha~G thoy would b~
iE a straight outer product, that is, all products of distinafi mat~rix~element ~~irs w~re calculated. ~'ha ~arlisr spseah-reaognition system-z calculatoa a straight outer product at 'this paint in the processing, mnd thug, data aura required to bs significantly reduced during prior processing. 3'he current systatn, era the other hand, may retain mare data up to this point clue to this roox~lin~ar ~srace~s~~ing step, and it thus OR. 1 ~. 9O 1 2 : 1 O PM >xI~IUTT~R,NLc, CL~NN~h7, 1~ I SH P 1, maintains better resolution caf the inoaming data.
Th~x recaptivsa-field nonlinear proe~essor 2& pa~oduaes fbur vector groups. ~aoh vector group oontains v~otors y~~, z°,n~
and x~n and is assaaiated with a diffex~ent time delay, Th~ y~~
vectors c~onta~~,n seta which era a linear ar~mbinatiosa of the terans used in forming the two associated '~z'~ veatars. Th~ z'~,~
vectors contain the results of combining certain partial. outer products form~d using the energy, or first, terms in the variaus matrix Xn 222 aolumns, and the z~~ vectors contain the resu~,t.s of specific partial outer products f~armod using the ndn~e~nargy. or frequency, terms of the an~atrix x~ co~.umns. The ~mrmation of aa~ah aE these vec~tc~rs is discussed below.
significant time averaging 3s performed in reaGptiv~a-field non~.in~ar processor Vie. It is assumed that a phoneme is °'atat;onary~~ within a receptive field and thus that th~a location c~f a given frequency ~c~alumn within the receptive field does nr~t convey muoh us$ful signal infermation. However, th~
nonlin9ar combinations cf frsc~usncy columns, averag~d over the time window of the rac~apt~,va fia~.d, do represent information which is useful ~'ar speech reaognitian.
As set forth above, a vector group ie formed for each ~f four t~.me~-diffarsnae segm~nts. vector groups for further time d~.ft~er~nees ar~ not aalculat~sd because information r~lating to variar~aas near greater time dif~aranoes appa~ra o~ l,ittla value.
Specifically, the vector group for a time difference of zero (δ = 0) is formed in blocks 224-228 of Fig. 8. Block 224 produces the first element of the vector y_0,n by adding together the first elements in all of the columns of the matrix X_n 222. It produces the second vector element by adding together the second elements in all the columns, and so on. Accordingly, the vector y_0,n has as its elements the matrix data summed over time.

The second vector in the vector group, vector z'_0,n, is formed using the matrix energy terms, which are the first elements of the columns. Block 226 forms, for each column, the product of the energy term and all the other elements in the same column. The products are then summed to form the elements of vector z'_0,n. The vector elements are thus the energy products summed over time.

The third vector in the vector group for the time difference of zero, z_0,n, is formulated in block 228. This block forms all the products among the matrix X_n 222 frequency elements, that is, among all the elements in a column except the first. One could here use the outer product, taking all of these products separately. Instead, a sum is formed of these products which is like that in an autocorrelation. This sum is called a "self-product" in block 228 since it is formed from within the frequency elements of a single column. This self-product is then summed through time, i.e., over all the columns. Taking the self-products within the frequency columns instead of the full outer product strategically reduces the output vector from what it would have been if a full outer product were calculated. Thus the nonlinear processor can process a larger input vector containing more signal frequency data, that is, data with higher frequency resolution.
V~ator graupa far time differences of ~,, 2, and ~ arc°
ca~lculat~d in blacks 230~2~4 ahawn 3.n F'ig. 9. The vector Y n aontaina linear aambinationa of all the ~al~ments used in formulating 'the two associated ~~zn vgatora. Thus for a ~~.me difference o~P 1 (~ ~ 1) , vector ylsn cvrat~ains vombinations of all the el~manta that are one column apart, i.e., the elements in adjacent columns. Similarly, the y~~ vectors for time d~.~ferenaea 2 and 3 ~ar~ formulated by combining all the elements that era at least two and three columns apart, reapectiv~ly.
O~. 15. 9U 12: 10 P2JZ %~NUTTEFIMcCLEBJhTEIT, FZ~73 yl:., vector z~~~~ is formulated in block 232 by combining the energy farms with matrix r~lements which are ore column apart.
8imiletrly, the vector ~~~n im formulated in block 23A by Combining frequency elements that are, one column apart. 7Chus, the "~" veCtore contain e~,°smenta r~presenting certain combinations of the en~rgy and frequency terms from columns relating to the $ppropriate time difference. 8lmilarly, the vector groups for time differences a and 3 (e ~ z, s) are formed by,Cambaning elements which are two and three columns apart, respectively.
Vectors ~~~ are formed in block 234 by Cambining all the products of the fr~qu~ency ta~ema i~ram paixs o~ columns. mhe produata are ~ummed in a fasshion lake that of a Cross-correlatian be°twe~n frequency vectors, mhe sum in block 234 is called a ~~aroas pradu~st'° since it is formed be~tw~en the ~requanoy elements of two d~.ff~arent. columns. this c;roes product is then summed through tim~, i.~., over all the pairs of aolttmna adh~ring to the times d~.:~ferenae ~. Again, taking the cross prs~duvt of block 234 strategically r~duces the output veatar Erom what it would have bean if a full cuter product were Calculated. Hence, the input vector may be larger.
The vector groups are then concatenated in block 236 to form a 431-element vector an 238, which is a nonlinear representation of the data. The superscript "T" in block 236 denotes the transpose of the vector as written. It is important to note that, although the nonlinear processor 28 employs multiplication to produce nonlinear interactions between elements, other nonlinear functions could be used in place of multiplication; the important feature is that some type of nonlinear interaction occur. We employ multiplication merely because it is simple to implement.
The vector an 238 is applied to the second nonlinear processor-2 30 (Fig. 2), which is depicted in Fig. 10. The elements of vector an are first decorrelated and data-reduced by multiplying them by an eigenmatrix E26. The eigenmatrix E26 is formulated from the development database as illustrated in Fig. 22. The eigenmatrix E26 contains eigenvectors corresponding to the twenty-six largest eigenvalues calculated from the development data corresponding to the vector groups. Thus multiplying an by the eigenmatrix reduces the data to the components of an lying in the directions of the twenty-six eigenvectors selected as accounting for the most variance. The data are thus reduced from the 431 elements in vector an to twenty-six elements in vector bn 242. By so reducing the data, we lose only about 4% of the information relating to signal variance. Accordingly, without sacrificing much of the important signal information, a compromise is achieved between (i) retaining complete signal information and (ii) limiting the number of parameters subjected to nonlinear processing and thus to a geometric expansion in the number of parameters. We believe that by selecting the information corresponding to the largest eigenvectors we are selecting the information which is most important for phoneme recognition after further processing.
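As a sketch, the data reduction itself is a single matrix-vector product; the layout of E26 as a 26-by-431 array of eigenvector rows is an assumption about the storage convention.

```python
import numpy as np

def reduce_with_eigenmatrix(a_n, E26):
    """Sketch of block 240 (Fig. 10): project the 431-element nonlinear
    vector a_n onto the 26 leading eigenvectors, giving the 26-element,
    decorrelated vector b_n."""
    return E26 @ a_n
```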
The resulting twenty-six-element vector, bn 242, is subjected to fixed-parameter normalization in block 244. The mean values depicted in block 244 are formulated from corresponding elements in a group of twenty-six-element vectors bn in the development database, as discussed in more detail below with reference to Fig. 23. The twenty-six elements in the vector bn generated for the incoming SPEECH signal are compared with the averages of corresponding elements in the development database. The relative data values, rather than the actual values, are important for phoneme estimation. The mean values, which may add little information, are thus eliminated from the vector elements. We may omit this normalization processing step from future embodiments.
A full outer product of the twenty-six elements of "normalized" vector cn 246 is then formulated in block 248. The result is a 351-element vector dn 250 containing third- and fourth-order terms relative to the adaptive receptive field matrix Xn 222 (Fig. 7). This vector dn is concatenated with the elements of vector an 238, forming a 782-element vector en 254. The concatenated data are then applied to the normalization processor 32 (Fig. 2). Again, while we employ multiplication in block 248 because that is a simple way to produce nonlinear interaction results, other nonlinear functions can also be employed for this purpose.
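The 351-element count is consistent with keeping each distinct pairwise product once (26 x 27 / 2 = 351); the sketch below assumes that reading and shows the concatenation to the 782-element vector en.

```python
import numpy as np

def outer_product_expand(c_n, a_n):
    """Sketch of blocks 248-254 (Fig. 10): full outer product of the 26
    normalized elements, keeping each distinct product once, concatenated
    with the 431-element vector a_n."""
    i, j = np.triu_indices(len(c_n))   # upper triangle, diagonal included
    d_n = c_n[i] * c_n[j]              # 26*27/2 = 351 product terms
    return np.concatenate([a_n, d_n])  # e_n: 431 + 351 = 782 elements
```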
Referring to Fig. 11, the vector en 254 is subjected to another fixed-parameter normalization in block 256. Thereafter the data in the resultant vector fn 258 are subjected to vector-by-vector normalization. That is, each individual vector fn is normalized so that, across its 782 elements, the mean is zero and the standard deviation is one. The resulting normalized vector gn 262 is applied to the speech-element model-1 processor 264. The data are thus reduced to a set of speech element estimates, with each estimate corresponding to one of the labels in Table 7 (Fig. 42). Further nonlinear processing can be performed on the reduced data to better estimate which particular speech element the data represent. The speech-element model-1 processor 264 multiplies the normalized vector gn 262 by a kernel K1. The kernel K1 contains parameters relating to specific speech element labels calculated using the data in the development database. These labels are listed in Table 7 (Fig. 42). Formulation of the kernel K1 is discussed with reference to Fig. 28 below.
Multiplication by the kernel K1 effectively multiplies vector gn by each of ninety-four vectors, each of which is associated with a different speech element listed in Table 7. The multiplication generates a vector hn whose components are ninety-four figures of merit, each of which is related to the likelihood that the speech contains the speech element associated with it. The speech-element model-1 processor 264 thus strategically reduces the data relating to the incoming SPEECH signal, that is, vector gn, from 782 elements to ninety-four elements.
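Read this way, model-1 is two normalizations followed by a 94-by-782 matrix-vector product; the sketch below assumes that shape for K1 and that the fixed parameters mu and sigma come from the development database.

```python
import numpy as np

def model_1_scores(e_n, mu, sigma, K1):
    """Sketch of Fig. 11: fixed-parameter normalization (block 256),
    vector-by-vector normalization to zero mean and unit standard
    deviation across the 782 elements, then the kernel multiply that
    yields 94 speech-element figures of merit h_n."""
    f_n = (e_n - mu) / sigma
    g_n = (f_n - f_n.mean()) / f_n.std()
    return K1 @ g_n                      # one figure of merit per Table 7 label
```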
The vector hn 266 containing the reduced data is then concatenated with vectors from two previous time periods in processor 36, shown in Fig. 12. The delta time signal 22C (Fig. 5) is also applied to the processor 36. Specifically, both the vector hn and the delta time signal 22C are applied to buffers where the respective values for the two previous time periods of each are stored. Thus the two buffers contain information relating to the same three-time-unit-long time period. If two consecutive vectors correspond to delta time signals longer than 12 milliseconds, we assume that the vectors are derived from non-overlapping receptive fields. Thus the vector corresponding to the long delta time signal, that is, either the first or third vector stored in the buffer, will add little information which is helpful in assigning phoneme estimates to the center vector hn. Accordingly, the corresponding vector is replaced with all zeros. This ensures that the triples formed in block 304, i.e., vectors pn 306, do not contain non-contiguous data. The triple vector pn 306 thus covers an enlarged "window" in continuous speech, formed from data derived from three overlapping receptive fields.
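A sketch of the tripling step; the delta-time bookkeeping, that is, which gap disqualifies which neighbor, is our interpretation of the buffering described above.

```python
import numpy as np

def form_triple(h_first, h_center, h_last, dt_first, dt_last, max_dt_ms=12):
    """Sketch of processor 36 (Fig. 12): concatenate three consecutive
    94-element vectors into a 282-element triple p_n, zeroing a neighbor
    whose delta time exceeds the 12 ms contiguity limit."""
    if dt_first > max_dt_ms:
        h_first = np.zeros_like(h_first)   # non-contiguous: contributes nothing
    if dt_last > max_dt_ms:
        h_last = np.zeros_like(h_last)
    return np.concatenate([h_first, h_center, h_last])  # 3 x 94 = 282
```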
In subsequent modeling the specific phoneme labels associated with the larger window are those of the central receptive field, so that the phonemes recognized are centered as much as possible in the larger window. Many phonemes, for example the "au" in the word "thousand", are heard more distinctly over a relatively long time period and thus should be more easily recognized using this larger window. However, if the system is receiving SPEECH signals corresponding to rapid speech, the longer time period may result in more than one phoneme per window. Further nonlinear processing and speech modeling allow the system to recognize and separate such phonemes.
Referring to Fig. 12, enlarging the phoneme-estimate time window at this point in the processing is more effective in phoneme recognition, for example, than increasing the size, that is, the relevant time period, of the receptive field. Increasing the time period covered by the receptive field increases the number of parameters, assuming the resolution of the data remains the same. Then, in order to perform nonlinear processing using the larger receptive field without unduly expanding the number of parameters the system must handle, the resolution of the data, either per time period or per frequency distribution, must be reduced. Lengthening the time window at this point in the processing, that is, after a first speech-element modeling step reduces the data by selecting data relating to particular speech elements, instead of lengthening the receptive field time period, allows the system to look at data representing a longer segment of the incoming SPEECH signal without unduly increasing the number of data parameters and/or without reducing the resolution of the data.
Referring still to Fig. 12, by enlarging the phoneme-estimate time window we eliminate some of the context-dependency labeling of the earlier speech-recognition system-1. The speech-recognition system-1 alters phoneme labels depending upon context. For example, if a vowel was preceded immediately by an unvoiced consonant or a voiced consonant, then the label of that vowel was changed accordingly. As a consequence, phoneme labels, particularly those for vowels, proliferated. In the present system, however, the great majority of phonemes have only one label, and the increased nonlinearity of the data conveys the context of the phoneme labels to the word/phrase determiner 14 (Fig. 1). The number of labels, and thus the spellings stored in the determiner, is significantly reduced, and this reduction expedites the search for the appropriate word or phrase.
Referring now to Fig. 13, the output triple vector pn 306 from Fig. 12 is applied to the third nonlinear processor-3 38. This nonlinear processor is similar to nonlinear processor-2 30, shown in Fig. 10, with two differences. First, there is no fixed-parameter normalization here. Second, and more importantly, there is a threshold here. Prior to forming the outer product in processor-3 38, the data are compared with a threshold in block 308. The threshold is set at zero: vector pn 306 contains estimates of the likelihood of each speech element. Thus an element of vector pn that is below zero indicates that a speech element that has been processed by speech-element model-1 264 (Fig. 11) is unlikely to have occurred at the corresponding place in the concatenated window.
The rationale for applying the threshold 308 is as follows: vector pn 306 is decomposed into eigenvector components in block 312, and then passed through an outer product in block 316 which greatly expands the size of the vector. The expansion in vector size means that a relatively large number of parameters will be devoted to processing the vector in subsequent processing. Hence, care should be taken to formulate a vector containing only the most important information before the expansion in size. In the interest of deploying parameters in the most efficient manner, it is better to ignore the model values of the great majority of speech elements that are unlikely to have occurred at a given time. These speech elements have model values below zero. Thus, using the threshold 308, what is passed to further nonlinear processing is characterized by the model values associated with the speech elements that are likely to have occurred.
Referring still to Fig. 13, vector pn 306 components exceeding the predetermined threshold are strategically decorrelated and reduced by multiplying the data by an eigenmatrix E33 in block 312. The eigenmatrix E33 is formulated from the eigenvectors associated with the thirty-three largest eigenvalues calculated from data in the development database corresponding to vector qn 310, as discussed in more detail with reference to Fig. 29 below. The data are thus reduced by selecting for further nonlinear processing only the components of the data lying in the directions of the thirty-three largest eigenvectors. The compromise between retaining signal information and reducing the number of parameters subjected to nonlinear processing results, at this point in the processing, in retaining approximately 50% of the information accounting for signal variance while reducing the number of parameters subjected to nonlinear processing from 282 to thirty-three. The resulting data values, vector rn 314, are applied to block 316, where the complete outer product is formulated. The results of the outer product are then concatenated with the vector pn 306 to form an 843-element vector tn 320. This vector contains terms with a high degree of nonlinearity as well as all the components of vector pn 306. It thus contains the data on which the nonlinear processor-3 operated as well as the data which fell below the threshold.
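The element counts above are internally consistent: an outer product of 33 components, keeping each distinct product once, gives 33 x 34 / 2 = 561 terms, and 561 + 282 = 843. The sketch below assumes that reading, and reads the zero threshold as clamping negative estimates to zero.

```python
import numpy as np

def nonlinear_processor_3(p_n, E33):
    """Sketch of Fig. 13: threshold (block 308), decorrelate and reduce
    282 -> 33 (block 312), outer product (block 316), then concatenate
    with the original p_n to form the 843-element vector t_n."""
    q_n = np.maximum(p_n, 0.0)           # keep only likely speech elements
    r_n = E33 @ q_n                      # components along 33 leading eigenvectors
    i, j = np.triu_indices(len(r_n))
    outer = r_n[i] * r_n[j]              # 33*34/2 = 561 higher-order terms
    return np.concatenate([outer, p_n])  # 561 + 282 = 843; keeps sub-threshold data
```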
Again, while we employ multiplication in block 316 because that is a simple way to produce nonlinear interaction results, other nonlinear functions can also be employed for this purpose. The 843-element vector tn 320 is then applied to the second speech-element model-2 processor 322 shown in Fig. 14.
The speech-element model-2 processor multiplies the data by a kernel K2, producing a vector un 324. Kernel K2 has elements corresponding to the speech elements (which we refer to below as "phoneme isotypes") listed in Table 8, Fig. 43. Vector un contains the speech element (phoneme isotype) estimates. The kernel K2 is formulated from the development data as discussed with reference to Fig. 32 below. Phoneme isotypes are discussed in more detail below.
Kernels K1 and K2 differ in size and effect. Kernel K1, discussed with reference to Fig. 11 above, contains elements which represent a simpler set of speech elements than those for which we model using kernel K2. These speech elements are listed in Table 7, Fig. 42. For example, kernel K1 contains elements corresponding to the speech element "b", and each occurrence in speech of a "b", whether it is an initial "b", a bridge "_b_", etc., is mapped, using kernel K1, to the entry "b". Kernel K2 contains entries which distinguish between an initial "b", a bridge "_b_", etc. The speech elements associated with kernel K2 are listed in Table 8, Fig. 43. The speech element (phoneme isotype) estimates are next applied to the likelihood ratio processor 42, which translates each estimate into the logarithm of the likelihood that its speech element is present. The likelihood for each speech element is calculated assuming normality of the distributions of estimate values both when the phoneme is absent and when the phoneme is present. The logarithm ensures that further mathematical operations on the data may thereafter be performed as simple additions rather than the more time-consuming multiplications of likelihood ratios.
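Under the stated normality assumption, the log likelihood ratio for an estimate value is the difference of two Gaussian log densities, which collapses to a quadratic polynomial in the estimate (consistent with the polynomial coefficients mentioned with Fig. 34 below). A minimal sketch:

```python
from math import log, pi

def log_likelihood_ratio(u, mu_present, sd_present, mu_absent, sd_absent):
    """Sketch of processor 42 (Fig. 14): log P(u | phoneme present) minus
    log P(u | phoneme absent), each density assumed Gaussian."""
    def log_gauss(x, m, s):
        return -0.5 * ((x - m) / s) ** 2 - log(s) - 0.5 * log(2.0 * pi)
    return log_gauss(u, mu_present, sd_present) - log_gauss(u, mu_absent, sd_absent)
```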
The resultant logarithm of the likelihood ratios, in vector vn 328, is applied to the phoneme estimate rearrangement processor 44 shown in Fig. 15. The rearrangement processor 44 manipulates the data into a form that is more easily handled by the Word/Phrase Determiner 14 (Fig. 1). While some of the rearrangement steps are designed to manipulate the data for the specific Word/Phrase Determiner used in the preferred embodiment, the simplification and consolidation of the data by rearranging the speech element estimates may simplify the determination of the appropriate words and phrases regardless of the particular word/phrase determiner used in the system.
The phoneme rearrangement processor manipulates the data such that each speech element is represented by only one label. Accordingly, the Word/Phrase Determiner 14 need only store and sort through one representation of a particular phoneme and one spelling for each word/phrase.
Each speech element estimate vector should include the estimates associated with one phoneme. However, some of the vectors may include diphone estimates as set forth in Table 8 (Fig. 43). Such speech element estimate vectors are split in block 330 in Fig. 15 into constituent phonemes. The estimates for the first portion of the diphone are moved back in time and added to signals from the earlier signal segment, and estimates for the second portion of the diphone are moved forward in time and added to any signal data present in the later time segment. While the order of the phonemes is important, the placement in time of the phonemes is not. Most words and syllables are at least several of these 36-msec time units long. Thus, separating diphones into constituent phonemes and moving the phonemes in time by this small unit will not affect the matching of the estimates to a word or phrase.
Once the diphones are separated into constituent speech elements, the speech elements are reduced in block 334 to the smallest set of speech elements (which we refer to below as "phoneme holotypes") required to pronounce the words/phrases. These speech elements are listed in Table 9, Fig. 44. For instance, all final and bridge forms of phonemes are mapped to their initial forms. Thus the individual speech element scores are combined and negative scores are ignored.
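The sketch below walks the two steps just described over per-window score dictionaries; the dict-based maps stand in for the table-driven matrices of Fig. 35 and are purely illustrative.

```python
def rearrange(windows, diphone_map, holotype_map):
    """Sketch of blocks 330 and 334 (Fig. 15): split diphone estimates
    into constituent phonemes (first part one window earlier, second part
    one window later), then map every isotype to its holotype, combining
    scores and ignoring negative totals."""
    out = [dict() for _ in windows]
    for t, labels in enumerate(windows):
        for label, score in labels.items():
            if label in diphone_map:                       # e.g. "jE" -> ("j", "E")
                a, b = diphone_map[label]
                targets = [(max(t - 1, 0), a), (min(t + 1, len(windows) - 1), b)]
            else:
                targets = [(t, label)]
            for tt, lab in targets:
                hol = holotype_map.get(lab, lab)           # e.g. "_d_" -> "d"
                out[tt][hol] = out[tt].get(hol, 0.0) + score
    return [{k: v for k, v in w.items() if v > 0} for w in out]
```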
The simplified speech element (phoneme holotype) estimates are applied to the phoneme estimator integrator 46, which is shown in block diagram form in Figs. 16-18. With reference to Fig. 16, scores for given phonemes are grouped over time in block 338, along with the associated delta time signal 22C from energy detect processor 22 (Fig. 5). Block 342 keeps track of the absolute time in the grouping. The scores for a given phoneme are then consolidated into one time location in blocks 344 and 346 (Fig. 17). Referring now to Fig. 17, the summed phoneme estimate score is equated with the closest "centroid" time in block 348, that is, the time indicating the center of the weighted time period over which a particular phoneme is spoken. Times within this period are weighted by the phoneme estimate value. Then the associated phoneme label code, phoneme estimate value, and centroid time of occurrence are stored in a location of memory, as shown in block 352. The storage is accessed by block 353 in Fig. 18, and the entries are ordered by the centroid time of occurrence, so as to provide a correct time ordering. Output phoneme estimates am and the associated delta time values dm are then accessed by the Word/Phrase Determiner 14 (Fig. 1). The subscripts have changed notation again, from "n" to "m", to indicate that the outputs of Fig. 18 have a time base distinct from that of the inputs.
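The centroid time, as described, is an estimate-weighted average of the times at which a phoneme's scores occur; a one-function sketch:

```python
import numpy as np

def centroid_time(times_ms, scores):
    """Sketch of block 348 (Fig. 17): the center of the time period over
    which a phoneme is spoken, each time weighted by its estimate value."""
    w = np.asarray(scores, dtype=float)
    t = np.asarray(times_ms, dtype=float)
    return float((t * w).sum() / w.sum())
```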
The operation of the speech-element model-2 processor 40, and the rearranging and consolidating of the phoneme estimates produced by the system, are illustrated by considering the processing of the word "yesterday." The set of speech elements for which the speech is modeled includes subsets, each of which consists of speech elements that are all what we call "isotypes" of a single phoneme "holotype". For example, the speech elements "_v", "_v_" and "v" of Table 8 are all isotypes of the same holotype, in this case "v". The phoneme isotype estimate labels that are assigned to the speech by the speech-element model-2 processor, ignoring noisy or bad estimates, may be:

d; j; jE; E; _s; isol.t; t; t R; R; _d; _d_; d eI; eI

Here we have examples of several different phoneme possibilities. This is a somewhat schematic version of the phonemes which appear in clearly articulated speech. Each of the listed elements represents those phoneme isotypes that would appear in contiguous windows in the speech, each window corresponding to a detected receptive field. The symbols between semicolons occur in the same window.
The syllabic form "d" precedes the "j" glide, as if the utterance were "d-yesterday." The "j" glide appears again in the diphone "jE." The next window reiterates the vowel "E." The final form of "s" appears next, as "_s", indicating that there is some voicing heard before the fricative but not enough to be identified as a particular vowel. The unvoiced stop "t" is expressed here first in its isolated form "isol.t", indicating that there is no voicing heard in the window, and then in its initial form "t." The next window contains two phonemes, another initial "t" and a syllabic "R", which is re-iterated in the next window. The "d" appears first as a final "_d", then in its "bridge" form "_d_", and then as an initial "d." The bridge form contains voicing from both the "R" and the final vowel "eI" in the window, but not enough of each of these to justify their being labeled in the same window with the bridge. The final vowel is repeated.
If the SPEECH signal contains noise, various windows may contain phoneme isotype estimates relating to the noise. These estimates, typically with smaller likelihood numbers, are processed along with the phoneme estimates corresponding to the spoken word or phrase. The effect of these "noise" phoneme isotypes is an increase in the time it takes the word/phrase determiner 14 (Fig. 1) to process the phoneme estimates.
Referring again to the manipulation of the phoneme isotype estimates listed above, block 330 (Fig. 15) splits the diphone "jE" into its constituent phonemes:

d; j; j; E; E; _s; isol.t; t; t R; R; _d; _d_; d eI; eI

Block 334 then substitutes phoneme holotypes for each occurrence of any of its isotypes:

d; j; j; E; E; s; t; t; t R; R; d; d; d eI; eI

Finally, the estimator integrator 46 (Figs. 16-18) consolidates instances of each phoneme holotype, so that multiple instances are eliminated:

j; E; s; t R; d; eI

The result is the phoneme estimates for the speech. Each phoneme is treated here as though it had occurred at a single centroid time of occurrence. These centroid times are no longer restricted by the modulo-3 constraint of the detects (Fig. 5); however, the order of the various labels is retained to ensure the correct phonetic spelling of the word. Only those phonemes which are close enough to be considered in the same word or phrase are so consolidated.
Note that in this example the consolidated "t" is assigned to the same window as the syllabic "R." This will occur if the centroid times of occurrence of the two phonemes are sufficiently close.

PARAMETER DEVELOPMENT

The development of the parameters used in calculating the phoneme estimates is discussed with reference to Figs. 19-35.
Fig. 19 depicts the calculation of the fixed parameters μij and σij used in normalizing the data corresponding to the incoming speech in the adaptive normalizer 26 (Fig. 7). The fixed parameters used throughout the processing, including the mean and standard deviation values, are calculated using the data in the development database. The development database is formulated from known speech signals. The known signals are applied to the speech processor and the data are manipulated as set forth in Figs. 3-18. Then various parameters useful in characterizing the associated phonemes at various points in the processing are calculated for the entire development database. These calculated, or fixed, parameters are then used in calculating the phoneme estimates for incoming signals representing unknown speech.
Referring to Fig. 19, a mean, μij, is calculated for each of the elements, uij,n, of the receptive field matrices Un 206 formulated from the development data. First, the corresponding elements from each of the Un matrices in the development data are averaged, resulting in a matrix μ 402 with the various calculated mean values as elements. Next the standard deviation values, σij, of the corresponding elements of the Un matrices are calculated using the associated mean values, μij, resulting in a matrix σ 404 with the various calculated standard deviation values as elements. The fixed mean and standard deviation parameters are thereafter used in the adaptive normalizer to normalize each element of the matrix Un formulated for the incoming, unknown speech.
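A sketch of the element-by-element statistics, assuming the N development receptive-field matrices are stacked into one numpy array:

```python
import numpy as np

def fixed_normalization_params(U_dev):
    """Sketch of Fig. 19: U_dev has shape (N, rows, cols), one slice per
    development receptive-field matrix U_n; mu and sigma are computed
    element by element across the N instances."""
    mu = U_dev.mean(axis=0)     # matrix 402 of mean values
    sigma = U_dev.std(axis=0)   # matrix 404 of standard deviations
    return mu, sigma

# at recognition time each element of an incoming U_n is normalized as
# (U_n - mu) / sigma in the adaptive normalizer 26
```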
Fig. 20 defines a covariance matrix R 410 which is used in calculating various eigenmatrices. The covariance matrix R corresponding to the N input vectors an 406 formulated for the development data is calculated as shown in block 408. The covariance matrix R is then used to calculate eigenvectors and associated eigenvalues as shown in Fig. 21. Referring to Fig. 21, the eigenvalues are calculated in block 412 and ordered, with vector b0 414 being the eigenvector having the largest eigenvalue and bA-1 being the eigenvector having the smallest eigenvalue. The eigenvectors are then normalized by dividing each one by the square root of the corresponding eigenvalue to produce a vector b'i 420. The first B normalized eigenvectors, that is, the B normalized eigenvectors corresponding to the B largest eigenvalues, are assembled into eigenmatrix EB 424. The eigenmatrix EB is not required, by definition, to be a square matrix. The superscripts "T" in block 422 denote the transposes of the vectors as written.
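The whole recipe of Figs. 20-21 can be sketched in a few lines; stacking the N development vectors as rows of A and taking R as the average of their outer products is our assumption about block 408.

```python
import numpy as np

def eigenmatrix(A, B):
    """Sketch of Figs. 20-21: covariance of the development vectors,
    eigendecomposition, ordering by decreasing eigenvalue, division of
    each eigenvector by the square root of its eigenvalue, and assembly
    of the first B rows into E_B (not necessarily square)."""
    R = (A.T @ A) / len(A)                       # block 408: covariance matrix
    vals, vecs = np.linalg.eigh(R)               # ascending eigenvalues
    order = np.argsort(vals)[::-1]               # largest eigenvalue first
    vals, vecs = vals[order], vecs[:, order]
    return (vecs[:, :B] / np.sqrt(vals[:B])).T   # E_B: B x dim

# e.g. E26 = eigenmatrix(A_dev, 26) for the nonlinear processor-2 reduction
```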
Fig. 22 depicts the calculation of eigenmatrix E26 432 used in nonlinear processor-2 30 (Fig. 10). The eigenmatrix E26 is calculated using the calculation method described with reference to Fig. 21. The covariance matrix R 410 required for the calculation of the eigenmatrix is formulated from the development data as shown in Fig. 20. The eigenmatrix E26, containing the twenty-six eigenvectors associated with the largest eigenvalues, is then used to decorrelate the data relating to the incoming speech in block 240 of nonlinear processor-2 (Fig. 10).
Fig. 23 depicts the calculation of the mean values used in the fixed-parameter normalization-2 processor 244 (Fig. 10). The processor 244 normalizes the twenty-six data elements associated with the selected twenty-six eigenvectors. Thus the mean values of the elements in the N development database vectors corresponding to vector bn 242 are calculated.
Fig. 24 similarly shows the calculation of the parameters used in the fixed-parameter normalization-3 processor 256 shown in Fig. 11. The means and standard deviations for the corresponding N vectors en 254 in the development database are calculated, resulting in a vector μ 440 containing the calculated mean values and a vector σ 442 containing the calculated standard-deviation values.

Fig. 25 depicts marking of the speech. The segments of the development data input SPEECH signal s(t) are extracted to form a "window" into the speech, represented by vector s'n 444. The windows are selected to correspond sometimes to the time width of the receptive field matrices Un 206 (Fig. 6), represented also by the vectors hn 266 (Fig. 12), and sometimes to the time width of the overlapped triple, represented by the vectors pn 306 (Fig. 12), as discussed below. The former time width corresponds to 1184 data samples of the input SPEECH signal s(t); the latter time width corresponds to 1760 such samples. The extraction of the longer window is shown in Fig. 25. If the shorter window were selected, then the window would be formed by the 1184 samples centered within the longer window. The windowed speech is then associated with phonemes by a person listening to the speech, as shown in block 448. The listening person thus marks each such window as containing the particular phonemes he or she hears, if any.
The time width of the window selected by the listening person depends upon the number of phonemes heard and upon the clarity of the sound. Phonemes in the longer window often can be heard more easily, but the longer window often introduces more phonemes in a window, and hence more ambiguity in marking. The choice thus represents a trade-off of clarity of the speech heard against time resolution of the resultant labels.
If all the marking were done with the shorter window, then the labels would correspond to the time width of the speech used by speech-element model-1 264 (Fig. 11). The labels would be "matched" to this model, but would be "mis-matched" to speech-element model-2 322 (Fig. 14). Likewise, if all the marking were done with the longer window, then the labels would be matched to the second model but mis-matched to the first. Ideally, the labels always would be matched to the model in which they are used, and the person listening would generate two complete label sets. There is, however, great commonality in what is heard with the different window widths. In the interest of easing the burden of marking the speech, the listening person is permitted to select the window time width to best advantage for each label instance.
Fig. 26 shows the processing of the labels after they are marked by the person. If two phonemes are heard in a window, then they may constitute a pair that is mapped to a diphone label as shown in block 450. If only one phoneme is heard in a window, then that phoneme may be one of the unvoiced consonants which are mapped to isolated speech element labels as shown in block 452. If more than two phonemes are heard, then pairs of phonemes may be mapped to diphone labels and others may be mapped to single phonemes. In this last case, if the window is the longer one, the person marking the speech may select the shorter window and listen again to reduce the number of phonemes heard in a window. The mappings are done automatically after the marking is complete, so that the actual labels entered by the person are preserved.
The labels selected for marking the speech are shown in Table 1 (Fig. 36). These speech element labels are selected based, in part, on experience. For example, experience shows that some particular phoneme is likely to follow another. Some of these labels are thereafter refined to include an ordering and/or combination of the phonemes, for example, into diphones. The number of labels used throughout the processing is larger than the number of labels used in the previous speech-recognition system-1. Such a large number of labels is used because, unlike the previous system, in which a trigger mechanism is used to indicate the start of a phoneme and thus the start of the processing, the current system may detect a phoneme anywhere within the signal segment window, and processing may be begun, e.g., in the middle of a phoneme. Thus the system uses more labels to convey, after further processing, the context of the detected phoneme.
Referring again to Fig. 26, the labels attached to a signal segment are encoded in block 454 to form a label vector Ln 456. The label vector Ln 456 contains elements representing each of the ninety-four possible speech element labels shown in Table 1 (Fig. 36) as well as the new phoneme labels generated in blocks 450 and 452. The resulting vector has elements that are 1's for speech element labels heard in the segment and elements that are 0's for labels not heard. The label vector is then applied to the parameter development circuit shown in Fig. 27.
Fig. 27 depicts the calculation of an eigenmatrix E 462 and a kernel K 470 used in formulating the combined kernel K1 476 (Fig. 28). A covariance matrix R is calculated for the development database vectors gn 262. The vectors gn are the signal data representations that are thereafter applied to speech-element model-1 34 (Fig. 11). The calculated covariance matrix R is then used to formulate the associated eigenmatrix E following the calculations discussed above with reference to Fig. 21.
The vectors gn 262 are then multiplied by the eigenmatrix E 462 to form decorrelated, data-reduced vectors hn 466. The decorrelated vectors hn have 650 elements, associated with the 650 largest eigenvalues, as opposed to the 782 elements of speech data in vectors gn. Thus the number of parameters is strategically reduced and the most important data for speech recognition are retained. The retained information includes information relating to approximately 99.97% of the signal variance. Reducing the data at this point reduces the size of the associated kernel K 470 and also the size of combined kernel K1 to a more manageable size without sacrificing much of the information which is important in phoneme estimation.
Kernel K 470 is formulated using the reduced 650-element vector hn 466. Each row of elements, Kij, of kernel K is formed by multiplying the corresponding element of label vector Ln 456 by the elements of vector hn. The elements of label vector Ln 456 are normalized before the multiplication by subtracting the mean values formulated from the elements of the N label vectors in the development database. The kernel K 470 is used in calculating the kernel K', which is then used to calculate "combined" kernel K1 476, as shown in Fig. 28. Kernel K is first normalized by dividing each of its elements by the associated standard deviation value, producing K'. The normalized kernel K' is then combined with the eigenmatrix E 462. Combined kernel K1 is thereafter used in speech-element model-1 34 to assign preliminary labels to the incoming speech and reduce the data to a subset of likely labels.
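A sketch of the whole kernel construction, under our reading of Figs. 27-28: the accumulation over the database, the averaging, and the elementwise standard-deviation scaling are assumptions about details the text leaves open. Folding E in at the end makes K1 applicable directly to gn, since K1 gn = K'(E gn).

```python
import numpy as np

def combined_kernel_K1(L_dev, G_dev, E, sigma_K):
    """Sketch of Figs. 27-28: L_dev is N x 94 (0/1 label vectors), G_dev is
    N x 782 (development vectors g_n), E is 650 x 782, and sigma_K holds
    the standard deviations associated with the elements of K."""
    H = G_dev @ E.T                  # decorrelated, data-reduced vectors h_n
    L0 = L_dev - L_dev.mean(axis=0)  # mean-removed label vectors
    K = L0.T @ H / len(H)            # kernel K: one 650-element row per label
    K_prime = K / sigma_K            # normalize each element by its std deviation
    return K_prime @ E               # combined kernel K1: 94 x 782
```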
Fig. 29 depicts the calculation of eigenmatrix E33 506. The eigenmatrix E33 contains the thirty-three eigenvectors associated with the thirty-three largest eigenvalues. The eigenmatrix E33 is calculated in the same way as the eigenmatrix discussed with reference to Fig. 21 above. This eigenmatrix E33 is then used to select the data values representing the incoming speech which are associated with the thirty-three largest eigenvectors.
Fig. 30 depicts speech label vectors used in formulating a second combined kernel, K2 (Fig. 32). The set of labels, which are phoneme isotype labels, differs from that used in calculating K1 476 (Fig. 28) as follows: the preliminary labels assigned to the data in speech-element model-1 34, shown in Table 7 (Fig. 42), are mapped in blocks 508-510 either to diphone labels in Tables 2 or 4 (Figs. 37 and 39) or to isolated phoneme labels in Table 3 (Fig. 38), as appropriate. The mapping requires delaying the processing by one time unit in block 514. The delay aligns the labels with the center vector of the data triple formed in processor 36 (Fig. 12).
The labels are then encoded to form a 119-element label vector Ln 518.

Figs. 31 and 32 illustrate the calculation of the combined kernel K2 534. Using the label vector Ln 518, the kernel K2 is calculated in the same manner as the earlier described combined kernel K1 476 (Figs. 27 and 28): namely, a square eigenmatrix E 524 is calculated to decorrelate the data in the speech data vector tn 320. Then a kernel K' is calculated using the label vector Ln 518. The kernel K' and the eigenmatrix E are then multiplied to form the combined kernel K2. Kernel K2 is used in speech-element model-2 40 to reduce the data and formulate phoneme isotype estimates by associating the data with the 119 possible phoneme isotype labels.
Figs. 33 and 34 illustrate the calculation of parameters used in formulating the logarithm of the likelihood ratio (Fig. 14). The likelihood ratio incorporates parameters formulated from the development database and assigns likelihood values to the phoneme isotype estimates associated with the incoming speech. The estimates may thus be multiplied by adding, and divided by subtracting, after they are translated to logarithms.
Specifically, with reference to Fig. 33, the development data vector un 324 and the label vector Ln 518 (Fig. 30) are each applied to circuits 536 and 540. Blocks 536 and 540 calculate mean and standard deviation values for elements of the input vector un and accumulate them separately for instances when the corresponding elements in label vector Ln 518 appear in the development database and when they do not appear. Thus block 536 accumulates statistics for instances when the corresponding phoneme is not heard in the input speech. For each individual phoneme, these instances account for the vast majority of the data, since a given phoneme is usually not heard. Block 540 accumulates statistics for instances when the corresponding phoneme is heard in the input speech. Such instances are in the minority.
The resulting mean and standard deviation values (vectors 538 and 542) are applied to a de-rating circuit 544 (Fig. 34) which adjusts the data values to compensate for the resulting difference in accuracy between assigning phoneme estimates to known data which are in the development database and assigning them to unknown data. The mean and standard deviation values are adjusted by multiplying them by coefficients a and b which are the ratio of, on the one hand, such values averaged over all instances in a test database to, on the other hand, such values averaged over all instances in the development database. The test database is smaller than the development database, and the data in the test database have not been used in calculating any of the other fixed parameters. The test data contain fewer calculated phoneme estimates, and the estimates are assumed to be less robust than those associated with the development database. The coefficients a and b are thus a gauge of how much the likelihood ratio parameters formulated from the development database should be scaled for incoming new speech.
Referring to Fig. 34, the mean values are scaled using the coefficients a and b as set forth above. The de-rated values are then applied to circuit 545, which formulates polynomial coefficients for the likelihood ratio circuit 325 (Fig. 14).
After the phoneme isotype estimates are transformed into logarithms of the likelihood ratio, the phoneme isotype estimates are rearranged and consolidated in phoneme rearrangement processor 44 (Fig. 15) and estimator integrator 46 (Figs. 16-18). Fig. 35 illustrates the generation of the maps used in rearranging and consolidating the estimates. With reference to Fig. 35, a mapping matrix 554 is formulated for diphones, mapping the diphones to the constituent speech elements. Tables 2, 4 and 5 (Figs. 37, 39 and 40) contain the diphones and constituent speech elements. A second mapping matrix T 560 is formulated for mapping to a single label form the various labels representing the same speech element. For example, both the "r" and "R" labels are mapped to the "r" label. Table 6 (Fig. 41) contains the set of speech elements to which the various label forms are mapped.
Figs. 36-44, as discussed above, depict all the tables used in labeling phonemes. Table 1, Fig. 36, contains the labels with which the listener may mark the speech associated with the development database. While the notation assigned to the labels may be unconventional, the notation can be duplicated using a standard keyboard. Explanations of the notation are therefore included as part of the table.

The set of labels which may be used by a person marking the speech (Table 1) has been carefully chosen to cover the possible acoustic manifestations of phonemes within the listening windows. Thus the selection of vowels and consonants, in all the various forms exhibited in Table 1, is a set of speech elements that one could hear in the listening windows. The list of speech elements includes more than what one typically refers to as "phonemes"; for example, it includes initial forms, bridge forms and final forms of various speech elements.

Table 2, Fig. 37, contains diphone labels and their constituent speech elements. This table is used to separate a speech element estimate vector containing diphone estimates into the two appropriate speech element estimates. The table is used also to generate the combined kernels K1 and K2. Table 2 is also used, along with Tables 3-6, Figs. 38-41, in generating the maps for rearranging and consolidating the phoneme estimates in phoneme estimator integrator 46 (Fig. 35).

Tables 7-9 (Figs. 42-44) are tables of the speech element labels used in model-1 processor 34, model-2 processor 40 and phoneme rearrangement processor 44, respectively. Table 7 contains the labels corresponding to the elements of kernel K1, Table 8 contains the phoneme isotype labels corresponding to elements of kernel K2, and Table 9 contains the phoneme estimate labels which are applied to data which are manipulated and re-arranged to conform to the requirements of the word/phrase determiner 14 and the word/phrase dictionary 16 (Fig. 1). The labels shown in Table 9 are the phonemes which we believe best characterize the spoken words or phrases in general speech.
The sets of labels used in Tables 1-9 have been carefully chosen to optimize the final phoneme accuracy of the speech recognizer. Thus, the selection of vowels, consonants, diphones and isolated forms, while not a complete set of all such possibilities, is the set which is the most useful for subsequently looking up words in the Word/Phrase Determiner, block 14 of Fig. 1. The tables may be modified to include sounds indicative of subject-related speech, for example, numbers, and also to include sounds present in languages other than English, as appropriate.

HARDWARE CONFIGURATIONS

Figs. 45-48 depict system hardware configurations 1-4. The first configuration (Fig. 45), including a Digital Signal Processor (DSP) microprocessor 600 and a memory 602, is
designed for a software-intensive approach to the current system. A second configuration (Fig. 46) is designed also for a rather software-intensive embodiment. This second configuration, which consists of four DSPs 604, 606, 610 and 612, and two shared memories, 608 and 614, performs the system functions using two memories which are each half as large as the memory in Fig. 45, and DSPs which are two to three times slower than the 10-16 MIPS (millions of instructions per second) of DSP 600 (Fig. 45).
Fig. 47 depicts a system configuration which is relatively hardware-intensive. This third configuration consists of a 2-5 MIPS microprocessor 616, a memory 620 and a multiply/accumulate circuit 618. The multiply/accumulate circuit performs the rather large matrix multiplication operations. For example, this circuit would multiply the 119x843-element combined kernel K2 matrix and the 843-element vector tn 320 (Fig. 14). The microprocessor 616, which performs the other calculations, need not be a DSP.
Fig. 48 depicts a floating point system configuration. The system includes a 10-15 MFLOPS (millions of floating-point operations per second) DSP processor 622 and a memory 624 which is twice as large as the memories used in the other systems. The memory 624 is thus capable of storing 32-bit floating point numbers instead of the 16-bit integers used in the other three configurations.
Fig. 49 illustrates how the parameters, the development of which is shown in Figs. 19-35, are related to the processing system depicted in block diagram form in Figs. 3-18.
CONCLUSION
The present speech recognition system performs speech-element-specific processing, for example, in speech-element model-1 34 (Fig. 11), in between nonlinear processing steps that manipulate the data into a form which contains recognizable phoneme patterns. Performing speech-element-specific processing at various points in the system allows relatively large amounts of high resolution signal frequency data to be reduced without sacrificing information which is important to phoneme estimation.
If speech-element-specific data reduction processing were not performed at the appropriate places in the system, the resolution of the signal data applied to the nonlinear processors would have to be reduced to limit the number of parameters.
The present system thus retains important, relatively high resolution data for nonlinear processing and eliminates data at various points in the system which, at the point of data reduction, are found to be redundant or relatively unimportant after speech-element-specific processing. If the data reduction and nonlinear processing steps were not so interleaved, the system would be operating on lower resolution data, impairing accuracy.
The foregoing description has been limited to a specific embodiment of this invention. It will be apparent, however, that variations and modifications may be made to the invention, with the attainment of some or all of the advantages of the invention. Therefore, it is the object of the appended claims to cover all such variations and modifications as come within the true spirit and scope of the invention.
SPEECH-RECOGNITION CIRCUITRY EMPLOYING NONLINEAR PROCESSING, SPEECH ELEMENT MODELING AND PHONEME ESTIMATION
FIELD OF INVENTION
The invention is directed to speech recognition, and more particularly to those parts of speech recognition systems used in recognizing patterns in data-reduced versions of the speech. It is an improvement to the circuit disclosed in U.S.
Patent 5,027,408 entitled "Speech-Recognition Circuitry Employing Phoneme Estimation" issued 25 June 1991 in the names of the same inventors.
BACKGROUND OF THE INVENTION
Most systems for recognizing speech employ some means of reducing the data in raw speech. Thus the speech is reduced to representations that include less than all of the data that would be included in a straight digitization of the speech signal. However, such representations must contain most if not all of the data needed to identify the meaning intended by the speaker.
In development, or "training", of the speech-recognition system, the task is to identify the patterns in the reduced-data representations that are characteristic of speech elements such as words or phrases. The sounds made by different speakers uttering the same words or phrases are different, and thus the speech-recognition system must assign the same words or phrases to patterns derived from these different sounds. There are other sources of ambiguity in the patterns, such as noise and the inaccuracy of the modeling process, which may also alter the speech signal representations. Accordingly, routines are used to assign likelihoods to various mathematical combinations of the reduced-data representations of the speech, and various hypotheses are tested, to determine which one of a number of possible speech elements is most likely the one currently being spoken, and thus represented by a particular data pattern.
The processes for performing these operations tend to be computation-intensive. The likelihoods must be determined for various data combinations and large numbers of speech elements. Thus the limitation on computation imposed by requirements of, for instance, real-time operation of the system limit the sensitivity of the pattern-recognition algorithm that can be employed.
It is accordingly an object of the present invention to increase the computational time that can be dedicated to recognition of a given pattern but to do so without increasing the time required for the total speech-recognition process.
It is a further object of the invention to process together signal segments corresponding to a longer time period, that is, use a larger signal "window" without substantially increasing the computational burden and without decreasing the resolution of the signal data.
SUMMARY OF THE INVENTION
The invention provides a method of identifying a speech element of interest in a speech signal, said method comprising the steps of: A. generating a first vector each of whose components represents a component of said speech element;
B. comparing said first vector with a first set of model vectors representing known speech elements and for each comparison deriving a value representing the degree of correlation with one of said model vectors, thereby generating a second vector each of whose components is one of said values;
C. calculating quantities proportional to products of certain of the components of said second vector and powers of certain of the components of said second vector to produce a third vector with components that are quantities proportional to the products and powers; and D. comparing said third vector with a second set of model vectors representing known speech elements and producing respective speech-element estimate signals that represent the likelihoods that the speech contains respective speech elements.
The invention also provides a speech-recognition device for identifying a speech element of interest in a speech signal, said device comprising: A. means for generating a first vector each of whose components represents a component of said speech element; B. first modeling means for comparing said first vector with a first set of model vectors representing known speech elements, for each comparison deriving a value representing the degree of correlation with one of said model vectors, and generating a second vector each of whose components is one of said values; C. means for producing a third vector, said means calculating quantities proportional to products of certain of the components of said second vector and powers of certain of the components of said second vector to produce a third vector with components that are quantities proportional to the products and powers; and D. second modeling means for comparing said third vector with a second set of model vectors representing known speech elements, the comparing means producing respective speech-element estimate signals that represent the likelihoods that the speech contains respective speech elements.
The foregoing and related objects are achieved in a speech-recognition system that includes a phoneme estimator which selectively reduces speech data by a first modeling operation, performs nonlinear data manipulations on the reduced data and selectively reduces the data by a second modeling operation. The reduced data are then further manipulated and re-arranged to produce phoneme estimates which are used to identify the words or phrases spoken.
In brief summary, the phoneme estimator monitors the energy of a data-reduced version of the input speech signal and selects for further processing all speech segments with energy which exceeds a certain threshold. Such signal segments typically represent voicing or unvoiced expiration within speech, and thus phonemes. The phoneme estimator then manipulates a further data-reduced representation of the signal segments through a series of speech modeling operations, nonlinear operations and further speech modeling operations to calculate which phoneme patterns the data most closely resemble.
The speech modeling is used to reduce the speech signal data by ignoring data which, through experience, are found to be relatively insignificant or redundant in terms of phoneme-pattern estimation. The more significant data are then manipulated using computation-intensive nonlinear operations, resulting in data patterns which are used to determine the likelihood of the intended phonemes more accurately. The time required for such computations is minimized by so reducing the data. The phoneme estimator also looks at the time between signal energy, or phoneme, detections in selecting the most likely phonemes. Considering the time between the phoneme detections, the estimator may concatenate what would otherwise be considered a series of distinct phonemes into a group of multi-phoneme patterns, for example, diphones. These multi-phoneme patterns often convey the intended meaning of the speech more clearly than the individual phonemes.
BRIEF DESCRIPTION OF THE DRAWINGS
For a more complete understanding of the features, advantages, and objects of the invention, reference should be made to the following detailed description and the accompanying drawings, in which:
Fig. 1 is a block diagram of a speech-recognition system employing the teachings of the present invention;
Fig. 2 is a block diagram depicting a phoneme estimator shown in Fig. 1;
Fig. 3 is a block diagram depicting calculation of an estimation of the signal power spectrum, shown as block 19 in Fig. 2;
Fig. 4 is a block diagram depicting calculation of a reduction of the power spectrum estimation, shown as block 20 in Fig. 2;
Fig. 5 is a block diagram of an energy detect processor, shown as block 22 in Fig. 2;
Fig. 6 is a block diagram depicting a receptive field processor, shown as block 24 in Fig. 2;
Fig. 7 is a block diagram depicting an adaptive normalizer, shown as block 26 in Fig. 2;
Figs. 8 and 9 taken together illustrate a receptive field nonlinear processor, shown as block 28 in Fig. 2;
Fig. 10 is a block diagram illustrating nonlinear processor-2, shown as block 30 in Fig. 2;
Fig. 11 is a block diagram illustrating a normalization processor and speech-element model-1 processor, shown as blocks 32 and 34 in Fig. 2;
Fig. 12 is a block diagram illustrating a concatenation of vectors into triples, shown as block 36 in Fig. 2;
Fig. 13 is a block diagram depicting nonlinear processor-3, shown as block 38 in Fig. 2;
Fig. 14 is a block diagram illustrating speech-element model-2 and calculation of the logarithm of the likelihood ratio, shown as blocks 40 and 42 in Fig. 2;
Fig. 15 illustrates phoneme isotype estimate rearrangement, shown as block 44 in Fig. 2;
Figs. 16, 17 and 18 together are a block diagram illustrating an estimator integrator, shown as block 46 in Fig. 2;
Fig. 19 illustrates the calculation of parameters used in the adaptive normalizer of Fig. 7;
Fig. 20 illustrates the calculation of a covariance matrix R for calculating parameters used, for example, in the nonlinear processor-2 of Fig. 10;
Fig. 21 illustrates the calculation of an eigenmatrix EB, using covariance matrix R of Fig. 20;
Fig. 22 illustrates the calculation of an eigenmatrix E26 used in the nonlinear processor-2 of Fig. 10;
Fig. 23 illustrates the calculation of further parameters used in the nonlinear processor-2 of Fig. 10;
Fig. 24 illustrates the calculation of parameters used in the normalization processor of Fig. 11;
Fig. 25 illustrates marking of the speech signal;
Fig. 26 illustrates the determination of speech label vectors used in formulating a kernel;
Fig. 27 illustrates calculation of eigenmatrix and kernel parameters for further calculating of parameters used in the speech-element model-1 processor of Fig. 11;
Fig. 28 illustrates the formulation of a combined kernel K1 using the parameters of Fig. 27; the combined kernel is then used in the speech-element model-1 processor of Fig. 11;
Fig. 29 illustrates the calculation of eigenmatrix E33, used in nonlinear processor-3 shown in Fig. 13;
Fig. 30 illustrates the determination of speech label vectors used in formulating another kernel;
Fig. 31 illustrates calculation of eigenmatrix and kernel parameters for further calculating of parameters used in the speech-element model-2 processor of Fig. 14;
Fig. 32 illustrates the formulation of a combined kernel K2 using the parameters of Fig. 31; the combined kernel is then used in the speech-element model-2 processor of Fig. 14;
rr, y ~; ~ p ~~ -7 r ~ ,..
F~.gs. 33 and 3~ illustrate the aal.aulat_'Lon of mean and standaxd deviat~.on parameters whioh are used in aalaulating the logz~rithm of the likelihood ratio as illustrated in Fig. 14:
Fig. 35 i,lluatxatgs th~ gen~ration of tables for diphone: _ and phoneme maps whioh ar~,used in the phonem~a estimate rearrangement dapioted in Fig. 1~:
Fig. 36 is a table a! lab~ls used in marking the speech ag illustrated in Fig. 25i Fig.,37 ie~ a table of diphone and aonatituent phoneme labels to ba used in the parame'~ex~ aalculatians of Figs. 26, 30 anc~ 3aE
Fig. 38 ie a table of isolat~ad forma of phonemes us~rl in th~ parameter ~salaulat.ions depicted in Figs. ~G and 30.
Fig. 39 i~ a tabla~ of diphonea and oonstitu~nt phonemes used in the parameter valaulation~ dep~.oted in Figs. 30 and 351 Figs . 4 o and 4 ~, ar~ tables of diphane~ and aona~titu~snt. .
phonemes to bc~ used in determining the parameters depicted in Fig. 3~t Fig. 42 is a tabl~ of s~ae~ah element labels used ire speech-~1~ment model~~,~
Fig. 42 is a table of phoneme 3sotyp~ lab~sls used in s~peeah~sl~m~nt modal-~2:
Fig. 4~ ie a taD~le vt plar~n~me labels used in p~hvn~ame re-az~rang~a~ent prvar~aaor 44 of Fig. 2;
fig. 45 is a block diagram of as hardware configuration of the speech res~ognition system of Figa. 1~2~
Fig. 45 is a block diagram of a second hardware configuration of the ep~ae~ala~reac~gn.itian gyhtem of Figs. 1~2;
Fig. 47, is ~t block diagram of a third hardware configuration of the speraall~reaognition system of Figr . ~,-2 ;
Fig. 48 ~.s a block diagram of a Eourt3a hardware confi~guratiori of the epeea~nmr~e~ognitioax system ~f Figs. ~.~2:
and O 8 . 1 5 . 9 O 1 1 : 4 8 A 2~/I m N U T 'I' E I~ ~J(, c C. 1 E I'.1 t7 ~ 1~7, ~ I = H F' fj r Fic~. 4~ ie a t~ab~.e explaining th~ rolationahip between the processing asystem tigure~a, Figs~. ~ - xs, and the parameter d~velopment figuxes, Figs. 19 - 3a. .
~~ ~,a~~ ~~ ~z~TZON ~F err mz,~u~~,TwE ~zsonx av~~va~w This specification describes, with reference to Figs. 1-is, a pxoaessing system Eras x~~aogn~.xing speech. The parameters used in calculations performed by procassox~s in the processing system and their development area c~eaara,becd with re~e~rence to Fags. ~.9-35 and th~ various tabl~s shown in Figs. 36-44.
Hardware configurations of the pray~ssing system are describ~d With ~'efer~nc~ to P'iga~. 4545.
With.rdferanc~ to Fig. Z, a speech recognition syat~m ~.0 ina~ludes a phoneme estim~at~r 12, ~a wArd/phrase determiner 14 and a word/phraae library 1.G. The phoneme ~stimator ~.2 receives a SFEE~H inpu~~ signal from, for e~e~mple, a microphone or a talsphc~ne line. The phoneme estimatpr 1~ senses the ~n~rgy of the SPh~7ECH input signal and determines whether the energy exceeds a pradetermin~ed threshold. If it does it ~~ndicsateB that sp~m~ch, and thus phonemes, are presewt ~.x~ the ~~~i~~~ signal, r Th$ ph~3n~x11~ ~St3.m$ta~" ~.~ t~'l~n C'.idlc'l,~l~tEl~
oorxeepon~iing phonem~ estimates, that is, a group of output signals, ~ach of which is an estimate c~f how Xikely it is that the 9PE~CH signal aanstitutes the phoneme a~asociated with that autput. ~°he estimator also o~nloulat~s th~ time 'between phan~ms det~ct3.ons, that i~, delta time.
The delta tim~ values and the estimates are appliet~ to the woxd/phraea dat~rminer 14. The. waxd/phxage det~rminer ~4, using thca Limn and estimate valu~s, aonsulte the word/phrase~
library 16, which aonta~,ns Wards and phrases listed in terms of ConBtituent phonemes. The word/phrase determiner 14 than O 8 . 1 5 . J O 1 2 : Z O P R!I >x ~ U T T E R Ivg c C L, ~ IvT I~7 E I J, 1F
I =, H y O 1 ! G'1 ~ ~ ") i ~, ~ ~d z3 '_t ivd =~:
asa~,gns a word or phrase to the aPEECH signal and txansaribe~r the ~ap~ac~h. Tho output of the word/phx~ase determiner 14 may tako othor formm, for examplce, an indication o~ whioh o~ a group of possible ~xpeate~d answers has been spoken.
Th~a dotaila~ of the word/phraae det~armiriar 14 will not ba set forth here because the apeaific way ~.n which the phoneme estimates are further processed is not part of the present invention. However, it is of interest that the word/phrase determiner 1~ determines the m~aariing of the Sp~~C~i input signal based strictly on the phoneme estimate values an~i the delta time values produced by the phoneme estimator l2, rather than on ~a more primi.~,~.ve form of data, for example, the :raw speech or its frequanay spectrum.
F'ig. 2 is ari overview of the phoneme estimator 22 shown in pig 1. gt should be noted at this point that the drawings represent the various processes as being performed by separate processors, ox blocks, as they oould be in an appropriate hard-wi~ed s~rat~am. This segregation into separate pra~cessc~rs ~acil~.tates the d~aoription, but these skilled in the art will recognize that moat of these ~'unotions will typically be performed by a relatively small number of common hardware elements. Spsci~iaally, ms~st oil the steps would ordinarily be~
carried out by aria or a very small number of m~.croproar~ssors, Referring again to ~'ig. 2 the phoneme estimator 12 reo~ives a, raw sp~~c~ signal and praaassas it, reducing the data through power~apectrum eatimatian in bloaDt 1~ and power spe~atrum reduct~,on in block 2tD ~a$ deaaribed in more detail below with rsf~xer~ae to digs. 3~~. The data-r~ac~uced signal is applied to broth an energy ~ieteot processor 22 and a receptive field praosssor~2~.
If the energy in the data-reduced signal is abov~ a pres~ete~mina~d threahol.d, indicating the preaenc~ of s~aeeoh, the energy detect processor 22 asserts a I~~T~CT signal on line ~2~.
r~ a . i 5 . ~ 0 1 ~ : i 0 ~ x.~ m ra zi -r ~r ~ ~: we ~ c z. ~ r7 rr ~ rJ, F
z ~ x ~ r~
~~ :~.~
Ths~ asserted DETECT signal energixas they receptive field proaesaor ::4, which then further processes the data, agse~mbi~.ng a receptive field. :Cf the signal energy is below the threshold, the DETECT esignal i~t not ;~saerted and the reo~ptiv~e fi~sJ.d prooassar 24 is not energixac~, prohibiting further processing of the SPEECH signal. The energy d~tect processor ~2 and the raae~ptiv~ field processor 24 are described in more detail below with refereno~a to Figs. 5-6.
Det~gting the preganae of phonemes in 'the received speech using the ~nargy processor a~ replaces the two~pa~th processing performed by the speech~recognition system desarib~d in ao~
pending appli~oation entitled "SpeeahbRacognitian circuitry employing Phoneme E~timatton" of which this ie an improvement.
Th$ eaxliar system, which is hex~inafter referred to as speeoh-x~oognition system-I, examines ~.he speech signal and dataot~s the pros~nae of either initial oont~onaxWs or vowels in one processing path, and the presence of final ~onsanantg in the ether proaassing path. Depending on which path produaes~ the det~at signal, the speech signal is further proaesged by a vowel, initial consonant, or final oongonant proo~ssor. Thus the speech-r~:aogn~,tion system-I raguiras th.raa ret~eptive field processors, each processing the speech signal to match it to a a~3bset of phonemes, instead of the one used in the present system. The present system, through anhanaed modeling and data reductions is able to compare the signal rapras~ntation with 'the full set of poss9.bl~ phdnemas.
Ref~rring agail~ to ~°ig. 2, when a ~ETEC'I s~.gn~ti asserted on line 22A, the energy effete~t proaeseor ~~ also produces on line ~2~, a signal proportional tn 'the integrated energy of the speech signal as described in mor~ detail b~low with reference t~ Fig .5.
The integrated energy signal is applied to an adaptive noraie~al~.zsr Z~ which also rgoaiv~ss '~ha output of the receptive 1 5. 90 7. 2 : 1 0 f'M :xT$'LTTTER~c GLFx7TJEx~T, F I -;F-I ~r>;3 ~lo~.
field processor 24. .The int~grated ~nergy signal is used by the adaptive naxmalizer 26 to impoad z~ ascend, higher energy threshold.
The ada,ptiv~ normalizex 26 r~moves an ~stimatsd mean from the data, that is, Erom the output a~' the receptive field procas~aor 24. The estimated mean is incre~mentaily updat~d only iii tote int~gratrad energy level o~ the data ie above the higher predetermined energy threslar~ld, signifying a speech signal with a relativ~ly large signal-to-no~,se~ ratio. ~'hua the adaptive normalizes 26 do~s not update 'oho estimated mean if the integrated energy ls~rel a~ the data ~,a below the thr~ahold since the ~stima$e~s may not #~e arsaurat~ in such oases. The egfect of th~ oparatiana of the adaptive noranaliz~r 26 can data with a high int~gratad anere~y signal is to apply to the data an adapted mean which dccaya ~s~ponentially over a lung "time constant".
the time constant is d~.fferent for different aituatians.
Speai~~.cally, tho time aon~stant in this sass is measured, not in time iteal~, but in the numla~r o~ instances input wectore ar~ applls~d to the adaptiv~ normalizes. A ~.arge number signifies that a particular ag~eaker is continuing to speak.
Thus, thc~ cha~caatariatias o~ the apeaah and the aaaoaiat~d audio channel, should raa'~ Change drastically fear this epe~aCta.
Accordingly, a ~.~ong time oonatant may ba used and the mean of the data aasociat~d with this apaaCh Can ba reduced to near ZBr~.
~onveraely, a small number of inatanCes in ~ah3ah the input.
vectors are applied to the ad~a~ative nox~snaliz~r ind3uates that a new spsakgr may be starting to speak. Thus, the aharaateriatica of th~ speech ane~~or the audio ahann~al are rest yet known. accordingly, a relatively short time constant ie us~d, and the adaptive average is quiokly ad~ustad to redoes the mean of t.he~data as alo~a to z~ro ae possible. ~lae f~ i3 . 1 5 . ~ (J 1 c: : 1 C', P 2V1 ~ N U T T E R M a G' L., E P.J t7 E DJ, F T ~ H F' rJ 4.
i -13.~
adaptive average is adjusted to accommodate the new apaaker~s pronunciation of the various sounds, for exempla, and also to accommodate the dif:~ez~snaee in the ~aoundm due to the quality of the audio channel. The operation of the adaptive normalizer is disr~uest~ad in more detail below with refer~nce to Fig. 7.
The normalized data are next applied to a rya~ptive field nonlinear proc~aaaeor a8 and thar~aafter to another nonlinear processor-2 30.~ The nonlin~ar proaeeec~re 28 and 30, described in mere detail below with referen~e to Figs. a~s and lo, respectively, manipulate the data, and each paes~es linear first order data tex~m~s and nonlinear second, third and/or fourth order terms. These terms are~then passed to normal2~ation processor 32. The normalization proc~a~eor 32 normaliz~ae th~a data and app~.~,~a them to a fzret of two speech-elemont models.
The normal9.zation processor 32 is deeoribed in more detail below with reference to Fig. 10.
The ape~eeh-element model-~1 proaeegox 34 reduaas the data appli~ad to it using parameters, that is, selected spe~aoh labels, formulated from the development data. The ape~ah-elemen'~ model-1 processor ~4 thus sal~atg for further procea~aing only the moat significant data. xhe reduaad diets, whioh represent figures of merit xelated to the likelihood that the speech contains reepec'Give speech e~.aments associated witty the components, are then canaatenated into triple vectors in block 36. Each input vector applied to processor 36 induaas an autput, which ie formed, usually, ~x~om the input vector, the previous input vector and 'the eubmetiuen~t input vector. Tha output may in the a~.tarnative be Formed with zero~~iller vectors, the choice depending upon the delta time signal 2zC
from the energy detect processor ~2. The use of the r~ubsequent input vaster induces ac delay in prooeeeor 3s which is described in mare detail below with reference to F'ig. 12.
~'he triple ve~tore axe then applied to a third nonlinear OB. 1S. 9O 12: 1O PTA %khtUTTERNiaCLErJrJE2~7, FISH PO5 2~~~ ~~%~
-i2-praa~ss~sor-3 38. The nonlinear prr~aassor--3 38 manipulates the data by Computa~tionwinte~neiwe, nonlinsar operations and 'then applies the data to a second speech-~el~ment model-2 proaessox~
40 which produvee estimates that th~~ apseah contains the speech elements (which wa later xefex to as phoneme isotypes) listed ire T2~bla 8 (Fig. 43) . '~ha sperach-element model processors 34 and 4a are described in more detail below with reference to Pigs. 11 and 24, respectively. The nonlinear processor-3 3a is described , in more detail below with r~,~erence to ~'~,g. a3.
Thereafter, the estimates: are applied to a logarithm proasesor 42 which aalau7Late~s the logarithm of the like~.~.hood ratio for etch. The ecstimatas acre then further simplified, that is, re-arranged and integrated, in pros~seors ~4 and 4s to ready the data Ear tho word/phrase datexminer 14 (Fig. 1). The a~impligier~ eatimat~ss and the delta time signal 22~ from the energy detect prop~saor 22 (Fig. 2) are then applied to the ward/phraae de~term3ner ~.4 whicsh assigns ws~rds~ or phrases to ~ths ~spe~ah. The various pros~saaors 42, 44 arid 46 are deeaGxi,b~d in more detail below with reference to Figs. 14-18.
PHONEME &~ROCESSINO ' Referring now to fig. 3, the pawer spectrum estimation processor 18 calculates a power spectrum ~acstimate ref the HPEECH
signal by f~.rst~ aonvsrting the analag SPEEC~T signal to a digital repreeenta~tion in an Analog-to-~ig~,ta1 (A/r~) converter The A/~ ~onvarter, which ae of eonvsntional design, samples the SF~EEC~I signed a't a rate of 8-kHx and produa~s Z6-bit digital data symbols, an, representing the amplitude of ache signal. The 8 k~3x sampling rate i~ consistent witty the current t~lephone industry standards.
The d$.gital delta samples, a~, are then segmented into ~eg<aences of 128 data samples, as illustrat~d in block 102.
C1 8 . 1 5 . 9 C1 1 ~ : 1 ~_J P NL %ts 3~J U T T E R M a C L E IV ~ J E nJ ~ F
Z ~ H F' C'a f~
~'~"°~-~
~.~~3~~i Each of these aaa~uenoss aarraaparids t,d a 12-m~.lliae~aond ~segm~nt of the SPEECH signal. The ~aeque~riaee can be thought, of a:a vectors bm 104, with oath voator having olEmentg bk~m. The bm vaators overlap by 32 data samples, and thus each bm veotor contains 96 new ~alemants and 32 s:lemante from the previous vector. Next the mean,. or 1~.G., value of th~ signal eogmont r~presented by the bm vector ins romav~ad in blank loG, producing a vector c~ 108. The mean valuo oonveys information whioh ie of littlo,or no valuo ~.n phoneme estimation.
R~f~rring again to ~'~.g. 3, tho vo~stc~r Cm 108 is applied to a 128-point discrete Fourier Transform (i3FT) circuit 110. Up to this paint, the pow~r spectrum estimation proaes~ is similar to the gpe~ch-element praproar~ssor of tho speech-recognition system-I. Haw~ver, in ord~r to increase the resolution of the rosuits of the DFT, the current system p~rformc~ the DFT urging lz8 data olem~nts as appaaed to the system--I, which uses 64 data ~lAm~nt~ anll ~1~ ~~r~~~ o The 12s distinct olomonts appaiad to the pFT circuit aro r~ai and thus only sixty--five of tho izs, mostly aamplea~, output valuos o~ the DFT, dk,m, repra~ont non-redundant data.
Tho power spectrum is thus calculated by multiplying the DP'T
values dk,~ by their ro6poctivo complex conjugates d*k,~, to produces carx~eepandirig r~ai values, ek~~. The eis~ty-five non-redundant valuos aro retain~d in a vector a~ 114. The data era thus reduced by onr~-half whilo tho ~.nfarmation b~liaved to be the most important to phoneme estimation ie retained.
Tho power spectrum valuos o~~~ are appi~.ed simultaneously to a 'won Mann window's circuit 11S and a band-limited ~norgY
circuit ~.a~ (Fig. 4) . The von tiann window circuit o~smooths'r the epeotrum in $ conventional manner, roduc~.ng the sid~lobes that result from truncation in the time domain.
The smoothod voatar fm is applied to laloc~k l2tD who'ro eari~eus o~.~ments, Ekr~, of vect.o~C f~ are combined, producing a C1 B. 1 5. JO 1 2 : 1 O PM mF~IUTTEI2~r. CLEh7h7Eh1r F I =;H F'~,'7 strateg~.cally Y~duGGd VCCtdr gm 122, ThB reduC~d V~actor includes 'germs from a fraguanoy range of 2.8.75 Hz - 353.75 Hz. ~hia range corresponds to s~.gnals~ reGeivsd using telephone line communisation.
The band~l.imited energy hm from circuit 118 includes the ari~Ygy within the same fr~quenoy ranges art that seed for the vector gm 122. ~h~ prewxous speech-xeaognition system-I used ari energy t~rm that was not band-limited iri dais fashion, but inst~ad was the average power of th~a entire gpeatrum. using the average power introduced soma noise into the energy which was not derived from th~ speech itself.
~ha band-limited energy value, hm, is aonaatoriated with the vector c~~ 122 in circuit 124 to form a vors~tar pm lxe. Thus vaster pm contains a data~reduGed version oS~ frequ~nay and ~nergy ~,n~ox7mation representing, for the most part, the center band frequ~nci~s of the SF~EE~H signal. F~eduainc~ the data in thin way retains information of particular value for forth~r computations while reduairig the data to a manageable size.
~'he phoneme-idsritific~~ltion information probab~.y resides in the relative, rather than in t;ha absolute, sixes caf variations of the indiv~Ldua~. elements pk~,~ of vaator p~ 12f>. Avvord~,ngly, as iri the speech reac~griition system-x, tt~e elements pk~m, which are all pos~.t~.vs or zero, are inarement~d by one, and the ~,ogarithms of the results are Gomputed,as indicated ~.n block 128. incrementing the vector pm elements by ori~ ensures that the resulting logarithm values are zero or positiv~ (~.og~~.
0) » ~'he resulting values c~x,m axe than applaed to energy detect processor 22 and receptive Geld processor ~~ (~a,c~. 6), f'ig. 5 dspiotet the energy detect pros~ssor 22 iri block diagram form. the energy Gompor~ant~ of vsator c~ 130, e~.ement q~~~, is integrated over a three time unit time ssgm~nt iri 3.ntsgrating circuit 132. Each time unit is 1.~ m~.17.~.seas~nds ~,oi~c~, ae dieaussed above, and thus the gnargy is intsgrat~d O t3 . 1 5 . 9 O 1 2 : ~ p T' 2,n ac $~y U T T E R ~ o C L F h1 I~.7 ~ N, ~' I
~ H F~ fj F~
over 36 milliseconds. T~ the integrated energy, rm, ~xceeds a predetermined threshold a dataator 134 asserts a DETECT signal 22A, sm, iridiaating the pxasence oP speech. The DETECT signal, sm, may bs asserted at most ona~:a~v~ry three time units, as shown in block 134, sincr~ the subscript, m, of the energy parameter rm must be zero in modulo three arithmetic.
Each time the DETECT signal 22A is asserted, block 3.36 produces ~a delta time signal (gym) aorraaponding to the time betty~an this DETECT signal and the previous one. The delta time signal is applied to x~n .interval e~ctractie~n cirouit 13a, which produoas a time signal. ~" 2ZC. An associated energy axtraation circuit ~,4~ produces an integrated energy signal, t~
228. Hoth the ~~ and the tn signals correspond to th~ SPEECH
signal ~iv~ time units aarll~ar, as diec~usaad below with raf~ranca to F'ig. 6. . The parameter index has changed from ~,m~~
to "nn t.c~ ~amphasi~~ 'that tkl~ extracted delta time and integrat~d ~nargy signz~ls era produaad for only.cartain s~gmants of the ~pEECFi signal, that is, segments for which a DETB~T signal is asserted. .
The DETECT signal 22A is appliod, slang With the vaster corn 130, to th~ receptive field p~.'orie~lsor 24 shown in Figs 6. The intagrat~d energy signal 22B is applied to the adaptive normalizar 26 sh~awn in Fig. 7. The delta tim~ signal 2~C is applied both to the fox-matian of ~,riple vectors in processor 36 as shown in gig. 12, and to the eavimator int~c~ra~or 46 as discussed below with r~farsna~ to F~c~s. a.~ and 3,~.
R~sfarr~.ng now to ~'ig. 6, the nETECl' s~.gn~~. 22A energizes a r~~~ptiva fla~.d axtr~c~tion circuit 2a0 which aas~mbla~c a racap~tiwa field 202, that ia, a group of vectors containing frequency inf~rmation covering a signal segmon'~ 12 'time units long. The DETECT signal cGrrst~pands to a signal segment in the middle of the receptive field, that is, i~t~ a s~.gnal segment a time units earlier, or to the mp5 column ~.n the raaapt~.va field o ~ , i 5 . s o i 2 : 1 o p xvx ~ r~ a ~ T ~ x5 a~a a c L E rr rr F rJ, F x :~, ~ g r~
matrix 202. The delay is xamaes~sary to synchronize the delta time and int~agrated energy aignalo producsed by the onergy d~teot proc~,~ssor 22 (Fig, a) w~.°~t1 th~a racept~,ve field, centering as aloaely as possible thra signal segment for which t.h,~ DETECT signal is assert~ad. The receptive fields are xe~.atively large, 12 time units, and thus no information is Xost in limiting the D~T1~CT s~igraal to at most one every three time units.
~1n averaging circuit 204 averages pairs of adjacent veotorc~ of the reoeptive field matrix 202, that is. elements axed c~9~m.10 are averaged, elements t~osm~g and c,~~m.8 are av~raged, ~ta. This op~ration r~ducas data by ons~-halt, producing matrix U~ 206. The parameter index is aga~,n changed from "m°' to "nn to emphasize that ~r~captive fields a>id ~.ntegrated energy signa~,s are produced for only certain segments of 'the ~pEECH sigatal.
The sspesch-recognition system-I diaaussed above reduces the data by two-thirds by averaging the data over thxee time units. The reduced data are thereafter aubjeated to nanlin~ar processing. Using the currant cyst~m, however, better raaolut~.on ~.s achieved by avexaginr~ the matrix elements over only two time unite and r~taining more data. The ~oextra" data may be retaitaed at this point in the proc~sa because of enhane~ed data reduction within the reaeptivs field nonlinear px°ocessor 29, discussed below in connection with Figs. 8 and ~.
Ths matrix Ltn a0s, is neart ag~liad to the adaptive normaliz~r 26 Shawn in Fig. 7. Th~ adaptive normalizsr 2~
produces a matrix vn 210 by aubtraa~ting a fiat~d~-~oarameter mean ~a~9 sand then dividing by a fixed~~aramo~tar standard deviation, ~r',. Th~ fined-parameter mean and staa~dard d~,viation values are calaula'ted f~COm the development database as described below with r~afersnas to Fig. 1.9.
If the statistics of th~ inac~ming sF~EECFi signal are O 8 . 1 5 . 9 O 1 2 : 1 O P ZvI m N'U T T E R ~ c C; L ~ h71w7 E ~7, F j y' H
F' 1 _17_ sufficiently close to those of da~Ca in the development database than th~x "nc~rmali~~ad~~ matrix V~ 2~.0 has a mean close td zero and a standard deviation close '~o one. However, it is likely that the statistics of the incoming SPEECI~ signal era somewhat diff~ren~. from those of data in the development database.
end~ed, individual voic~ samples from the development database may have mtatistics that are different from those in the aggr~gate. Hence, for an individual SPEECH signal, we expect that matrix Vn will have a mean different from zero and a standard deviatirrn different from one. Accordingly, further as~aptive norma~.i~at~.on is applied in the circuitry of Fig. 7 in cxrder to allow at least the mean 'Go decay toward zero.
xf the matriu V~ 210 data correspond to a SPEECH signal B~gment for which th~ integrated energy, tn z2H (Fig. ~), is abave a predetermined value, indicating a high signal-to-noise ratio and thus voicing, the data are further proaesaed by calculating Chair adaptive average in blocks 212-z~,s and then subtracting tha~e~verage in block 220. .First, the data are averaged over time, that is, over th$ matrix rows, in averaging circuit ~2~.2 producing vector w" 21~. v~actor w~ thus contains only the signal frequency information. This information adequately aharaaterizes they speaker's voice and the. audio sshannel. These characteristics should not vary significantJ.y ~ver time, particularly over the time coxresponding to the matrix data. Averaging' the data over tim~ thus reduces them fr~m ld~ parametors, that is, 5.05 elements of matrix V~, to twenty-~ne parameters, that is, twenty-one elemenfis s~~ vector w~.
The elements of veotar wh Z1~ axe applied to exponential averaging circuit 216. Adaptive averaging circuit 27.6 thus Gampareg 'the int8grated energy, tn 22~, calculated in energy detect processor 22 cfig. a) with a predetermined threshold value which is higher than the det~ct threshold used in the O 8'. 1 5 . 9 CJ 1 c : 1 U F' IVI he T d U T T F R. ~ c C L, E Iv7 I 7 ~ N, F
I S' H F' ~, -ls-energy datavt processor a2. Thus averaging circuit 216 detects which signal aegm~nts hava~ h~.gh signal-to-noise ratios, that is, which segments have significant voice components.
Tf ths~ iritagratad energy do~g not exceed the "voias'r threshold-value, the adaptive average vaCtor x'~ 218 rema~,ns what it was at the previoum instande, x'~~~. In this case the sxpon~ntial av~rage is subtracted in k~~lock 22o as b~fore, however, the average itself is not changed. Signal segments with energy values below ~.he voice-threshold may correspond, on the one hand, to unvoiced fricative or nasal phonemes, but may also correspond, on ~th~a athsw.hand, to breathing by the speaker or other cyuiet no~,s~s, pax°ti.~sular7.y at the and of breath groups. Such low-energy signal segments may not be re~.~.~ab~.a in ohmracterixing the mean of V~c~tor W~ 214 ~Or purposes of xeaogniaing phonemes.
the exponential averaging is performed using a time period which is relatively long with raspaa~t an individual phoneme but shoat wham compaxed with a serieas cf words or phrases. The averaging thus does not. greatly affect th~ data r~alating to a single phon~me but it do~s redue~ to near zero the mean of the data xalat~.ng to words ox phrases.
mhe time period us~ad varies depending on the length of t~.me the system has bean proeessing the speech. Spaai~iaa7.ly.
exponential av~raging is performed over sith~r a short time period corresponding to, for example, loo receptive fields with sufficient energy m approximately 3.6 seconds - ar a long~r time period corresponding to, for axamplg, 30~ receptive fields with sui~ficient, energy - approximate7.y 10 seconds - the length c~f' time depends an the number of times t~ha integrated energy s~,gnal 2z8 has e~tcsadad the Voice-threshold, that. is, the number of times that t~ ~ 25. ~ha ~ahoxtor t.ima period is used when the system ~naountars a new spr~ak~zr. It thereby adapt~a ~uia%ly 'Go the speaker's eh~craat~ristics and the 08. 1 ra. 90 1 2 : 1 O PIVL ~NUTTEPt/$c CLENI~7E2~7, y I ,H y j, ,v -.~
.~x.9_ raharaataristias of th~ audio ahanne~l. ~hsrea~'ter, the system uses the longer lima period to proaass that s~pr~aker ~ s spaaoh because his or her voit~a tsh~lrar~teri~stias and those of the audio ohannel are assumed to remain r~lativa~,y oonstant.
once the csalaulatians Ear the adaptive av~raga vector x~~
21~ are completed, the adaptive avar2~ge vector is subtracted from the matrix ~h 21.o elem~nta in blank 22o to proc9uca matrix x~ 222. the mean of the data in matrix X~, GdrrBE~ponding tp a long time period and representing speech signals that include voicing, is now o~.oao to Sara. Matrix Xn im ttaat~. t~pplied to raaeptive field nonlinear processor 2~ shown in black diagram ~arm in Figs. ~ and 9.
In comparison with the oorroa~p~nding nonlinear processing described in our previous application, the nonlinear processing of Figs. 8 and 9 calaulat~s Eew~r nonlinear al~ments. ~s a conseequenae, w~ have b~en ab~.e to retain more data in ~arlie~r ~procassin~ arid thus supply high~r~r~solution data to the t~alaulation of the more-important nonlinear produots that w~a do caloulat~.
with r~egerariae to Figs. a and g, the alemants of matri.x Xn 222 are combined as linear terms and also as specific, partial outer products in blpoks 224~23~. ~ssentia~.ly, the J.in~ar terms and the partial c~ut~ar produc~t~ ar~ added over the time dimenaian of th~ r~ceptive field. xhess specific products arse designed to canv~Y o~rtain intorrnation about the speech signal wt~i~,~ significantly reducing the data from wha~G thoy would b~
iE a straight outer product, that is, all products of distinafi mat~rix~element ~~irs w~re calculated. ~'ha ~arlisr spseah-reaognition system-z calculatoa a straight outer product at 'this paint in the processing, mnd thug, data aura required to bs significantly reduced during prior processing. 3'he current systatn, era the other hand, may retain mare data up to this point clue to this roox~lin~ar ~srace~s~~ing step, and it thus OR. 1 ~. 9O 1 2 : 1 O PM >xI~IUTT~R,NLc, CL~NN~h7, 1~ I SH P 1, maintains better resolution caf the inoaming data.
Th~x recaptivsa-field nonlinear proe~essor 2& pa~oduaes fbur vector groups. ~aoh vector group oontains v~otors y~~, z°,n~
and x~n and is assaaiated with a diffex~ent time delay, Th~ y~~
vectors c~onta~~,n seta which era a linear ar~mbinatiosa of the terans used in forming the two associated '~z'~ veatars. Th~ z'~,~
vectors contain the results of combining certain partial. outer products form~d using the energy, or first, terms in the variaus matrix Xn 222 aolumns, and the z~~ vectors contain the resu~,t.s of specific partial outer products f~armod using the ndn~e~nargy. or frequency, terms of the an~atrix x~ co~.umns. The ~mrmation of aa~ah aE these vec~tc~rs is discussed below.
significant time averaging 3s performed in reaGptiv~a-field non~.in~ar processor Vie. It is assumed that a phoneme is °'atat;onary~~ within a receptive field and thus that th~a location c~f a given frequency ~c~alumn within the receptive field does nr~t convey muoh us$ful signal infermation. However, th~
nonlin9ar combinations cf frsc~usncy columns, averag~d over the time window of the rac~apt~,va fia~.d, do represent information which is useful ~'ar speech reaognitian.
As set forth above, a vector group ie formed for each ~f four t~.me~-diffarsnae segm~nts. vector groups for further time d~.ft~er~nees ar~ not aalculat~sd because information r~lating to variar~aas near greater time dif~aranoes appa~ra o~ l,ittla value.
spaoi~E~.aal~.y, the v~s~tor gx~axp for a time diff~arenae of zero ~a = O) is farmed in k~loolCS 224-228 of Fig'. 8. sl~ck 224 praducas the first ~alemeni~ of the 'veatcr yaon by adding together the first aiamar~ts ia~ a~.l of the columns of the matriX X~ 222.
xt produces the second vector e~.~ment by adding tcs~sthar the second el~m~nts in all the columns, and sa on. Accordingly, thg vector y~ has as its elements the matrix data summed cover time.
L~1 8 . 1 5 . 9 C1 1 2 : 1 CI P T~4 m ~ U T T E F: M c r' L ~ ?'J D3 E h7, ~ I
, H y y, 4, 2~~~~~~
The a~aonc~ vector in than veaatar group, vaatvr z'o~~ is formed using the matrix enorgy terms, wh~.ah are the fir~t elements of the columns. Block Z26 forms, ~r~r Gash Galumr~, the product of the energy term and all the ether elements in the same column. The products are then summed to farm the elements ' of vector z'o~n. The vecter al~m~ant~a art thus the energy praduata summed over time.
The third vector in th~ ve~atvr group for the time diffe~renGe of sera, so,n, is formulated in block 228. ~hia block forma all the products among the matrix X~ 222 fxer~ttency elements, that ia, among all the elemerita in a column except the first. Ones Cauld here use the outs~r product, taking all of these products separately. Instead, a sum is formed of these products which is like that in an autoaorrelation. This sum is aallerl a "self-praduat" in block 228 ainaa it ~.~ f4xmed from within the frequency elements of a, aing~,e aalumn. This eelf-produat is then summed through time, i.e., ovex all the columns. Caking the self-produota within the frequanay aolumns instead of the full outer product strategically reduces the output v~ator from what ~.t~would hav~a been iE a dull outer product were calculated. Thug the ncrnlin~sar processor can.
process a larger input vector containing more signal ~requenay data, that ia, data with h~.gher f~ree~uency xesGlut.ian.
V~ator graupa far time differences of ~,, 2, and ~ arc°
ca~lculat~d in blacks 230~2~4 ahawn 3.n F'ig. 9. The vector Y n aontaina linear aambinationa of all the ~al~ments used in formulating 'the two associated ~~zn vgatora. Thus for a ~~.me difference o~P 1 (~ ~ 1) , vector ylsn cvrat~ains vombinations of all the el~manta that are one column apart, i.e., the elements in adjacent columns. Similarly, the y~~ vectors for time d~.~ferenaea 2 and 3 ~ar~ formulated by combining all the elements that era at least two and three columns apart, reapectiv~ly.
O~. 15. 9U 12: 10 P2JZ %~NUTTEFIMcCLEBJhTEIT, FZ~73 yl:., vector z~~~~ is formulated in block 232 by combining the energy farms with matrix r~lements which are ore column apart.
8imiletrly, the vector ~~~n im formulated in block 23A by Combining frequency elements that are, one column apart. 7Chus, the "~" veCtore contain e~,°smenta r~presenting certain combinations of the en~rgy and frequency terms from columns relating to the $ppropriate time difference. 8lmilarly, the vector groups for time differences a and 3 (e ~ z, s) are formed by,Cambaning elements which are two and three columns apart, respectively.
Vectors ~~~ are formed in block 234 by Cambining all the products of the fr~qu~ency ta~ema i~ram paixs o~ columns. mhe produata are ~ummed in a fasshion lake that of a Cross-correlatian be°twe~n frequency vectors, mhe sum in block 234 is called a ~~aroas pradu~st'° since it is formed be~tw~en the ~requanoy elements of two d~.ff~arent. columns. this c;roes product is then summed through tim~, i.~., over all the pairs of aolttmna adh~ring to the times d~.:~ferenae ~. Again, taking the cross prs~duvt of block 234 strategically r~duces the output veatar Erom what it would have bean if a full cuter product were Calculated. Hence, the input vector may be larger.
They veatar groups are theta Concatenated in block 236 to form a 431~e1,oment vector an 238, which :~a a nr~nli.near representa~:~.on of they data. ~'h~ superscript "T" in block X36 dgn~t~~ th4 '~1"~n~p0~9 ~7~ th8 v~G°~.'b~° ~"~d wx'2t~~n a It ie impartant to note that, although the nonlinear processor 28 emp~,oya multiplic~aticn to produce non:lineax interactions between olementaa ether nonlinear funct~.ona COUld be u~ec~ in plac~ of multiplication; the important feature is that saome type of nonl3~asar interaot~.c~n aCCUr. we employ multiplication mere9.y because it is simple to impiemet~~t.
The vector an 238 is appli~d to the aeaond nonlinear pxaceasax~m2 30 (dig. 2) which is depicted in Fig. 10. she f78. 1 5. ICJ 1 ~~~' : Z [a F'h?, m~UTTER~c C,'LEN2~1E~7, ~ I ~H y I t=i ~~r~ ~' -~3-elements of ers~otar an are :firs' dea~xxolat~d and data-reduced by multiplying th~am by an a~igenmatxix Fee. Th~a eigenmatrix FZa is formulated Eram the davalapme~nt database as illustrated in Fig. zz. They e~ir~anmatrix Faa aan~Ga~.ns aige~nvaatora corresponding to ~.ha tw~anty~s~ix larg~agt aigenvalu~s aaiculated from the devaalapment data aorresponding to the veator groups.
Thus multiplying a~ by tlx~ eiganmatrix reduces the data to the components of a~ lying in the directiane o:~ the twenty-eiac sigenvectors sel~!eted as accounting for the most variance.
The data are thus reduced from the 431, elements in v~ctor a~ to twenty~~ix ~~.ements in v~atar bn 242. By so reducing the dat$, we. 1~~a only abQUt 4~ of the information relating to s~.gna1 variance. Aaaordingly, without sacritiatng mush at the impartant signal intarxnatian, a c~omprt~mis~ is arshieved betwe~n (~,) retaining complete signal information and (ii) limiting the number at parameters subjected to nonlinear processing and thus to a geometric ~xpansion in the number of pmrameters. W~s belleva that by selecting the intarmation oorrespo~ac~~,ng to 'the laxg~st e3genvectore we are selecting the infr~rmation which is mast impar~tan~ t~r phQname recognition attar further processing.
The resulting twenty-six-~slsment ver~tar, b~ 242, is subjected to fixed~parameter normalization in block 244. The mean values, ~5~, depleted in blac~C 244 are formulated tram corresponding elements in ~a group of twenty-~xix~element vectors 1~~ in the development database as discussed in mots detail balaw with reference to Fa.g 23. The twenty~ix elements in the v~atGr b~ generated for the incoming SpFI~CFi signal ar~ compered with the av~rags of corresponding elements in the development database. The relative data values, rather than the actual values, are important for phoneme estimation. The m~an values, which may add little ihtormatian, ere thus elimina~.ed from the vector elements. ode may omit this normalization pr~aaess~.ng OB. 15. 9U 12: ~[~ FM >k~'UTTEI2~oC:LErIPTEPJ, F'IS'H y1~7 '.~ ~ =3:
e~tep from future embadimente.
A fu~,1 cuter praduat of the twenty-six elements of "narmaliaada~ vaator on 246 is th~n formulated in block 248.
The result 18 & 361-~elament veatAr d~ 250 Gdrttaining third°~ and fourth-order terms re~lativa to the adaptive receptive field mat~r~.x %n 222 (fig. 7) . This vector do is concatenated with the elements of vector a~ a38, flaming a 782-element vector a~
254. The. concatenated data are .th~n ,applied to iChe normalization proces8or 32 (fic~. ~
again, while we employ multiplication in step 246 beaausa that ie a ~timpl~ way to produce nonlinear in~.eraation results, other nonlinear functions can also be emp~.sayed for this puxpose.
RaiCerring to fig. 11, the Vector e~ 254 is sub~eated to another fix~d-parameter norrnalizatian in block 256. Thereafter the data ~.n the resultant vector ~" 258 are sub~ec~ted to ver~tox-by-vector normalization. That is, each individual veatox tn ie normalized sa that, across its 7a2 alements~, the mean is zero arid the standard d~viation ~.s one. The resulting normalized veotor g~ 262 is applied to the speech-element model-1 prorsessor ZG4. The data are thus raduc~d to a se's of spe~oh element estimates, with each estimate raorre~sponda.ng to one o~ the labels in Table 7 (dig. 43). Furthax nonlia~ear processing can ba p~r~ormed on the reduced data to better eetima~:e which particular epee~aYa element the data represent.
The speech-element modal-1 prvaesgor 264 multiplies the x~armal3.zed vector gn 262 by a kernel K'. The kexriel Ka cax~tains parameters relating to specific speech clement labels calculated using the data 3n thc~ development database. These labels are listed in Table 7 (fig. 42). formulatian of the kernel Xc~ is discuss~d with reference to dig. 28 b~low.
Multiplication by 'the kernel K~ effectively multiplies vectrar ~n by each o~ ninety-faux vectors, each o~ which ie aesc~aiated C7 8 . 1 5 . 9 C7 1 2 : 1 G F' T,/I >k r.1 U T T ~. R N( c C; L E I~T rJ ~
2~7, F I =3 ~Z y 1 g ~ .ra ,~
~~a~~~E,,~~.
With a different speaGh element listed in Table 7. The multipa~,~aat~.on generates a vaator hn whose avmpon~nts are n~.no~Gy-four figures of me~xit, sash of which is related to the likelihood that th,r~ ~spssch aontaina 'the speech a~l~ment a~soaiatad with it. The speech-element model-1 pros~asor 264 thus strategically reduces the data x~alatinc~ to tho incoming ~PRECFT signal, that ig, veotar g~, from 782 elements to ninety-four elements.
The veatox hn 266 conta~.ning tho r~duaed data is then concatenated with v~ators from two previous time periods in pxoc~essor 35, shown in Fig. 12. The de~l~ta time signal 22~
(Fig. aj is also app~,ied to the proaessox 36. 8peai~ically both the vector h~ and the ~l~lta tuna signal 22C are applied to buffers wtl~re the respective values for the two pz~ew~.ous time periods of each are stored. ~'hua the two buffers contain information relating to the same thxes-time-unit-long time period.
~f twc~ aoneeautiv~ vectors correspond to delta time signals longer than 12-milliaeaonds, we assume that the v~etors are derived from non-ovcarlapping r~ceptivo fisaldn. Thues the vector corresponding to the long d~lt.a time s~,gnal, that is, either the fix~at. ~r third vea~or atoreel in the buffer, will add liittla information wh.~ch is helpful in assigning phoneme estimates to the c~enGer vector hn. Accordingly, the corz~esponding vector is replaaad with all ~~RCa. This ensures that they triplea~ formed in block X04, i.e., vectors p~ 3c6, do not contain non-contiguous data. ~'he triple vector p~ 306 thus covers an ~nlarg~d ~~windaw~l in aontinuoue speech, ~orme~d from data r~eri.ved a~rom three overlapping rec~aptive fields.
Tn subsequent model.»g the speci~ia phoneme label associated with, the larger window are those of the aentra~.
receptive ~i~ld, so that the phonemes reaogni~ed are aentsred as much as possible in thr~ larger window. Many ph~snem~a, for l~ Et . 1 ~a . H U 1 2 : 1 O F DQ m N U T T ~ R 7~1i c C L ~ 1~7 r.7 E tJ, p I
~ H P 1 g ':e L.i exempla the °°au°~ in the word °~thougand°°, are heard morn distinctly over a relatively long time period anti thu~ should be more easily reaogniaed using this larger winc9ow. However, it the system is receiving ~PE~~H signals aorreaponding to rapid s~paaah, the lonc~nxr time period may result in mare than one phon~me per window. purther nonlinear processing and speech modeling .allow the system to recognize and s~parate such phonemes.
Referring to Fig. 12, enlarging the phoneme estimat~a time window a~t~this point.in the prac~easing is mor~ effective in phoneme recognition, for example, than increasing the size, that is, th~ r~levant time period, og the reaeptiva field.
Increasing the time period covered by the receptive field increases the numb~r of param~ters, assuming the resolution o~
the data remains the same. Then, in order to perform non~.~.naar processing using the larger receptive field without unduly expanding the number of,parametera the system must handle, the resolution Q~ the data-~ e~.ther per time pericsd ar pa:r frequency distribution-- mur~t ba radua~ad. ~Gengthening the ~,~.~n~ window at this point in the proaeasing, that ia, after a first speech-element mode~.ing step reduces the c9ata by selecting data x°elatint~ tc particular speech elements, inst~ad of lengthening the receptive field time period, all.ow~~ the system to look at data representing a longer segment of the incoming gpEECH signal without unduly inereasing the number of data parameters andOox without re~uaing the resolution o~ the data.
Referring still t.o Fig, ~.2, by enlarging th~ phoneme-estima'~e t~.me window w~ eliminate some of th~ contsxt-dependen~sy labeling s~~ th~ earlier speech r~aognition system-x.
Thg epeeah-recognition system-~ alters phonem~ labels d~pending upon context. F'or example, i~ a ~rc~we3 was preceded immediately by an unvoiced consonant or a voiced consonant, then the label OED. 15. 00 12: 1 O PM *?dUTT~R~cCL,EP71'dErJ, FI 5H F~Jra t G, r,~
~aa_ of that vowel was changed accordingly. As a consequence, phonem~ labels, particularly those for vowels, proliferated.
Tn the pre~tent system, however, the great ma~~ority of phaneme~s hav~ only one label, and the increased nonlinearity of the data aonveyst fi.hs~ Gontaxt of the phoneme labels to the word/phz~ase determ~,ner 1~ (F'ig.1). The number of label~ and thus spellings stored in the determiner is ~aign~.fioantly reduced, and bhis reduction expedites the search for the appropriate word ox phrase.
Referring now to gig. ~.3, the: output triple vec'~or p~ 3os from F'ig. 12 is apg~lied to the third nonlinear prncessar-3 ~8.
~ehis nonlinear processor is similar to nonlinear proa~assor-2 30, shown in dig. 10, faith two differences. ~°irs~t, there is no fixed-°paramater normal~.xat~.Qn here. Second, and more importantly, th~re is a thre~sho~.d here.
lPr~.ox to forming ths~ outer product in proaessc~r~3 38, the data arty csmmpared w~:~th a threshold in block 308. The threshold is set at ~~ro: Ves~tor p~ 306 contains estim~atas of the likeliho~d of sash sp~~ch e~,ement. thus an el~ment of vector pn that iB below zero indicates that a speecsh element that leas been proaeased by speech~elsmsnt ~nadel-~, 266 (pig. 11,) ~,s unlik~ly to have oeau~°red at the corresponding pleas in the concatenated windrow.
the rat~.onale for applying thg threshold ~0~ is as fellows: Vector p~ 3~6 is deeaomposed info eigenvector ~somponents in block 33.2, and then passed through an outer product in black 316 which greatly expands the s~.~e of the vectc~x. the expansion in vectrar sire means that a relatively large number of.parameters will be devoted to processing th~
ves~tc~r in subsequent prooest~ing. ~I~nce, care should be taken to forrrEU~.ate a vector Containing only the most important information before the eacpansir~n in s~ixe. In the interest of deploy~.ng parameters in they ~t~st efficient manner, it is batter OE3. 1 5. 90 1 2 : 1 U F'~/i mNUTTEF:Mo CLEITP.TET7, ~ I aH F'2 2 to ignore the model values of the great majority of spes~ch elements that era unlikoly to hmve oaauxrad at a given tame.
~'hese apseah elements have mode~l,valu~a below zero. Thus, using the thra~~hold 308, what is paaaad~to further nonlinmar processing is characterized lay the model valuQa a~s~aociated with the speech elements that era J.ikely to have occurred.
Referring still to Fig. ~.3, vector pn 306 aomponenta exceeding the predetermined thr$aho~.d era strategically decorrelatcad and reduced by multiplying the data lay an aig~nmatrix.R~~ in block 312. Tae eiganmatrix E33 is formulated from the aiganvactora associated with the thirty-three ~,argest eig~snvalu~a calculated from data in the development database aorxegponding to vesxtor ~In 310, as discussed in more detail with referenoa to fig. 2~ lo~elow. The cYata are thus xeduaed by aalacsting fox further nonlinear processing only the components of the data lying in the directions of the thixty-three largar~t aigenveatr~rg. ~ha cdmpromig~ between retaining signal information afid reducing the number of parameters subj~cted to nonlinear processing r~aultB, at this point in the praaessing, in retaining appx~oximats~.y 50% of the information accounting fox' signal ~arianoe w#aile reduoing th~ number of parametsara subjected to nonlinear processing from 282 to thirtyathree.
'~ha resulting data va~,uea, vectox r~ 314, are applied tca blook 316, where the aampl~ate cuter product is formulated. The r~saul~ts~ of the aut~e~r product are than conaatena~tad with the veGtar pn 306 to form an g~3-~slement v~ctor tn 320. ~'hia veotor con~caina terms w9ah a high dee~ree of nonlinearity as well ae all the components ref senator p~ 306. ~t thus contains 'the data on which the nonlinear processor-3 operated as well as the data which fell belaw the threshold.
Again, wh~,~.e wa amps.~y multiplication in at~p 3~.6 because that is a simple wary to praduoe nanlinear intexaat,ion xemua.ta, other nat~7,inear functions clan also be emplayed for this O ~ . Z ~ . d C'J 1 2 : 1 ( 1 P 1V1 %fs ~7 U 'L' 'Z' ~ R 3e~L a er L ~ 1'.12.1 T=.' 2.1 r F Z S' Zi F' dZ f° f:~ ~~ 6 ~~
m2g~
purpnsa.
~'ha 643-slam~nt veotar t~ 32o is then applied to the aaaond speechmelement modal-z proepasor 322 shown in Fig.
The ~p~aah-element model-~ pz~oaessax mult3piiee the data by a kernel Kz, produo~.nc~ a veotar un 324 a Kernel KZ has elements varrespondinc~ to the espeech elements (which ws refer to b~~.aw ae "phoneme igotypes°~) listed in Table 8, Fib. :43. Veotor u~
aontaina the speeoh element (phoneme is~atype) estimates. The kernel KZ ie formulated ~ram the develapm~nt de~ta as diacu~s~ed with reference to F'ig. 32 belaw. Phansme ieotypaa ar~
dieraueeed in mare detail below.
Iiernele K~ and Its differ X17 aide and efFeCt. Karnsl K~, discussed with reference to Fig. 11 abave, aantaina elem~nts which represent a simpler set of speech elements th~xn Far which w~ model using karn~1 Eez. The~s~ speech elema~nts~ are listed in Table ~, Fig. 4~. Fpr ~5~ar~p~,e, kernel X~ aantaina .dements c~arre~ponding to the ap~eah element '°b~~, and eaoh oacurrenae in apaeah of a °'b", whether it is an initial ~°b", a bridg~a "
b_°', etc. i$ mapped, using kernel K', t~ the entry "b", Kernel Ka oantaine entries which di~t~.nguieh between an init~,a1 "b", a br~ldt~a '°m"br,n, eta. Th~ apaeah~ elements aaaoaiated wi'~h kelrnel ~Z G~.~~ ~~gt~d in ~a~~~ a, Fig. 4J o The speech element (phoneme isotype) estimates are neatt applied to the likelihs~ad ratio proceeessr 42, which ~.raraslata~
each estimate into the logarithm o!~ the lik~lihoad that its speech element is present.. The likelihaad far sash speech ~lement is calaulat~d assuming normality of th~ distributions of estimate vela~a bath when the phonem~ is absent and when the phoneme is present. Th~ logarithm en~surea that further mathematical opa~xatiane an the data may~thereafter be. performed ae simple additiane xather than the m~re~time~canauming muitipii~atione a~ likelihood ratios.
The: rmBUitant logarithm of the likellhaod rata~s in veotor l7 8 . 1 5 . 9 O 1 ~ : 1 U F' lE~I m ~ U T T E ~ rJj c G' L E I~T rJ E r.T, p Y .=~ H ~ ~ U
Eg St r, y. .t 9 ~d ~ Yy ~ C~' vn 328 is applied to the phoneme estimate rearrangement processor 44 shown in Fig, 15. The rearrangement processor 44 manipulates the data into a form that ~.a more easily handled by the Word/Phrase Determiner 14 (Fig. 2). Wh~,la some of the rearrangement st~ap,s ar~ designed to manipulate the data for the specific Word/Phrase i~sterminer used in the pr~sferred embodiment, the simplifioation and consolidation of the data by rearranging the rpee~ch element estimates may simplify the d~texmination of the appropriat~ wprds and phrase .regardless of the particular word,/phrasa determiner used in the system.
~ha phon~ma rearrangement processor manipulates the data such that each sp~esoh element is represented by anly one lab~1.
Aa~tardingly, the Word/Fhrase Determin~a~° 14 need only store and cart through one representation of a particular phon~me and one spelling? for sash woxd/phrase~.
Each speech element estimate veotor should include the estimates asspaiated with one phoneme. ~towevor, some of the vaotors may inaltade diphone estimates as set forth in ~abl~a 8 (Fic~. 43). ~uah spe~ch ~slc~msnt estimate vectors are split in block 33o in Fig. 15 into aonstitu~nt phonemes. the estimates for th~ first portion of the diphone are moved beak in time and added. to signals from the earlier signal segment and e~stixnates for the second partion of the diphone erg moved forward in time and addod to any signal data present in the later time segment.
While the order of the phonemes is 3.mpQrt~ant, the placement in time of the phon~me~s is not. Moet ~~aords and syllables era at least several of these 36~mssc t~,ma units long. Thus, s~parating diphanea into constituent phonemes and moving th~
phonemes in time by this small unit wall not aff~at the matching of the estimates 'Go a word or phrase.
Dnoe the diphonss are separa'~ed into. constituent speech ~lements, the speech elementB are raduc~sd in block 334 to the rmallast set of speeoh ~lements (which we refer to belr~w as O B . 1 5 . -r-J U 1 ~ : 1 CJ F' 2./1 >k I~ U T T E R a/( o C: L E N L1 E h7, ~ I =; ~-i y ~ e~
~~~%~~%
-3 ~.~
"phon~rna hala~.ypas°°) required to pranaunva the words/phra~~s.
These speech elements are listed in Table 9, Fig. 44. For instanara, all final and bridge forma of phon~~naa are mapped to ttte~ir initial forms. Thus fiho individual rip~ech element scores are combined and negat~,ve~ ~aor~s are ignored.
The simplified speech slam~int (phoneme holotype) estimates are :applied to th~ phoneme estimator ~,ntcgr~ator 46 which ~,s shown in block diagram form in fags. 16-~,~. With referena~ to P'ie~. 1,6, sooxcs for givat~ phonemes are grouped over time in block 338 ' slang with the a~~oc~3.at~d delta tins signal ~2C from an~rgy dete~rst prat~ssor 22 (Fig. 5) a Blc~rrk 34~ keeps txi~t~k of the abso~,uta time in the grauping. The° scores for a given phonems~ era than conaol~,dat~ad into pna time loca'~ion in bloaka 344 and 34s (Fig. 1?).
~tsferring now to fig. 1?, the summed phonoma eatima'~~a scor~ is equated with the closest °~oentra~,d°° time in block 348, that is, the ~Gimg ir~diaa~ting the center of thc~ waightod time period over which a particular phoneme is spoken. Tamaa within thie~ period are weighted by th~ phAname estimate value. Then the a~eSOaiated phon~me label node, phoneme estimate value, and centroid time of ocaurr~nc~ era stored in a location of memory, as shown in block 352. The storage is aaceased by block 353 in Fig. 15, and the entrios are ordered by the a~ntroid time of aa~urre~ne~, a~ as to provide a oorract time r~rds~ring. output phon~me estimates am 35~ and the associated delta time values d~ era thin aacss~~ad by the t~ord/Phrase ~etarminor 14 (F~ige 2) a Th~ subsar~.pt~t have c~h~ang~d ~nt~~ agai3~, from °°n°° to a9m~o r 'to indiaste that. the outputs of Fig. ~.8 have a tiane bass distinct from 'that of the inputs.
The operation of the sp~ach-~lamant modal-2 procsessor 4p, .
anr~ the rearranging and ac°naolidating of the phonem~ ~stimat~s produaod by th~ system are illustrated by Considering th~
~r~~r~~~~n~ of the w~x'~ °~y~~t~rd~~e °° The ~~t ~f ~p~~~h ~~~~~n~s C5 ~T . 1 5 . 9 C1 1 2 : 1 C1 P 2.~ rEe T7 U T T ~ ~'t Zslj o r' y, ~ L.'I PJ
E tJ, F I =; H F~ ~ ~, ~~~~w~;
d32-for which the opeeoh i:y mod~lad includes subsets, each of which aonaiats of speech elamentE~ that are all What wa call nisotyp~Dm°° of a aingla phoneme Ilhcalatypa°~, Four example, the BpD~AOh el~menta ~1-,'tr'°, n,-v_"le and IDVn of Table 8 era all.
isotypea of tlxa~ comma ho~.o~ype, in this oasDa °~v°° , The phoneme isotypa aatimato labsla that are assigned to the apeach by the spa~ach-eleme~n'~ mod~~.-~, proa~leaaor, ignor~,ng noi.ay or bead eat.imataa, may bet J1 ~T jE1 Ee re~j iaOl.to t Ri R _",di ~C9~i d aI~ eIj Iiar~ wa have exampler~ of sa~eral different phoneme possibilities. Thins ins a caomawhat schematic version of the phonemes which appear in clearly articulated gpcoch. Each of then listed elements repr~asanta those phonemo iaotypas that would appear in contiguous windows in the spdach, each window oorraaponding to a detected reaeptivo field. The symbols between semicolons occur in the came window.
'~ha ayllD9bia fprnt °°d°° praoad~a the n~
°° glide as if the uttoranda wore Dla~-.yaat~rday." ~'he njn glide appeaxa again in the diphone °°~E. °° The next window ree-iterates the vowel ,°E. ~~
The final i°orm of ~°s°~ appears next, as °° s~°, indicating that them is D~oma voicing heard before the fricative but not aa~aul~h to be identified $s a particular vowed,. The unvoiced ats~p °°t°' is expressed hare firD~t in ~.ta isalated form °°isol.t°°, indicating that th~r~ is no voicsing hearse Ira the window, and then in its initial form Ilt." Tho next window contains two phtsnemaa, another initial, °°t°° and a ayllabio '°i~n, which is r~-it~ratl~d in the next windOW. The °'dDl appears first aDx a final DI,-dl°, then in its °~bridga9~ dorm ~~~d~n, Dand than as an initial '°d. °' The bridge form oontaina voicing from both the °~R°' and the final vowel 9°ea°~ in tho window, but not enough, of oath of thaa~ to ~uat.~,$y Choir being labalD3d in the same window with O E3 . 1 5 . 9 0 1 ~ : 1 U F~ 2~/I m y~J U T T ~T F~ ~ ,- r L ~ RT 2..7 E rl, F I ~ H F~ ~ F
~, o) c ~
~~~e~3'2~~a the bridge. The final vowel ig repeated.
If the 9pEE~H signal contains nr~iae, various windows may contain phonemes isotype estimates relating to the noise. These estimates, t~ypioally with smaller likelihood numbex~a, axe proaema~ad along with the phAneme ~sirimatea coxraaponding to the spoken word or phrase. The effect of these "noise" phoneme isotypes ig an ia~oreae~a in the time it takes the word/phrasa determinator 10 (~'ig. 1) to prc~aeas tha phoneme estimates.
~eferr3.ng again to the manipulation of the phonem~ isotype estimatea~listed above~, bxock 33~ (R~,g. l~) splits the diphone "jE" into its aonmtituent phe~nem~s:
~l ~J j ~''J Ej ~",&1T ~~~O~,otj t Rj R ~C'~J dwj d ~~J ~~f Eloo9c 33~ then substitutes phoneme holotypes for each ooaurrenae of any of its iaotypea:
~ l ~ J ~ ES EJ ~j tJ t RJ R d; di d eZJ 6IJ
f ina~l ly the estimator integrato~° ~ 6 ( Figs . 16 - ~.8 consolidates instances of the each phoneme holotype, .ao that multip3.e ~,netanaes axe eliminateds f Ef si '~ Rf d1 eI j Tha result it the phoneme ~etimates for the speech. Each phoneme ie treated he~c~a as though it had occurred at a single aentroid time of oaaurranae. These centroid times are no longar restricsted by the ~ncdulo~-3 constraint of the detests (block 13~, ~'ig. 5) , however, the order c~f the ~rax°ioue labels is retained to ensure the correct phon~tia spelling s~f the word. Only those phangmes which are close enough to be ct~naidered in the same word or phrase ara~ so consolidated.
Note that in this example the consolidated "t" is assigned to the same window as the syllabic "R". This will occur if the centroid times of occurrence of the two phonemes are sufficiently close.

PARAMETER DEVELOPMENT
The development of the parameters used in calculating the phoneme estimates is discussed with reference to Figs. 19-35.
Fig. 19 depicts the calculation of the fixed parameters μij and σij used in normalizing the data corresponding to the incoming speech in the adaptive normalizer 26 (Fig. 7). The fixed parameters used throughout the processing, including the mean and standard deviation values, are calculated using the data in the development database.
The development database is formulated from known speech signals. The known signals are applied to the speech processor and the data are manipulated as set forth in Figs. 3-18. Then various parameters useful in characterizing the associated phonemes at various points in the processing are calculated for the entire development database. These calculated, or fixed, parameters are then used in calculating the phoneme estimates for incoming signals representing unknown speech.
Referring to Fig. 19, a mean, μij, is calculated for each of the elements, uij,n, of the matrices Un 206 formulated from the development data. First, the corresponding elements from each of the Un matrices in the development data are averaged, resulting in a matrix μ with the various calculated mean values as elements. Next the standard deviation values, σij, of the corresponding elements of the Un matrices are calculated using the associated mean values, resulting in a matrix σ with the various calculated standard deviation values as elements. The fixed mean and standard deviation parameters are thereafter used in the adaptive normalizer to normalize each element of the matrix Un formulated for the incoming, unknown speech.
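As a rough illustration of this statistics pass, assuming the development data are stacked as N matrices Un (the array shapes, the random placeholder data and the use of numpy are assumptions, not the patent's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
U_dev = rng.normal(size=(1000, 8, 12))       # placeholder: N development Un matrices

mu = U_dev.mean(axis=0)                      # matrix of fixed means, mu_ij
sigma = U_dev.std(axis=0)                    # matrix of fixed deviations, sigma_ij

def normalize(U_n, eps=1e-12):
    """Element-wise normalization applied to an incoming, unknown matrix Un."""
    return (U_n - mu) / (sigma + eps)        # eps guards against a zero deviation

U_new = rng.normal(size=(8, 12))             # an incoming segment's matrix
print(normalize(U_new).shape)                # (8, 12)
```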
Fig. 20 defines a covariance matrix R 410 which is used in calculating various eigenmatrices. The covariance matrix R corresponding to the N input vectors an 406 formulated for the development data is calculated as shown in block 408. The covariance matrix R is then used to calculate eigenvectors and associated eigenvalues as shown in Fig. 21.
Referring to Fig. 21, the eigenvalues are calculated in block 412 and ordered, with vector b0 (from 414) being the eigenvector having the largest eigenvalue and the last vector being the eigenvector having the smallest eigenvalue. The eigenvectors are then normalized by dividing each one by the square root of the corresponding eigenvalue to produce a vector b′ 420. The first B normalized eigenvectors, that is, the B normalized eigenvectors corresponding to the B largest eigenvalues, are assembled into eigenmatrix EB 424. The eigenmatrix EB is not required, by definition, to be a square matrix. The superscripts "T" in block 422 denote the transposes of the vectors as written.
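The covariance-and-eigenvector development of Figs. 20 and 21 can be sketched as follows; the vector dimension and sample count are placeholders, and the 26-vector cut merely anticipates the E26 example discussed next:

```python
import numpy as np

rng = np.random.default_rng(1)
a_dev = rng.normal(size=(5000, 40))          # placeholder development vectors an

R = np.cov(a_dev, rowvar=False)              # covariance matrix R (40 x 40)

evals, evecs = np.linalg.eigh(R)             # eigenvalues in ascending order
order = np.argsort(evals)[::-1]              # reorder so b0 has the largest
evals, evecs = evals[order], evecs[:, order]

B = 26                                       # keep the B largest eigenvalues
EB = (evecs[:, :B] / np.sqrt(evals[:B])).T   # B x 40 eigenmatrix EB

decorrelated = EB @ a_dev[0]                 # whitened, reduced representation
print(decorrelated.shape)                    # (26,)
```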
Fig. 22 depicts the calculation of eigenmatrix E26 432 used in nonlinear processor-2 30 (Fig. 10). The eigenmatrix E26 is calculated using the calculation method described with reference to Fig. 21. The covariance matrix R 410 required for the calculation of the eigenmatrix is formulated from the development data as shown in Fig. 20. The eigenmatrix E26, containing the twenty-six eigenvectors associated with the largest eigenvalues, is then used to decorrelate the data relating to the incoming speech in block 240 of nonlinear processor-2 (Fig. 10).
Fig. 23 depicts the calculation of the mean values used in the fixed-parameter normalization-2 processor 244 (Fig. 10). The processor 244 normalizes the twenty-six data elements associated with the selected twenty-six eigenvectors. Thus the mean values of the elements in the N development database vectors corresponding to vector bn 242 are calculated.
Fig. 24 similarly shows the calculation of the parameters used in the fixed-parameter normalization-3 processor 256 shown in Fig. 11. The means and standard deviations for the corresponding N vectors 234 in the development database are calculated, resulting in a vector μ 440 containing the calculated mean values and a vector σ 442 containing the calculated standard-deviation values.

Fig. 25 depicts marking of the speech. The segments of the development data input SPEECH signal s(t) are extracted to form a "window" into the speech, represented by vector s′n 444.
The windows are selected to correspond sometimes to the time width of the receptive field matrices Un 206 (Fig. 6), represented also by the vectors hn (Fig. 12), and sometimes to the time width of the overlapped triple, represented by the vectors pn 306 (Fig. 12), as discussed below. The former time width corresponds to 1184 data samples of the input SPEECH signal s(t); the latter time width corresponds to 1760 such samples. Block 446 of Fig. 25 shows the extraction of the longer window. If the shorter window were selected, then the window would be formed by the 1184 samples centered about the midpoint of the longer window. The windowed speech is then associated with phonemes by a person listening to the speech, as shown in block 448. The listening person thus marks each such window as containing the particular phonemes he or she hears, if any.
The time width of the window selected by the listening person depends upon the number of phonemes heard and upon the clarity of the sound. Phonemes in the longer window often can be heard more easily, but the longer window often introduces more phonemes in a window, and hence more ambiguity in marking. The choice thus represents a trade-off of clarity of the speech heard against time resolution of the resultant labels.
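The two marking widths reduce to simple slices of the sampled signal. In the sketch below only the 1184- and 1760-sample widths come from the text; the signal length and the window center are placeholder assumptions:

```python
import numpy as np

LONG, SHORT = 1760, 1184                         # sample counts from the text

def extract_window(s, center, width):
    """Cut `width` samples of the input SPEECH signal s(t) about `center`."""
    start = center - width // 2
    return s[start:start + width]

s = np.random.default_rng(2).normal(size=48000)  # placeholder speech signal
center = 10000                                   # assumed window center

s_long = extract_window(s, center, LONG)         # window for model-2 marking
s_short = extract_window(s, center, SHORT)       # window for model-1 marking
print(len(s_long), len(s_short))                 # 1760 1184
```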
If all the marking were done with the shorter window, then the labels would correspond to the time width of the speech used by speech-element model-1 254 (Fig. 11). The labels would be "matched" to this model, but would be "mis-matched" to speech-element model-2 322 (Fig. 14). Likewise, if all the marking were done with the longer window, then the labels would be matched to the second model but mis-matched to the first. Ideally, the labels always would be matched to the model in which they are used, and the person listening would generate two complete label sets. There is, however, great commonality in what is heard with the different window widths. In the interest of easing the burden of marking the speech, the listening person is permitted to select the window time width to best advantage for each label instance.
Fig. 26 shows the processing of the labels after they are marked by the person. If two phonemes are heard in a window, then they may constitute a pair that are mapped to diphone labels as shown in block 450. If only one phoneme is heard in a window, then that phoneme may be one of the unvoiced consonants which are mapped to isolated speech element labels as shown in block 452. If more than two phonemes are heard, then pairs of phonemes may be mapped to diphone labels and others may be mapped to single phonemes. In this last case, if the window is the longer one, the person marking the speech may select the shorter window and listen again to reduce the number of phonemes heard in a window. The mappings are done automatically after the marking is complete, so that the actual labels entered by the person are preserved.
The labels selected for marking the speech are shown in Table 1 (Fig. 36). These speech element labels are selected based, in part, on experience. For example, experience shows that some particular phoneme is likely to follow another. Some of these labels are thereafter refined to include an ordering and/or combination of the phonemes, for example, into diphones.
The number of labels used throughout the processing is larger than the number of labels used in the previous speech-recognition system. Such a large number of labels is used because, unlike the previous system, in which a trigger mechanism is used to indicate the start of a phoneme and thus the start of the processing, the current system may detect a phoneme anywhere within the signal segment window, and processing may be begun, e.g., in the middle of a phoneme. Thus the system uses more labels to convey, after further processing, the context of the detected phoneme.
Referring again to Fig. 26, the labels attached to a signal segment are encoded in block 454 to form a label vector Ln 456. The label vector Ln 456 contains elements representing each of the ninety-four possible speech element labels shown in Table 1 (Fig. 36) as well as the new phoneme labels generated in blocks 450 and 452. The resulting vector has elements that are 1's for speech element labels heard in the segment and elements that are 0's for labels not heard. The label vector is then applied to the parameter development circuit shown in Fig. 27.
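A minimal sketch of this encoding step; the miniature label inventory stands in for the ninety-four Table 1 labels plus the generated labels and is an assumption for illustration:

```python
import numpy as np

# Assumed miniature label inventory, in a fixed order.
LABELS = ["l", "j", "jE", "E", "es", "isol.t", "t.", "R", "-d", "_d", "d.", "eI"]
INDEX = {lab: i for i, lab in enumerate(LABELS)}

def encode_labels(heard):
    """Form a label vector Ln: 1 for each label heard in the segment, else 0."""
    L_n = np.zeros(len(LABELS))
    for lab in heard:
        L_n[INDEX[lab]] = 1.0
    return L_n

print(encode_labels(["t.", "R"]))                # two labels in one window
```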
Fig. 27 depicts the calculation of an eigenmatrix E 462 and a kernel K 470 used in formulating the combined kernel K1 476 (Fig. 28). A covariance matrix R is calculated for the development database vectors gn 262. The vectors gn are the signal data representations that are thereafter applied to speech-element model-1 34 (Fig. 11). The calculated covariance matrix R is then used to formulate the associated eigenmatrix E following the calculations discussed above with reference to Fig. 21.
The vectors gn 262 are then multiplied by the eigenmatrix E 462 to form decorrelated, data-reduced vectors hn 466. The de-correlated vectors hn have 650 elements, associated with the 650 largest eigenvalues, as opposed to the 762 elements of speech data in vectors gn. Thus the number of parameters is strategically reduced and the most important data for speech recognition are retained. The retained information includes information relating to approximately 99.97% of the signal variance. Reducing the data at this point reduces the size of the associated kernel K 470 and also the size of combined kernel K1 to a more manageable size without sacrificing much of the information which is important in phoneme estimation.
Kernel K 470 is formulated using the reduced 650-element vector hn 466. Each row of elements, Kij, of kernel K is formed by multiplying the corresponding element of label vector Ln 456 by the elements of vector hn. The elements of label vector Ln 456 are normalized before the multiplication by subtracting the mean values formulated from the elements of the N label vectors in the development database. The kernel K 470 is used in calculating the kernel K′, which is then used to calculate the "combined" kernel K1 476 as shown in Fig. 28. Kernel K is first normalized, by dividing each of its elements by the associated standard deviation value, producing K′. The normalized kernel K′ is then combined with the eigenmatrix E 462. Combined kernel K1 is thereafter used in speech-element model-1 34 to assign preliminary labels to the incoming speech and reduce the data to a subset of likely labels.
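The kernel development just described might be sketched as follows. The sizes, the use of the per-element standard deviation of the vectors hn for the normalization, and the averaging over the database are interpretive assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
N, n_labels, n_raw, n_keep = 2000, 94, 762, 650   # placeholder sizes

L = rng.integers(0, 2, size=(N, n_labels)).astype(float)  # label vectors Ln
g = rng.normal(size=(N, n_raw))                   # development vectors gn
E = rng.normal(size=(n_keep, n_raw))              # stand-in for eigenmatrix E
h = g @ E.T                                       # decorrelated vectors hn

# Row i of kernel K pairs label i (mean-removed, as described above) with
# the data elements of hn, accumulated over the development database.
L_centered = L - L.mean(axis=0)
K = (L_centered.T @ h) / N                        # 94 x 650 kernel K

K_norm = K / h.std(axis=0)                        # normalized kernel K'
K1 = K_norm @ E                                   # combined kernel K1 (94 x 762)

scores = K1 @ g[0]                                # preliminary label scores
print(scores.shape)                               # (94,)
```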
Fig. 29 depicts the calculation of eigenmatrix E33 506. The eigenmatrix E33 contains the thirty-three eigenvectors associated with the thirty-three largest eigenvalues. The eigenmatrix E33 is calculated in the same way as the eigenmatrix discussed with reference to Fig. 21 above. This eigenmatrix E33 is then used to select the data values representing the incoming speech which are associated with the thirty-three largest eigenvectors.
Fig. 30 depicts speech label vectors used in formulating a second combined kernel K2 (Fig. 32). The set of labels, which are phoneme isotype labels, differs from that used in calculating K1 476 (Fig. 28) as follows: the preliminary labels assigned to the data in speech-element model-1 34, shown in Table 7 (Fig. 42), are mapped in blocks 508 and 510 either to diphone labels in Tables 2 or 4 (Figs. 37 and 39) or to isolated phoneme labels in Table 3 (Fig. 38), as appropriate. The mapping requires delaying the processing by one time unit in block 514. The delay aligns the labels with the center vector of the data triple formed in processor 36 (Fig. 12).
The labels are then encoded to form a 119-element label vector Ln 518.

Figs. 31 and 32 illustrate the calculation of the combined kernel K2 534. Using the label vector Ln 518, the kernel K2 is calculated in the same manner as the earlier described combined kernel K1 476 (Figs. 27 and 28): namely, a square eigenmatrix E 524 is calculated to decorrelate the data in the speech data vector tn 320. Then a kernel K′ is calculated using the label vector Ln 518. The kernel K′ and the eigenmatrix E are then multiplied to form the combined kernel K2. Kernel K2 is used in speech-element model-2 40 to reduce the data and formulate phoneme isotype estimates by associating the data with the 119 possible phoneme isotype labels.
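At recognition time, speech-element model-2 thus reduces to a single matrix-vector product. The sketch below borrows the 119 x 843 dimensions cited later for the hardware example and fills the kernel and data vector with random placeholder values:

```python
import numpy as np

rng = np.random.default_rng(4)
K2 = rng.normal(size=(119, 843))             # combined kernel K2 (119 x 843)
t_n = rng.normal(size=843)                   # speech data vector tn 320

# One matrix-vector product associates the segment's data with the 119
# possible phoneme isotype labels.
isotype_estimates = K2 @ t_n
best = int(np.argmax(isotype_estimates))
print(isotype_estimates.shape, best)         # (119,) and index of top isotype
```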
Figs. 33 and 34 illustrate the calculation of parameters used in formulating the logarithm of the likelihood ratio (Fig. 14). The likelihood ratio incorporates parameters formulated from the development database and assigns likelihood values to the phoneme isotype estimates associated with the incoming speech. The estimates may thus be multiplied by adding, and divided by subtracting, after they are translated to logarithms.
Specifically, with reference to Fig. 33, the development data vector un 324 and the label vector Ln 518 (Fig. 30) are each applied to circuits 536 and 540. Blocks 536 and 540 calculate mean and standard deviation values for elements of the input vector un and accumulate them separately for instances when the corresponding elements in label vector Ln 518 appear in the development database and when they do not appear. Thus block 536 accumulates statistics for instances when the corresponding phoneme is not heard in the input speech. For each individual phoneme, these instances account for the vast majority of the data, since a given phoneme is usually not heard. Block 540 accumulates statistics for instances when the corresponding phoneme is heard in the input speech. Such instances are in the minority.
The resulting mean and standard deviation values (vectors 538 and 542) are applied to a de-rating circuit 544 (Fig. 34) which adjusts the data values to compensate for the resulting difference in accuracy between assigning phoneme estimates to known data which are in the development database and assigning them to unknown data. The mean and standard deviation values are adjusted by multiplying them by coefficients a and b which are the ratio of, on the one hand, such values averaged over all instances in a test database to, on the other hand, such values averaged over all instances in the development database. The test database is smaller than the development database and the data in the test database have not been used in calculating any of the other fixed parameters. The test data contain fewer calculated phoneme estimates, and the estimates are assumed to be less robust than those associated with the development database. The coefficients a and b are thus a gauge of how much the likelihood ratio parameters formulated from the development database should be scaled for incoming new speech.
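A sketch of the de-rating computation with made-up statistics; pairing coefficient a with the means and coefficient b with the standard deviations is an assumption:

```python
import numpy as np

# Placeholder per-phoneme statistics from the two databases.
dev_mu, dev_sd = np.array([0.80, 0.65, 0.90]), np.array([0.20, 0.25, 0.15])
tst_mu, tst_sd = np.array([0.70, 0.55, 0.75]), np.array([0.24, 0.30, 0.20])

# Coefficients a and b: test-database averages over development-database
# averages, one for the means and one for the standard deviations.
a = tst_mu.mean() / dev_mu.mean()
b = tst_sd.mean() / dev_sd.mean()

mu_derated = a * dev_mu                      # de-rated mean values
sd_derated = b * dev_sd                      # de-rated standard deviations
print(round(a, 3), round(b, 3))
```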
Referring to Fig. 34, the mean values are scaled using the coefficients a and b as set forth above. The de-rated values are then applied to circuit 545 which formulates polynomial coefficients for the likelihood ratio circuit 325 (Fig. 14).
After the phoneme isotype estimates are transformed into logarithms of the likelihood ratio, the phoneme isotype estimates are rearranged and consolidated in phoneme rearrangement processor 44 (Fig. 15) and estimator integrator 46 (Figs. 16-18).
Fig. 35 illustrates the generation of the maps used in rearranging and consolidating the estimates. With reference to Fig. 35, a mapping matrix T 554 is formulated for diphones, mapping the diphones to the constituent speech elements. Tables 2, 4 and 5 (Figs. 37, 39 and 40) contain the diphones and constituent speech elements. A second mapping matrix T 560 is formulated for mapping to a single label form various labels representing the same speech element. For example, both the "re" and "R" labels are mapped to the "r" label. Table 6 (Fig. 41) contains the set of speech elements to which the various label forms are mapped.
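A toy sketch of the first of these maps; the two-diphone label space is an assumption chosen only to show the structure of matrix T:

```python
import numpy as np

DIPHONES = ["jE", "tR"]                      # assumed diphone estimate entries
PHONEMES = ["j", "E", "t", "R"]              # their constituent speech elements

# Matrix T routes each diphone estimate to both of its constituents.
T = np.zeros((len(PHONEMES), len(DIPHONES)))
for col, (first, second) in enumerate([("j", "E"), ("t", "R")]):
    T[PHONEMES.index(first), col] = 1.0
    T[PHONEMES.index(second), col] = 1.0

diphone_estimates = np.array([0.9, 0.4])     # placeholder likelihood values
print(T @ diphone_estimates)                 # [0.9 0.9 0.4 0.4]
```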
Figs. 36-44, as discussed above, depict all the tables used in labeling phonemes. Table 1, Fig. 36, contains the labels with which the listener may mark the speech associated with the development database. While the notation assigned to the labels may be unconventional, the notation can be duplicated using a standard keyboard. Explanations of the notation are therefore included as part of the table.
The set of labels which may be used by a person marking the speech (Table 1) has been carefully chosen to cover the possible acoustic manifestations of phonemes within the listening windows. Thus the selection of vowels and consonants, in all the various forms exhibited in Table 1, is a set of speech elements that one could hear in the listening windows. The list of speech elements includes more than what one typically refers to as "phonemes"; for example, it includes initial forms, bridge forms and final forms of various speech elements.

Table 2, Fig. 37, contains diphone labels and the constituent speech elements. This table is used to separate a speech element estimate vector containing diphone estimates into the two appropriate speech element estimates. The table is used also to generate the combined kernels K1 and K2. Table 2 is also used along with Tables 3-6, Figs. 38-41, in generating the maps for rearranging and consolidating the phoneme estimates in phoneme estimator integrator 46 (Fig. 35).
Tables 7-9 (Figs. 42-44) are tables of the speech element labels used in model-1 processor 34, model-2 processor 40 and phoneme rearrangement processor 44, respectively. Table 7 contains the labels corresponding to the elements of kernel K1, Table 8 contains the phoneme isotype labels corresponding to elements of kernel K2 and Table 9 contains the phoneme estimate labels which are applied to data which are manipulated and re-arranged to conform to the requirements of the word/phrase determiner 14 and the word/phrase dictionary 16 (Fig. 1). The labels shown in Table 9 are the phonemes which we believe best characterize the spoken words or phrases in general speech.
The sets of labels used in Tables 1-9 have been carefully chosen to optimize the final phoneme accuracy of the speech recognizer. Thus, the selection of vowels, consonants, diphones and isolated forms, while not a complete set of all such possibilities, is the set which is the most useful for subsequently looking up words in the Word/Phrase Determiner, block 14 of Fig. 1. The tables may be modified to include sounds indicative of subject-related speech, for example, numbers, and also to include sounds present in languages other than English, if appropriate.

HARDWARE CONFIGURATIONS

Figs. 45-48 depict system hardware configurations 1-4. The first configuration (Fig. 45), including a Digital Signal Processor (DSP) microprocessor 600 and a memory 602, is
designed for a software-intensive approach to the current system. A second configuration (Fig. 46) is designed also for a rather software-intensive embodiment. This second configuration, which consists of four DSPs 604, 606, 610 and 612, and two shared memories, 608 and 614, performs the system functions using two memories which are each half as large as the memory in Fig. 45 and DSPs which are two to three times slower than the 10-16 MIPS (millions-of-instructions-per-second) of DSP 600 (Fig. 45).
Fig. 47 depicts a system configuration which is relatively hardware-intensive. This third configuration consists of a 2-5 MIPS microprocessor 616, a memory 620 and a multiply/accumulate circuit 618. The multiply/accumulate circuit performs the rather large matrix multiplication operations. For example, this circuit would multiply the 119x843-element combined kernel K2 matrix and the 843-element vector tn 320 (Fig. 14). The microprocessor 616, which performs the other calculations, need not be a DSP.
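For illustration, a software stand-in for the multiply/accumulate circuit on that 119x843 product, using 16-bit operands and a wide accumulator (the operand width and value ranges are assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
K2 = rng.integers(-2048, 2048, size=(119, 843), dtype=np.int16)
t_n = rng.integers(-2048, 2048, size=843, dtype=np.int16)

acc = np.zeros(119, dtype=np.int64)          # wide accumulator avoids overflow
for j in range(843):                         # one multiply/accumulate per step
    acc += K2[:, j].astype(np.int64) * int(t_n[j])

# The loop matches an ordinary matrix-vector product.
assert np.array_equal(acc, K2.astype(np.int64) @ t_n.astype(np.int64))
print(acc[:4])
```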
Fig. 48 depicts a floating-point system configuration. The system includes a 10-15 MFLOPS (millions of floating-point operations per second) DSP processor 622 and a memory 624 which is twice as large as the memories used in the other systems. The memory 624 is thus capable of storing 32-bit floating-point numbers instead of the 16-bit integers used in the other three configurations.
Fig. 49 illustrates how the parameters, the development of which is shown in Figs. 19-35, are related to the processing system depicted in block diagram form in Figs. 3-18.
CONCLUSION
The present speech recognition system performs speech-element-specific processing, for example, in speech-element model-1 34 (Fig. 11), in between nonlinear processing to manipulate the data into a form which contains recognizable phoneme patterns. Performing speech-element-specific processing at various points in the system allows relatively large amounts of high resolution signal frequency data to be reduced without sacrificing information which is important to phoneme estimation.
If speech-element-specific data reduction processing were not performed at the appropriate places in the system, the resolution of the signal data applied to the nonlinear processors would have to be reduced to limit the number of parameters.
The present system thus retains important, relatively high resolution data for nonlinear processing and eliminates data at various points in the system which, at the point of data reduction, are found to be redundant or relatively unimportant after speech-element-specific processing. If the data reduction and nonlinear processing steps were not so interleaved, the system would be operating on lower resolution data, impairing accuracy.
The foregoing description has been limited to a specific embodiment of this invention. It will be apparent, however, that variations and modifications may be made to the invention, with the attainment of some or all of the advantages of the invention. Therefore, it is the object of the appended claims to cover all such variations and modifications as come within the true spirit and scope of the invention.
Claims (16)
1. A speech-recognition device for identifying a speech element of interest in a speech signal, said device comprising:
A. means for generating a first vector each of whose components represents a component of said speech element;
B. first modeling means for comparing said first vector with a first set of model vectors representing known speech elements, for each comparison deriving a value representing the degree of correlation with one of said model vectors, and generating a second vector each of whose components is one of said values;
C. means for producing a third vector, said means calculating quantities proportional to products of certain of the components of said second vector and powers of certain of the components of said second vector to produce a third vector with components that are quantities proportional to the products and powers; and D. second modeling means for comparing said third vector with a second set of model vectors representing known speech elements, the comparing means producing respective speech-element estimate signals that represent the likelihoods that the speech contains respective speech elements.
2. The speech-recognition device of claim 1, wherein said means for generating a first vector includes:
A. processing means for processing the speech signal to produce for a signal segment a reduced-data representation of the speech that includes a plurality of reduced-data elements; and B. means for calculating quantities proportional to products of certain of the reduced-data elements and powers of certain of the reduced-data elements to produce as said first vector a nonlinear representation of the speech that includes as elements thereof the quantities proportional to the products and powers.
3. The speech recognition device of claim 2, wherein said first modeling means compares the nonlinear representation with a group of modeling elements which include nonlinear representations characteristic of one or more speech-elements of interest in known speech, and producing as said second vector a reduced-data nonlinear representation that includes a plurality of reduced-data nonlinear representation data elements.
4. The speech recognition device of claim 3 wherein said means for producing a third vector calculates quantities proportional to products of certain of the reduced nonlinear representation data elements and powers of certain of the reduced nonlinear representation data elements in said second vector to produce as said third vector a further nonlinear representation of the speech that includes as elements thereof the quantities proportional to the products and powers.
5. The speech recognition device of claim 4 wherein said second modeling means compares the further nonlinear representation with a group of modeling elements which include nonlinear representations characteristic of the speech-elements of interest in known speech.
6. The speech-recognition device of claim 4, wherein said means for producing a third vector includes means for concatenating the reduced nonlinear representation data elements corresponding to a predetermined number of signal segments before calculating the proportional quantities.
7. The speech-recognition device of claim 1, said device further including:
means for monitoring the speech signal to determine when the speech signal contains energy above a predetermined value, the monitoring means (i) manipulating data corresponding to speech signal frequency and amplitude to produce data elements corresponding to the energy of the speech signal within a predetermined frequency range, (ii) manipulating the data elements to produce an energy quantity representing the integral of the signal energy over a time period corresponding to a predetermined number of signal segments, and (iii) asserting an output signal when the energy quantity is above a predetermined energy threshold; and timing means for determining the time when the monitoring means asserts the output signal; and processing means, responsive to the monitoring means and the timing means, processing the speech signal to identify speech-elements of interest in the speech signal segments, the processing means processing only the signal segments for which the monitoring means asserts the output signal.
8. The speech-recognition device of claim 7, wherein said processing means includes means for consolidating and rearranging the speech-element estimate signals to produce a minimum speech-element representation of a word or phrase corresponding to the speech signal.
9. The speech-recognition device of claim 8, wherein said consolidating and rearranging means is responsive to the timing means, rearranging and consolidating the speech-element estimate signals based, in part, on the time when the monitoring means detects a speech element.
10. The speech-recognition device of claim 1, wherein said second set of model vectors corresponds to a predetermined set of phoneme isotypes.
11. The speech-recognition device of claim 1, wherein said first set of model vectors corresponds to a predetermined set of phonemes.
12. A method of identifying a speech element of interest in a speech signal, said method comprising the steps of:
A. generating a first vector each of whose components represents a component of said speech element;
B. comparing said first vector with a first set of model vectors representing known speech elements and for each comparison deriving a value representing the degree of correlation with one of said model vectors, thereby generating a second vector each of whose components is one of said values;
C. calculating quantities proportional to products of certain of the components of said second vector and powers of certain of the components of said second vector to produce a third vector with components that are quantities proportional to the products and powers; and D. comparing said third vector with a second set of model vectors representing known speech elements and producing respective speech-element estimate signals that represent the likelihoods that the speech contains respective speech elements.
13. The method of identifying speech elements of claim 12, said method including the further steps of:
E. generating data elements each of which corresponds to the energy of the speech signal within a predetermined frequency range;
F. manipulating the data elements to produce an energy quantity representing the integral of the signal energy over a time period corresponding to a predetermined number of signal segments; and G. processing the speech signal segments associated with energy quantities which are above a predetermined value in accordance with steps A-D.
14. The method of identifying speech elements of claim 12, wherein said step of generating said first vector and said step of comparing said first vector with a first set of modeling vectors include, respectively, i. producing as said first vector a first reduced-data representation of the speech signal segment that includes a plurality of reduced-data elements; and ii. comparing said first vector with a group of modeling elements which include nonlinear representations which are characteristic of one or more speech elements of interest in known speech.
15. The method of identifying speech elements of claim 14 wherein said step of deriving said third vector and said step of comparing said third vector with a second set of model vectors include, respectively:
iii. calculating quantities proportional to products of certain of the elements of said second vector and powers of certain of said elements to produce as said third vector a nonlinear representation of the speech that includes as elements thereof quantities proportional to the products and powers; and iv. comparing said nonlinear representation with a group of modeling elements which include nonlinear representations which are characteristic of the speech elements of interest in known speech.
16. The method of identifying speech elements of claim 12 further including the steps of:
E. repeating steps A through D for successive segments of the speech signal which are associated with a spoken word or phrase;
F. combining the speech element estimate signals produced in step D to form combination signals; and G. in response to the combination signals identifying words or phrases corresponding to the speech signal.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US07/395,449 | 1989-08-17 | ||
US07/395,449 US5168524A (en) | 1989-08-17 | 1989-08-17 | Speech-recognition circuitry employing nonlinear processing, speech element modeling and phoneme estimation |
Publications (2)
Publication Number | Publication Date |
---|---|
CA2023424A1 CA2023424A1 (en) | 1991-02-18 |
CA2023424C true CA2023424C (en) | 2001-11-27 |
Family
ID=23563092
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CA002023424A Expired - Lifetime CA2023424C (en) | 1989-08-17 | 1990-08-16 | Speech-recognition circuitry employing nonlinear processing, speech element modeling and phoneme estimation |
Country Status (6)
Country | Link |
---|---|
US (2) | US5168524A (en) |
EP (1) | EP0413361B1 (en) |
JP (1) | JP3055691B2 (en) |
AT (1) | ATE179828T1 (en) |
CA (1) | CA2023424C (en) |
DE (1) | DE69033084T2 (en) |
Families Citing this family (40)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5168524A (en) * | 1989-08-17 | 1992-12-01 | Eliza Corporation | Speech-recognition circuitry employing nonlinear processing, speech element modeling and phoneme estimation |
GB9106082D0 (en) * | 1991-03-22 | 1991-05-08 | Secr Defence | Dynamical system analyser |
JPH05134694A (en) * | 1991-11-15 | 1993-05-28 | Sony Corp | Voice recognizing device |
JPH05188994A (en) * | 1992-01-07 | 1993-07-30 | Sony Corp | Noise suppression device |
FR2696036B1 (en) * | 1992-09-24 | 1994-10-14 | France Telecom | Method of measuring resemblance between sound samples and device for implementing this method. |
US5455889A (en) * | 1993-02-08 | 1995-10-03 | International Business Machines Corporation | Labelling speech using context-dependent acoustic prototypes |
US5652897A (en) * | 1993-05-24 | 1997-07-29 | Unisys Corporation | Robust language processor for segmenting and parsing-language containing multiple instructions |
CN1160450A (en) * | 1994-09-07 | 1997-09-24 | 摩托罗拉公司 | System for recognizing spoken sounds from continuous speech and method of using same |
US5594834A (en) * | 1994-09-30 | 1997-01-14 | Motorola, Inc. | Method and system for recognizing a boundary between sounds in continuous speech |
US5638486A (en) * | 1994-10-26 | 1997-06-10 | Motorola, Inc. | Method and system for continuous speech recognition using voting techniques |
US5596679A (en) * | 1994-10-26 | 1997-01-21 | Motorola, Inc. | Method and system for identifying spoken sounds in continuous speech by comparing classifier outputs |
US5796924A (en) * | 1996-03-19 | 1998-08-18 | Motorola, Inc. | Method and system for selecting pattern recognition training vectors |
FI114247B (en) * | 1997-04-11 | 2004-09-15 | Nokia Corp | Method and apparatus for speech recognition |
US6006181A (en) * | 1997-09-12 | 1999-12-21 | Lucent Technologies Inc. | Method and apparatus for continuous speech recognition using a layered, self-adjusting decoder network |
FR2769117B1 (en) * | 1997-09-29 | 2000-11-10 | Matra Comm | LEARNING METHOD IN A SPEECH RECOGNITION SYSTEM |
US6963871B1 (en) * | 1998-03-25 | 2005-11-08 | Language Analysis Systems, Inc. | System and method for adaptive multi-cultural searching and matching of personal names |
US8812300B2 (en) | 1998-03-25 | 2014-08-19 | International Business Machines Corporation | Identifying related names |
US8855998B2 (en) | 1998-03-25 | 2014-10-07 | International Business Machines Corporation | Parsing culturally diverse names |
JP3789246B2 (en) * | 1999-02-25 | 2006-06-21 | 株式会社リコー | Speech segment detection device, speech segment detection method, speech recognition device, speech recognition method, and recording medium |
US6442520B1 (en) | 1999-11-08 | 2002-08-27 | Agere Systems Guardian Corp. | Method and apparatus for continuous speech recognition using a layered, self-adjusting decoded network |
US7370086B2 (en) * | 2000-03-24 | 2008-05-06 | Eliza Corporation | Web-based speech recognition with scripting and semantic objects |
DE60143797D1 (en) * | 2000-03-24 | 2011-02-17 | Eliza Corp | VOICE RECOGNITION |
US7366766B2 (en) * | 2000-03-24 | 2008-04-29 | Eliza Corporation | Web-based speech recognition with scripting and semantic objects |
US6868380B2 (en) * | 2000-03-24 | 2005-03-15 | Eliza Corporation | Speech recognition system and method for generating phonotic estimates |
US6629073B1 (en) * | 2000-04-27 | 2003-09-30 | Microsoft Corporation | Speech recognition method and apparatus utilizing multi-unit models |
US6662158B1 (en) | 2000-04-27 | 2003-12-09 | Microsoft Corporation | Temporal pattern recognition method and apparatus utilizing segment and frame-based models |
US20020059072A1 (en) * | 2000-10-16 | 2002-05-16 | Nasreen Quibria | Method of and system for providing adaptive respondent training in a speech recognition application |
JP4759827B2 (en) * | 2001-03-28 | 2011-08-31 | 日本電気株式会社 | Voice segmentation apparatus and method, and control program therefor |
US7181398B2 (en) * | 2002-03-27 | 2007-02-20 | Hewlett-Packard Development Company, L.P. | Vocabulary independent speech recognition system and method using subword units |
JP3873793B2 (en) * | 2002-03-29 | 2007-01-24 | 日本電気株式会社 | Face metadata generation method and face metadata generation apparatus |
US20070005586A1 (en) * | 2004-03-30 | 2007-01-04 | Shaefer Leonard A Jr | Parsing culturally diverse names |
US7554464B1 (en) * | 2004-09-30 | 2009-06-30 | Gear Six, Inc. | Method and system for processing data having a pattern of repeating bits |
EP1886303B1 (en) * | 2005-06-01 | 2009-12-23 | Loquendo S.p.A. | Method of adapting a neural network of an automatic speech recognition device |
US20110014981A1 (en) * | 2006-05-08 | 2011-01-20 | Sony Computer Entertainment Inc. | Tracking device with sound emitter for use in obtaining information for controlling game program execution |
FR2913171A1 (en) * | 2007-02-28 | 2008-08-29 | France Telecom | Cyclostationary telecommunication signal presence determining method for communication equipment, involves comparing statistic indicator with predetermined threshold for determining presence of telecommunication signal on frequency band |
US8140331B2 (en) * | 2007-07-06 | 2012-03-20 | Xia Lou | Feature extraction for identification and classification of audio signals |
US20120324007A1 (en) * | 2011-06-20 | 2012-12-20 | Myspace Llc | System and method for determining the relative ranking of a network resource |
WO2015145219A1 (en) * | 2014-03-28 | 2015-10-01 | Navaratnam Ratnakumar | Systems for remote service of customers using virtual and physical mannequins |
US10008201B2 (en) * | 2015-09-28 | 2018-06-26 | GM Global Technology Operations LLC | Streamlined navigational speech recognition |
EP3641286B1 (en) * | 2018-10-15 | 2021-01-13 | i2x GmbH | Call recording system for automatically storing a call candidate and call recording method |
Family Cites Families (28)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US3727193A (en) * | 1971-05-18 | 1973-04-10 | School Of Electrical Eng | Signal vector recognition system |
GB1569450A (en) * | 1976-05-27 | 1980-06-18 | Nippon Electric Co | Speech recognition system |
US4156868A (en) * | 1977-05-05 | 1979-05-29 | Bell Telephone Laboratories, Incorporated | Syntactic word recognizer |
US4227177A (en) * | 1978-04-27 | 1980-10-07 | Dialog Systems, Inc. | Continuous speech recognition method |
US4241329A (en) * | 1978-04-27 | 1980-12-23 | Dialog Systems, Inc. | Continuous speech recognition method for improving false alarm rates |
US4277644A (en) * | 1979-07-16 | 1981-07-07 | Bell Telephone Laboratories, Incorporated | Syntactic continuous speech recognizer |
US4412098A (en) * | 1979-09-10 | 1983-10-25 | Interstate Electronics Corporation | Audio signal recognition computer |
US4400828A (en) * | 1981-03-27 | 1983-08-23 | Bell Telephone Laboratories, Incorporated | Word recognizer |
US4400788A (en) * | 1981-03-27 | 1983-08-23 | Bell Telephone Laboratories, Incorporated | Continuous speech pattern recognizer |
JPS5852695A (en) * | 1981-09-25 | 1983-03-28 | 日産自動車株式会社 | Voice detector for vehicle |
US4489434A (en) * | 1981-10-05 | 1984-12-18 | Exxon Corporation | Speech recognition method and apparatus |
JPS5879300A (en) * | 1981-11-06 | 1983-05-13 | 日本電気株式会社 | Pattern distance calculation system |
JPS58130396A (en) * | 1982-01-29 | 1983-08-03 | 株式会社東芝 | Voice recognition equipment |
JPS58145998A (en) * | 1982-02-25 | 1983-08-31 | ソニー株式会社 | Detection of voice transient point voice transient point detection |
JPS59139099A (en) * | 1983-01-31 | 1984-08-09 | 株式会社東芝 | Voice section detector |
US4712243A (en) * | 1983-05-09 | 1987-12-08 | Casio Computer Co., Ltd. | Speech recognition apparatus |
US4723290A (en) * | 1983-05-16 | 1988-02-02 | Kabushiki Kaisha Toshiba | Speech recognition apparatus |
JPS59216284A (en) * | 1983-05-23 | 1984-12-06 | Matsushita Electric Ind Co Ltd | Pattern recognizing device |
US4606069A (en) * | 1983-06-10 | 1986-08-12 | At&T Bell Laboratories | Apparatus and method for compression of facsimile information by pattern matching |
US4718092A (en) * | 1984-03-27 | 1988-01-05 | Exxon Research And Engineering Company | Speech recognition activation and deactivation method |
US4718093A (en) * | 1984-03-27 | 1988-01-05 | Exxon Research And Engineering Company | Speech recognition method including biased principal components |
EP0190489B1 (en) * | 1984-12-27 | 1991-10-30 | Texas Instruments Incorporated | Speaker-independent speech recognition method and system |
US4908865A (en) * | 1984-12-27 | 1990-03-13 | Texas Instruments Incorporated | Speaker independent speech recognition method and system |
NL8503304A (en) * | 1985-11-29 | 1987-06-16 | Philips Nv | METHOD AND APPARATUS FOR SEGMENTING AN ELECTRIC SIGNAL FROM AN ACOUSTIC SIGNAL, FOR EXAMPLE, A VOICE SIGNAL. |
US4941178A (en) * | 1986-04-01 | 1990-07-10 | Gte Laboratories Incorporated | Speech recognition using preclassification and spectral normalization |
JP2815579B2 (en) * | 1987-03-10 | 1998-10-27 | 富士通株式会社 | Word candidate reduction device in speech recognition |
US5027408A (en) * | 1987-04-09 | 1991-06-25 | Kroeker John P | Speech-recognition circuitry employing phoneme estimation |
US5168524A (en) * | 1989-08-17 | 1992-12-01 | Eliza Corporation | Speech-recognition circuitry employing nonlinear processing, speech element modeling and phoneme estimation |
-
1989
- 1989-08-17 US US07/395,449 patent/US5168524A/en not_active Expired - Lifetime
-
1990
- 1990-08-16 CA CA002023424A patent/CA2023424C/en not_active Expired - Lifetime
- 1990-08-17 AT AT90115830T patent/ATE179828T1/en not_active IP Right Cessation
- 1990-08-17 JP JP2216934A patent/JP3055691B2/en not_active Expired - Lifetime
- 1990-08-17 DE DE69033084T patent/DE69033084T2/en not_active Expired - Lifetime
- 1990-08-17 EP EP90115830A patent/EP0413361B1/en not_active Expired - Lifetime
-
1993
- 1993-02-09 US US08/015,299 patent/US5369726A/en not_active Expired - Lifetime
Also Published As
Publication number | Publication date |
---|---|
ATE179828T1 (en) | 1999-05-15 |
CA2023424A1 (en) | 1991-02-18 |
JPH03137699A (en) | 1991-06-12 |
EP0413361A3 (en) | 1993-06-30 |
US5168524A (en) | 1992-12-01 |
US5369726A (en) | 1994-11-29 |
DE69033084T2 (en) | 1999-09-02 |
EP0413361B1 (en) | 1999-05-06 |
JP3055691B2 (en) | 2000-06-26 |
EP0413361A2 (en) | 1991-02-20 |
DE69033084D1 (en) | 1999-06-10 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
EEER | Examination request | ||
MKEX | Expiry |