JPS60149096A

JPS60149096A - Recognition of word voice

Info

Publication number: JPS60149096A
Application number: JP59003590A
Authority: JP
Inventors: 金指　久則; 秋場　国夫; 入間野　孝雄
Original assignee: Computer Basic Technology Research Association Corp
Current assignee: Computer Basic Technology Research Association Corp
Priority date: 1984-01-13
Filing date: 1984-01-13
Publication date: 1985-08-06
Also published as: JPH0431116B2

Abstract

(57) [Summary] This publication contains application data from before electronic filing, so no abstract data is recorded.

Description

DETAILED DESCRIPTION OF THE INVENTION (Field of Industrial Application) The present invention relates to a word speech recognition method that recognizes words by matching input speech against a word dictionary written in phoneme notation.

(Configuration of the Conventional Example and Its Problems) FIG. 1 is a functional block diagram of an apparatus for carrying out an example of a conventional word speech recognition method and an embodiment of the word speech recognition method of the present invention. The conventional example is described with reference to FIGS. 1 to 3. In FIG. 1, 1 is a parameter extraction unit that creates a time series of parameters from the input speech; 2 is a probability density calculation unit that matches the parameters against the phoneme standard patterns and calculates the probability density of each phoneme; and 3 is a word recognition unit that performs segmentation for each phoneme, likelihood calculation, word similarity calculation, and so on. Further, 4 is a phoneme standard pattern unit that stores phoneme standard patterns, prepared in advance through preliminary experiments or the like, which represent the distribution of the various parameters of each phoneme in the form of a per-phoneme mean (μ_i) and a covariance matrix (Σ_i) between the parameters; and 5 is a word dictionary unit that stores a word dictionary in which all words to be recognized are written as symbol strings in phoneme units. In this word dictionary, for example, the words "Sapporo", "Asahikawa", "Akita", "shima" and "shisa" are written as "SAQPORO", "ASAHIKAWA", "AKITA", "SIMA" and "SISA", respectively.

Next, the operation of the conventional example is described. The parameter extraction unit 1 analyzes the input speech in frames of 10 ms, extracts parameters from each frame, and creates a parameter time series.
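The framing step above can be sketched as follows. This is only an illustration: the text gives the 10 ms frame length but not the sampling rate, the frame overlap, or the parameters extracted, so the function name `split_into_frames`, the 8000 Hz rate used in the example, and the use of non-overlapping frames are all assumptions.

```python
def split_into_frames(samples, sample_rate_hz, frame_ms=10):
    """Split a list of samples into consecutive, non-overlapping frames.

    The 10 ms default follows the description; non-overlapping frames
    and dropping a trailing partial frame are assumptions.
    """
    frame_len = sample_rate_hz * frame_ms // 1000
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, frame_len)]


# Hypothetical input: 250 samples at an assumed 8000 Hz sampling rate.
frames = split_into_frames(list(range(250)), 8000)
```

Each frame would then be passed to whatever parameter analysis the apparatus uses, yielding one parameter vector C_n per frame.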

The probability density calculation unit 2 matches the parameters obtained for each frame against the phoneme standard patterns and calculates, for each phoneme, the probability density with which the parameter values are generated from that phoneme. Next, in the word recognition unit 3, the parameters and the probability density values are used, for each dictionary entry, to segment the speech phoneme by phoneme according to the dictionary phoneme sequence that makes up the entry. For each phoneme, the likelihood L of the interval segmented for that phoneme is calculated according to equation (1) below, depending on the type of the phoneme, and the similarity of the dictionary entry is obtained as the average of the likelihoods of its phonemes. Here, let the phoneme be X, let the frame numbers at the start and end of the interval segmented for X be Ns and Ne, and let the value of the parameters in the n-th frame be C_n. The likelihood L_X of phoneme X is then defined by equation (1).

φ_i(C_n) denotes the probability density of a phoneme i and is defined as in equation (2).
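The equation itself is not reproduced in this text. Since each phoneme standard pattern is stored as a mean vector and a covariance matrix over J parameters, a natural reading is the J-dimensional normal density. The form below is an assumption based on those definitions, not a quotation from the publication:

```latex
\phi_i(C_n) = \frac{1}{(2\pi)^{J/2}\,\lvert\Sigma_i\rvert^{1/2}}
\exp\!\left(-\tfrac{1}{2}\,(C_n-\mu_i)^{\mathsf{T}}\,\Sigma_i^{-1}\,(C_n-\mu_i)\right)
```

Here C_n is the parameter vector of frame n, μ_i the mean vector of phoneme i, and Σ_i its covariance matrix, as defined in the surrounding text.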

[Equation] ... (2)

C: the vector of J parameters in one frame
μ_i: the mean vector of the parameters of a phoneme i
Σ_i: the covariance matrix of phoneme i

In equation (1), the range of the summation over i in the denominator of the probability density ratio depends on what phoneme X is. For example, when X is the phoneme /A/, i ranges over the five vowels A, E, I, O and U.
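The exact form of equation (1) is not recorded in this text. The description only says that the density of X is divided by a sum of densities over a class of phonemes (for example the five vowels when X is /A/) and that the result is evaluated over the frames Ns to Ne. A minimal sketch, assuming a plain per-frame average of that ratio (the original may use a logarithm or a different weighting), with the hypothetical function name `phoneme_likelihood`:

```python
def phoneme_likelihood(densities, target, competitors, start, end):
    """Average normalised density of `target` over frames start..end inclusive.

    densities:   dict mapping phoneme symbol -> list of per-frame densities
    competitors: phonemes summed in the denominator (includes the target)
    The simple averaging is an assumption, not the patent's equation (1).
    """
    total = 0.0
    for n in range(start, end + 1):
        denom = sum(densities[p][n] for p in competitors)
        if denom > 0:
            total += densities[target][n] / denom
    return total / (end - start + 1)


# Toy densities for two frames; the values are invented for illustration.
vowels = ["A", "E", "I", "O", "U"]
toy = {
    "A": [0.6, 0.8],
    "E": [0.2, 0.1],
    "I": [0.1, 0.05],
    "O": [0.05, 0.03],
    "U": [0.05, 0.02],
}
likelihood_a = phoneme_likelihood(toy, "A", vowels, 0, 1)
```

Normalising within a phoneme class means a vowel competes only with other vowels, which is the point of making the summation range depend on X.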

The word similarity L_M obtained in this way is calculated for each dictionary entry according to equation (3), and the dictionary entry with the largest L_M is taken as the recognized word.
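The description states that the word similarity is the average of the phoneme likelihoods of an entry and that the entry with the largest similarity wins. A small sketch of that rule follows; the function names and the likelihood values are hypothetical:

```python
def word_similarity(phoneme_likelihoods):
    """Average of the per-phoneme likelihoods of one dictionary entry."""
    return sum(phoneme_likelihoods) / len(phoneme_likelihoods)


def recognize(candidates):
    """Return the dictionary entry whose word similarity is largest.

    candidates: dict mapping entry name -> list of phoneme likelihoods
    """
    return max(candidates, key=lambda word: word_similarity(candidates[word]))


# Invented likelihoods for two entries matched against the same utterance.
candidates = {
    "SIMA": [0.9, 0.8, 0.7, 0.8],
    "SISA": [0.9, 0.2, 0.3, 0.8],
}
best = recognize(candidates)
```

Because the score is an average, one badly segmented phoneme lowers the similarity of the whole entry, which is why the segmentation errors described next matter.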

L_M: the similarity of the M-th word in the dictionary
l_i: the likelihood of phoneme i in the dictionary phoneme sequence
N_P: the number of phonemes in the dictionary entry

In the conventional example, segmentation and likelihood calculation are performed for each individual phoneme of a dictionary entry using the phoneme probability density values. FIG. 2 shows the change over time in the probability density of each phoneme when /SiMA/ (shima, "island") is uttered. In this case, segmentation and likelihood calculation follow the change over time of the probability density values φ_S, φ_i, φ_M and φ_A of the phonemes /S/, /i/, /M/ and /A/. The word-initial /S/ is segmented by taking frame a, where φ_S becomes low and φ_i becomes high, as the rear end of /S/, and the likelihood is calculated with φ_S over the segmented interval (FS-a). Likewise, for the second phoneme /i/ that follows the word-initial /S/, frame b, where φ_i becomes low and φ_M becomes high, is taken as the rear end of /i/, and the likelihood is calculated with φ_i over the segmented interval (a-b).
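The conventional procedure can be illustrated with a short sketch. It assumes that "the density of the current phoneme becomes low and that of the next becomes high" can be modelled as the first frame where the next phoneme's density exceeds the current one's. The function name `segment` and the density values are invented for illustration.

```python
def segment(phi, phonemes):
    """Segment frames phoneme by phoneme along a dictionary phoneme sequence.

    phi:      dict mapping phoneme symbol -> list of per-frame densities
    phonemes: dictionary phoneme sequence, e.g. ["S", "I", "M", "A"]
    Returns (phoneme, start, end) tuples with `end` exclusive.
    """
    n_frames = len(phi[phonemes[0]])
    segments = []
    start = 0
    for cur, nxt in zip(phonemes, phonemes[1:]):
        end = start
        # Advance while the current phoneme is at least as likely as the next.
        while end < n_frames and phi[nxt][end] <= phi[cur][end]:
            end += 1
        segments.append((cur, start, end))
        start = end
    segments.append((phonemes[-1], start, n_frames))
    return segments


# Toy densities for an utterance of /SiMA/ over eight frames.
phi = {
    "S": [0.9, 0.8, 0.1, 0.1, 0.0, 0.0, 0.0, 0.0],
    "I": [0.1, 0.2, 0.8, 0.7, 0.1, 0.1, 0.0, 0.0],
    "M": [0.0, 0.0, 0.1, 0.2, 0.8, 0.7, 0.1, 0.1],
    "A": [0.0, 0.0, 0.0, 0.0, 0.1, 0.2, 0.9, 0.8],
}
segments = segment(phi, ["S", "I", "M", "A"])
```

With clearly voiced vowels, as in this toy data, each boundary falls where the densities cross, corresponding to frames a and b in FIG. 2.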

FIG. 3 shows the change over time in the probability density of each phoneme when /SiSA/ (shisa, "suggestion") is uttered. Segmentation and likelihood calculation follow the change over time of the probability density values φ_S, φ_i, φ_S and φ_A of the phonemes /S/, /i/, /S/ and /A/. When the word-initial /S/ is segmented, however, the following /i/ is devoiced, so φ_i becomes very small. In addition, φ_S extends beyond the true interval (FS-c) of the word-initial /S/ and also beyond the true interval (c-d) of the /i/ that follows it. As a result, the rear end e of the /S/ that follows /i/ is output as the rear end of the word-initial /S/, causing a segmentation error.

Consequently, the segmentation of the phonemes /i/, /S/ and /A/ that follow the word-initial /S/ is also wrong and their likelihoods become low, so the conventional method has the drawback that words containing devoiced vowels are easily misrecognized.

(Object of the Invention) The present invention eliminates the drawback of the conventional example described above. Its object is to improve the accuracy of segmentation and likelihood calculation and thereby improve the word recognition rate.

(Configuration of the Invention) The present invention is a word speech recognition method that uses a word dictionary, in which the words to be recognized are written as symbol strings in phoneme units, and a standard pattern for each phoneme, expressed as the distribution of that phoneme's acoustic parameters. When a word in the input speech is recognized, the input speech is matched against each dictionary entry of the word dictionary. For each phoneme in the dictionary phoneme sequence that makes up an entry, the probability density of generation from that phoneme is calculated using its phoneme standard pattern, and the input speech is segmented. The similarity between each dictionary entry and the input speech is then obtained from the probability density values over the segmented intervals, and the word is recognized. The method is characterized in that, when a devoiced vowel sandwiched between voiceless consonants is segmented and its likelihood calculated, the three consecutive phonemes containing the devoiced vowel (voiceless consonant, devoiced vowel, voiceless consonant) are segmented together, and their likelihood is calculated, using the probability density values of the phonemes. This improves the accuracy of segmentation and likelihood calculation.

(Description of the Embodiment) An embodiment of the present invention is described below with reference to FIG. 1. In the figure, the parameter extraction unit 1, the probability density calculation unit 2 and the phoneme standard pattern unit 4 are the same as in the conventional example described above. The differences from the conventional example lie mainly in the contents of the word dictionary unit 5 and in part of the segmentation and likelihood calculation performed by the word recognition unit 3.

The word dictionary stored in the word dictionary unit 5 writes the words to be recognized as phoneme symbol strings, as before. It differs from the conventional example in that vowels that are easily devoiced are given, in advance, a code indicating this. Examples are the marked I in "ASAH(I)KAWA", "AK(I)TA", "S(I)MA" and "S(I)SA", where the parentheses stand for the circle mark used in the original.
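One way to picture the marked dictionary is sketched below. The parenthesis notation, the function names and the set of voiceless consonants are assumptions made for illustration; the publication only says that easily devoiced vowels carry a code. The sketch groups a marked vowel with its neighbours only when both neighbours are voiceless consonants, which is the case the method treats specially.

```python
# Assumed set of voiceless consonant symbols (hypothetical, for illustration).
VOICELESS = {"S", "K", "T", "H", "P"}


def parse_entry(entry):
    """Parse e.g. "S(I)SA" into [(phoneme, is_marked_devoiceable), ...]."""
    phonemes = []
    i = 0
    while i < len(entry):
        if entry[i] == "(":
            close = entry.index(")", i)
            phonemes.append((entry[i + 1:close], True))
            i = close + 1
        else:
            phonemes.append((entry[i], False))
            i += 1
    return phonemes


def group_units(entry):
    """Merge voiceless consonant + marked vowel + voiceless consonant into one unit."""
    ph = parse_entry(entry)
    units = []
    i = 0
    while i < len(ph):
        is_cvc = (
            i + 2 < len(ph)
            and ph[i][0] in VOICELESS and not ph[i][1]
            and ph[i + 1][1]
            and ph[i + 2][0] in VOICELESS and not ph[i + 2][1]
        )
        if is_cvc:
            units.append("".join(p for p, _ in ph[i:i + 3]))
            i += 3
        else:
            units.append(ph[i][0])
            i += 1
    return units
```

For example, `group_units("S(I)SA")` gives `["SIS", "A"]`, while `group_units("S(I)MA")` keeps the four phonemes separate because /M/ is voiced.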

In the method of this embodiment, the parameter extraction unit 1 first obtains the parameters of each frame from the input speech, and the probability density calculation unit 2 then uses the parameter values to calculate the probability density given by each phoneme standard pattern. Up to this point the method is the same as the conventional example. Next, for each dictionary entry in the word dictionary unit 5, the word recognition unit 3 segments each phoneme X according to the dictionary phoneme sequence that makes up the entry, and calculates the likelihood l_X of the interval segmented for X. When the dictionary phoneme sequence contains a devoiced vowel V sandwiched between voiceless consonants C1 and C2, the probability density values over the devoiced vowel show the character of a voiceless consonant rather than that of a vowel. In the segmentation, therefore, the three phonemes in the sequence voiceless consonant, devoiced vowel, voiceless consonant (C1 V C2) are segmented together, using the probability density values of the phonemes according to the phoneme types and their order, and the likelihood l_C1VC2 is calculated over the segmented interval.

In FIG. 3, the probability density φ_i of the /i/ between the two /S/ phonemes of /SiS/ is almost absent. Instead, the probability density value φ_S of the word-initial /S/ remains dominant from the start of the word to e, the end of the third phoneme /S/.

Therefore, segmentation is performed using the probability densities φ_S and φ_A of the third phoneme /S/ of the three consecutive phonemes containing the devoiced vowel and of the vowel /A/ that follows it, and the likelihood is calculated with φ_S over the segmented interval. In this way, the three consecutive phonemes /SIS/ (voiceless consonant, devoiced vowel, voiceless consonant) are matched to the interval (FS-e). Segmentation is good, so the accuracy of the likelihood calculation also improves.
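The merged handling of /SIS/ can be sketched with toy numbers. The boundary rule (the first frame where the following phoneme's density exceeds the current one's) and the use of a plain mean density as a stand-in for the likelihood are simplifying assumptions; the function names and values are hypothetical.

```python
def find_boundary(phi_cur, phi_next, start):
    """First frame at or after `start` where phi_next exceeds phi_cur.

    Returns len(phi_cur) if no such frame exists.
    """
    for n in range(start, len(phi_cur)):
        if phi_next[n] > phi_cur[n]:
            return n
    return len(phi_cur)


def mean_density(phi, start, end):
    """Mean density over frames start..end-1 (a stand-in for the likelihood)."""
    return sum(phi[start:end]) / (end - start)


# Toy densities for /SiSA/ with a devoiced /i/: phi_S stays dominant
# across all of /SIS/ and phi_I never rises.
phi = {
    "S": [0.9, 0.9, 0.8, 0.8, 0.9, 0.9, 0.1, 0.05],
    "I": [0.01] * 8,  # devoiced vowel: not consulted for the merged unit
    "A": [0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.8, 0.9],
}

# Treat /SIS/ as one unit: its rear end is found from phi_S against phi_A,
# and its likelihood is computed from phi_S over the whole unit.
unit_end = find_boundary(phi["S"], phi["A"], 0)
unit_likelihood = mean_density(phi["S"], 0, unit_end)
```

Because the unit boundary depends only on φ_S and φ_A, the near-zero φ_i of the devoiced vowel no longer affects where the interval ends.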

In this embodiment, a devoiced vowel is not treated as a single phoneme. Instead, the phoneme sequence containing it (voiceless consonant, devoiced vowel, voiceless consonant) is segmented and its likelihood calculated as a unit, which has the advantage of improving the recognition rate for words containing devoiced vowels.

(Effects of the Invention) When a devoiced vowel sandwiched between voiceless consonants is segmented and its likelihood calculated, the present invention uses the probability density values of the phonemes to segment the three consecutive phonemes containing the devoiced vowel (voiceless consonant, devoiced vowel, voiceless consonant) together and calculate their likelihood. It therefore has the advantage of performing segmentation and likelihood calculation more accurately than the conventional method.

[Brief explanation of drawings]

FIG. 1 is a functional block diagram of an apparatus for carrying out an example of a conventional word speech recognition method and an embodiment of the word speech recognition method of the present invention. FIG. 2 shows the change over time in the probability density of each phoneme when /SiMA/ (shima, "island") is uttered. FIG. 3 shows the change in the probability density of each phoneme when /SiSA/ (shisa, "suggestion") is uttered.

1: parameter extraction unit; 2: probability density calculation unit; 3: word recognition unit; 4: phoneme standard pattern unit; 5: word dictionary unit.

Claims

[Claims]

A word speech recognition method that uses a word dictionary, in which the words to be recognized are written as symbol strings in phoneme units, and a standard pattern for each phoneme, expressed as the distribution of that phoneme's acoustic parameters, wherein, when a word in input speech is recognized, the input speech is matched against each dictionary entry of the word dictionary; for each phoneme in the dictionary phoneme sequence that makes up each dictionary entry, the probability density of generation from that phoneme is calculated using its phoneme standard pattern and the input speech is segmented; and the similarity between each dictionary entry and the input speech is obtained from the probability density values over the segmented speech intervals to recognize the word; the method being characterized in that, for a devoiced vowel sandwiched between voiceless consonants, the phoneme sequence containing the devoiced vowel, consisting of the voiceless consonant, the devoiced vowel and the voiceless consonant, is segmented as a unit, and its likelihood is calculated, using the probability density values of the phonemes.