JPS617896A

JPS617896A - Word voice recognition method

Info

Publication number: JPS617896A
Application number: JP59129852A
Authority: JP
Inventors: 金指　久則; 入間野　孝雄; 秋場　国夫
Original assignee: Matsushita Communication Industrial Co Ltd
Current assignee: Panasonic Mobile Communications Co Ltd
Priority date: 1984-06-22
Filing date: 1984-06-22
Publication date: 1986-01-14
Also published as: JPH0413719B2

Abstract

(57) [Abstract] This publication consists of application data that predates electronic filing, so no abstract data is recorded.

Description

DETAILED DESCRIPTION OF THE INVENTION

Field of the Invention

The present invention relates to a word speech recognition method that recognizes a word by matching input speech against a word dictionary written in phoneme notation.

Structure of the conventional example and its problems

FIG. 1 is a functional block diagram of a device that executes the conventional word speech recognition method. In FIG. 1, 1 is a parameter extraction unit that creates a time series of parameters from the input speech; 2 is a probability density calculation unit that matches the parameters against the phoneme standard patterns and calculates the probability density of each phoneme; and 3 is a word recognition unit that performs segmentation for each phoneme, likelihood calculation, and word similarity calculation. In addition, 4 is a phoneme standard pattern unit that stores the phoneme standard patterns, which represent the distribution of the various parameters for each phoneme in the form of a per-phoneme mean (μ_ℓ) and a covariance matrix between the parameters (Σ_ℓ); and 5 is a word dictionary unit that stores a word dictionary in which every word to be recognized is written as a string of phoneme symbols.

In this word dictionary, for example, the words "asahi" and "tochi" are written as "ASAHI" and "TOCI", respectively.
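As a minimal sketch of the word dictionary described above, the Python snippet below stores each word as a string of phoneme symbols. The names `WORD_DICTIONARY` and `phoneme_sequence` are hypothetical, and the one-character-per-phoneme encoding is an assumption made for illustration; the publication only states that words are written as phoneme symbol strings.

```python
# Hypothetical in-memory word dictionary: each entry maps a word to its
# phoneme symbol string, following the "ASAHI" / "TOCI" examples.
# Assumption: every phoneme is written with exactly one character.
WORD_DICTIONARY = {
    "asahi": "ASAHI",
    "tochi": "TOCI",
}


def phoneme_sequence(word):
    """Return the list of phoneme symbols for a dictionary word."""
    return list(WORD_DICTIONARY[word])


print(phoneme_sequence("tochi"))  # ['T', 'O', 'C', 'I']
```

Keeping the entries as symbol strings mirrors the notation in the text; a real system might need multi-character symbols, which this sketch does not handle.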

Next, the operation of the conventional example is described. The parameter extraction unit 1 analyzes the input speech in 10 msec frames, extracts parameters, and creates a parameter time series. The probability density calculation unit 2 then calculates, for each frame, the probability density of each phoneme from the parameters obtained for that frame. Next, for each dictionary entry, the word recognition unit 3 segments the input into phonemes according to the dictionary phoneme sequence that makes up the entry, calculates the likelihood ℓ of each segmented interval according to equation (1) from the type of the phoneme and the interval segmented for it, and obtains the similarity of the dictionary entry as the average of the likelihoods of its phonemes. Let the phoneme be X, let N_s and N_e be the frame numbers of the start and end of the interval segmented for X, and let C_n be the parameter values in the n-th frame. The likelihood ℓ_X of the phoneme X is then defined by equation (1).

[Equation (1): not recoverable from the source text]

φ_ℓ(C_n) represents the probability density of a phoneme ℓ and is defined by equation (2).

$$\phi_\ell(C_n) = \frac{1}{(2\pi)^{N/2}\,|\Sigma_\ell|^{1/2}} \exp\left[-\frac{1}{2}(C_n-\mu_\ell)^{T}\,\Sigma_\ell^{-1}(C_n-\mu_\ell)\right] \qquad (2)$$

C_n: the N parameters in the n-th frame (vector)
μ_ℓ: mean of the parameters of a phoneme ℓ (vector)
Σ_ℓ: covariance matrix of the parameters of a phoneme ℓ

In equation (1), the range of ℓ in the summation in the denominator of the probability-density quotient depends on what the phoneme X is; for example, when X is the phoneme A (ア), ℓ ranges over the five vowels A, E, I, O, and U. The word similarity L_M is obtained from these likelihoods for each dictionary entry according to equation (3), and the dictionary entry with the largest L_M is taken as the recognized word.
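The probability-density normalization and the word-level decision described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the publication's implementation: the density uses a diagonal covariance instead of the full covariance matrix, the way per-frame values are accumulated over a segment is left out because equation (1) is not available, and all function names and toy values are hypothetical.

```python
import math


def diag_gaussian_density(c, mean, var):
    """Gaussian density of parameter vector c.

    Assumption: diagonal covariance (a simplification; the text uses a
    full covariance matrix for each phoneme).
    """
    log_p = 0.0
    for x, m, v in zip(c, mean, var):
        log_p += -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
    return math.exp(log_p)


def normalized_density(c, target, group, patterns):
    """Density of `target` divided by the summed density over `group`.

    `group` plays the role of the phoneme range in the denominator,
    e.g. the five vowels when the target is a vowel.
    """
    num = diag_gaussian_density(c, *patterns[target])
    den = sum(diag_gaussian_density(c, *patterns[p]) for p in group)
    return num / den


def word_similarity(phoneme_likelihoods):
    """Average of the per-phoneme likelihoods of one dictionary entry."""
    return sum(phoneme_likelihoods) / len(phoneme_likelihoods)


def recognize(likelihoods_by_word):
    """Return the dictionary word whose similarity is largest."""
    return max(likelihoods_by_word,
               key=lambda w: word_similarity(likelihoods_by_word[w]))


# Toy standard patterns: (mean vector, variance vector) per phoneme.
patterns = {
    "A": ([1.0, 0.0], [1.0, 1.0]),
    "I": ([-1.0, 0.0], [1.0, 1.0]),
}
frame = [1.0, 0.0]
score_a = normalized_density(frame, "A", ["A", "I"], patterns)
score_i = normalized_density(frame, "I", ["A", "I"], patterns)

# Toy per-phoneme likelihoods for two competing dictionary entries.
candidates = {
    "tochi": [0.8, 0.6, 0.7, 0.9],
    "toshi": [0.8, 0.6, 0.3, 0.9],
}
best = recognize(candidates)
print(score_a > score_i, best)
```

Normalizing over a phoneme group makes the scores of competing phonemes comparable within a frame, and averaging over phonemes keeps entries of different lengths on the same scale.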

$$L_M = \frac{1}{NP}\sum_{j=1}^{NP} \ell_j \qquad (3)$$

L_M: similarity of the M-th word in the dictionary
ℓ_j: likelihood of the j-th phoneme in the dictionary phoneme sequence
NP: number of phonemes in the dictionary entry

FIG. 2 shows the temporal changes of the probability density values φ_T, φ_O, φ_S, and φ_I of the phoneme symbols (T), (O), (S), and (I) in the standard patterns, which correspond to the phonemes /t/, /o/, /c/, and /i/ when "tochi" (土地, /toci/) is uttered, together with the speech power P.

Here, the phoneme symbol in the standard patterns that corresponds to the phoneme /C/ is (S), which represents the fricative group. In FIG. 2, when the dictionary word /TOCI/ is assumed, segmentation and likelihood calculation for the phoneme /C/ proceed as follows. First, the part (6-7) where the value of the speech power P is lower than the preset threshold (イ) is detected as the silent part immediately before the burst. Next, the rear end 8 of the phoneme /C/ is detected using the probability density value φ_S of the phoneme symbol (S) in the standard patterns corresponding to /C/, and the interval (6-8) is taken as the segmentation interval of /C/. If no silent part can be detected, /C/ cannot be segmented and no likelihood is calculated. The likelihood of /C/ is then calculated according to equation (1), using the probability density of (S) in the interval (7-8).
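The threshold-based detection of the silent part described above can be sketched as follows. This is a simplified illustration: the function name `find_silent_part`, the toy power values, and the choice to return the first qualifying run of frames are assumptions, since the publication does not say how a run is selected.

```python
def find_silent_part(power, threshold):
    """Return (start, end) frame indices of the first run of frames whose
    power is below `threshold`, or None if no frame falls below it.

    `end` is the index of the last frame in the run (inclusive).
    """
    start = None
    for n, p in enumerate(power):
        if p < threshold:
            if start is None:
                start = n
        elif start is not None:
            return (start, n - 1)
    if start is not None:
        return (start, len(power) - 1)
    return None


# Toy per-frame speech power with a clear gap before a burst.
power = [9.0, 8.5, 1.0, 0.5, 0.8, 7.0, 8.0]
print(find_silent_part(power, threshold=2.0))   # (2, 4)

# No frame is below the threshold, so no segmentation is possible.
print(find_silent_part([9.0, 8.0, 7.5], 2.0))   # None
```

Returning `None` corresponds to the case in the text where no silent part is detected and the phoneme is neither segmented nor scored.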

FIG. 3 shows the temporal changes of the probability density values φ_T, φ_O, φ_S, and φ_I of the phoneme symbols (T), (O), (S), and (I) in the standard patterns, which correspond to the phonemes /t/, /o/, /s/, and /i/ when "toshi" (都市, /tosi/) is uttered, together with the speech power P. Here, the phoneme symbol in the standard patterns that corresponds to the phoneme /S/ is (S), which represents the fricative group. This is the same phoneme symbol that corresponds to the phoneme /C/.

In FIG. 3, when the dictionary word /TOSI/ is assumed, segmentation and likelihood calculation for the phoneme /S/ proceed as follows. First, the ends of /S/ are detected using the probability density value of the phoneme symbol (S) in the standard patterns corresponding to /S/, and the interval (9-10) is taken as the segmentation interval of /S/. Next, the likelihood is calculated according to equation (1), using the probability density values of (S) in the interval (9-10).

Now suppose that, in FIG. 3, the dictionary word /toci/ is assumed for the input speech /tosi/. Segmentation and likelihood calculation for the phoneme /C/ begin by detecting, as the silent part immediately before the burst, the part where the speech power is below the preset threshold. Within the true /S/ interval the speech power forms a shallow dip, as seen in the interval (9-10) of FIG. 3, so if the speech power falls below the threshold, that interval (11-12) is regarded as the silent part immediately before the burst of /C/. Because a silent part has been detected, the rear end 10 is then detected using the probability density value of the phoneme symbol (S) in the standard patterns corresponding to /C/, the interval (9-10) is taken as the segmentation interval of /C/, and the likelihood is calculated according to equation (1), using the probability density values of (S) in the interval (12-10). As a result, when the dictionary word /toci/ is assumed for the input speech /tosi/, the likelihood obtained for /C/ is about the same as the likelihood of the true phoneme /S/. This makes /S/ and /C/ hard to distinguish, so words containing /S/ or /C/ were prone to misrecognition.

Object of the Invention

The present invention eliminates the drawback of the conventional example described above. Its object is to improve the accuracy of the likelihood calculation and thereby improve the word recognition rate.

Structure of the Invention

To achieve this object, the present invention changes how the likelihood of a plosive is calculated. The segmented interval of a plosive consists of the silent interval immediately before the burst and the interval in which the burst, or the burst followed by frication, occurs. Instead of calculating the likelihood using only the probability density values of the phoneme in the burst interval, the invention also uses parameters such as the speech power, and calculates the likelihood over the whole segmentation interval of the plosive, including the silent interval. This improves the accuracy of the likelihood calculation.

Description of the Embodiment

The configuration of an embodiment of the present invention is described below with reference to the drawings. In FIG. 1, the phoneme standard patterns are the same as in the conventional example. The word dictionary writes each word to be recognized as a string of phoneme symbols. The parameter time series obtained by parameter extraction is also the same as in the conventional example.

Next, the operation of this embodiment is described. First, per-frame parameters are obtained from the input speech, and the values of those parameters are used to calculate the probability density given by each phoneme standard pattern. For each dictionary entry, the phoneme X is segmented according to the dictionary phoneme sequence that makes up the entry, and the likelihood ℓ_X of the interval segmented for X is calculated. When the likelihood of a plosive is calculated, a likelihood is obtained not only from the probability density values of the phoneme in the part of the segmented interval that bursts, or bursts and then fricates, but also for the silent part immediately before the burst, using the value of the speech power and its temporal change. The likelihood of the plosive is then obtained from the likelihoods of the two parts.

In FIG. 2, when the original word /TOCI/ is assumed for the input speech /toci/, the plosive /C/ is segmented as in the conventional example: the silent part (6-7) immediately before the burst and the part (7-8) that bursts and fricates are detected, and the interval (6-8) is taken as the segmentation interval of /C/. The likelihood ℓ_C of the phoneme /C/ is then calculated in three steps. First, the likelihood ℓ_Q of the silent part is obtained from the minimum value (ロ) of the speech power P in the silent part (6-7) and the value of the cepstral distance CD between adjacent frames near the rear end 7 of the silent part. Next, the likelihood ℓ_C1 of the part (7-8) that bursts and fricates is obtained from the probability density values. Finally, the likelihood ℓ_C of /C/ is obtained from ℓ_Q and ℓ_C1.

[Equations for CD, ℓ_Q, and ℓ_C: only fragments survive in the source text. They indicate that ℓ_Q is formed from two terms, ℓ_Q1 and ℓ_Q2, and that ℓ_C is formed from ℓ_C1 and ℓ_Q; the operators are not recoverable.]

In FIG. 3, when the dictionary word /toci/ is assumed for the input speech /tosi/, the likelihood of the phoneme /C/ is calculated as in the example above, starting with the likelihood ℓ_Q of the silent part (11-12). In this case, the minimum value of the speech power P in the silent part is larger than in FIG. 2, and the cepstral distance CD between adjacent frames near the rear end 12 of the silent part is small, so ℓ_Q is smaller than in FIG. 2. The likelihood of /C/ is therefore also smaller. As a result, the likelihood of /S/ obtained when the dictionary word /tosi/ is assumed for the input speech /tosi/ is larger than the likelihood of /C/, and the ability to discriminate /S/ from /C/ improves. This has the advantage of improving the recognition rate for words containing /S/ or /C/.

Effects of the Invention

With the configuration described above, when the likelihood of a plosive is calculated, the present invention uses two likelihoods together. The first is obtained, within the segmented interval, from the distance between the standard pattern of the phoneme or phoneme group and the phoneme in the part where the burst, or the burst accompanied by frication, occurs. The second is obtained from the value of the speech power in the silent part immediately before the burst and the magnitude of its temporal change. Combining the two allows the likelihood to be obtained more accurately than in the conventional example.
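The combination of the two likelihoods can be illustrated with the Python sketch below. Everything specific in it is an assumption: the Euclidean form of the cepstral distance, the clipped linear scoring functions, their breakpoints, and the use of products to combine terms are hypothetical stand-ins, because the publication's own equations are not recoverable. Only the direction of the effect is taken from the text: a higher power minimum and a smaller cepstral distance give a lower silent-part likelihood.

```python
import math


def cepstral_distance(c_prev, c_next):
    """Distance between the cepstral vectors of two adjacent frames.

    Assumption: Euclidean distance; the original definition is unknown.
    """
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c_prev, c_next)))


def ramp(x, lo, hi):
    """Clipped linear ramp: 0 at or below lo, 1 at or above hi."""
    return min(1.0, max(0.0, (x - lo) / (hi - lo)))


def silence_likelihood(min_power, cd,
                       p_lo=1.0, p_hi=5.0, cd_lo=0.5, cd_hi=2.0):
    """Score of the silent part immediately before the burst.

    l_q1 falls as the power minimum rises; l_q2 rises with the cepstral
    distance at the end of the silent part. Breakpoints are hypothetical.
    """
    l_q1 = 1.0 - ramp(min_power, p_lo, p_hi)
    l_q2 = ramp(cd, cd_lo, cd_hi)
    return l_q1 * l_q2  # assumed combination


def plosive_likelihood(l_c1, l_q):
    """Combine the burst-part and silent-part likelihoods (assumed product)."""
    return l_c1 * l_q


# True plosive: deep silence followed by a sharp spectral change.
cd_plosive = cepstral_distance([0.0, 0.0], [1.5, 2.0])
l_q_plosive = silence_likelihood(min_power=0.5, cd=cd_plosive)

# Fricative with a shallow power dip and little spectral change.
cd_fricative = cepstral_distance([1.0, 1.0], [1.6, 1.8])
l_q_fricative = silence_likelihood(min_power=3.0, cd=cd_fricative)

# Same burst-part likelihood in both cases, as in the conventional method.
l_c_plosive = plosive_likelihood(0.8, l_q_plosive)
l_c_fricative = plosive_likelihood(0.8, l_q_fricative)
print(l_c_plosive > l_c_fricative)  # True
```

With identical burst-part likelihoods, the silent-part term is what separates the two cases, which is the discrimination between /C/ and /S/ that the text describes.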

[Brief explanation of drawings]

FIG. 1 is a block diagram of a device that implements the word speech recognition method of both the conventional example and an embodiment of the present invention. FIG. 2 shows the temporal changes of the probability density of each phoneme, the speech power, and the cepstral distance between adjacent frames when "tochi" (土地) is uttered. FIG. 3 shows the temporal changes of the probability density of each phoneme, the speech power, and the cepstral distance between adjacent frames when "toshi" (都市) is uttered. FIG. 4 illustrates ℓ_Q1 and ℓ_Q2, which are used when obtaining the likelihood of the silent part.

1: parameter extraction unit; 2: probability density calculation unit; 3: word recognition unit; 4: phoneme standard pattern unit; 5: word dictionary unit.

Agent: Patent attorney Toshio Nakao and one other

Claims

(1) A word speech recognition method in which words in input speech are recognized using a word dictionary that writes each word to be recognized as a string of phoneme symbols, and standard patterns for each phoneme or phoneme group expressed in terms of the acoustic parameters of that phoneme or phoneme group, wherein: the input speech is matched against each dictionary entry in the word dictionary; the input speech is segmented phoneme by phoneme according to the dictionary phoneme sequence that makes up each dictionary entry; for each segmented speech interval, the likelihood between the phoneme in the dictionary entry and the input speech is obtained using the distance between the standard pattern of the phoneme or phoneme group and that phoneme; and the similarity between the dictionary entry and the input speech is calculated from these likelihood values to recognize the word; the method being characterized in that, when the likelihood of a plosive (including affricates) is calculated, the likelihood obtained from the distance between the standard pattern of the phoneme or phoneme group and the phoneme in the part where the burst, or the burst accompanied by frication, occurs is used together with the likelihood obtained from the value of the speech power in the silent part immediately before the burst and the magnitude of its temporal change.

(2) The word speech recognition method according to claim 1, in which a standard pattern expressed by the distribution of the acoustic parameters of each phoneme or phoneme group is used as the standard pattern for that phoneme or phoneme group, and the probability density with which the segmented speech interval is generated from the phoneme is used as the distance measure between the standard pattern of the phoneme or phoneme group and that phoneme.

(3) The word speech recognition method according to claim 1, in which the cepstral distance between adjacent frames is used as the magnitude of the temporal change in the speech power in the silent part immediately before the burst.