JPH045391B2

Publication number: JPH045391B2
Application number: JP59058173A
Authority: JP
Priority date: 1984-03-28
Filing date: 1984-03-28
Publication date: 1992-01-31
Also published as: JPS60202494A

Description

DETAILED DESCRIPTION OF THE INVENTION (Field of Industrial Application) The present invention relates to a word speech recognition method that recognizes words by matching input speech against a word dictionary written in phoneme notation.

(Configuration of the Conventional Example and Its Problems) FIG. 1 is a functional block diagram of an apparatus for carrying out both an example of a conventional word speech recognition method and an embodiment of the word speech recognition method of the present invention. The conventional example is described with reference to FIGS. 1, 2, and 3. In FIG. 1, 1 is a parameter extraction unit that creates a time series of parameters from the input speech; 2 is a probability density calculation unit that matches the parameters against the phoneme standard patterns and calculates the probability density of each phoneme; and 3 is a word recognition unit that performs segmentation for each phoneme, likelihood calculation, and word similarity calculation. 4 is a phoneme standard pattern unit that stores the phoneme standard patterns, which represent the distribution of the parameters for each phoneme in the form of a per-phoneme mean (μ_i) and a covariance matrix (Σ_i) between the parameters. 5 is a word dictionary unit that stores a word dictionary in which every word to be recognized is written as a string of phoneme symbols. In this word dictionary, for example, the words "Sapporo" and "Kannai" are written as "SAQPORO" and "KAN=NAI".

Next, the operation of the conventional example is described. The parameter extraction unit 1 analyzes the input speech in 10 ms frames, extracts parameters, and creates a parameter time series. The probability density calculation unit 2 then matches the parameters obtained for each frame against the phoneme standard patterns in the phoneme standard pattern unit 4 and calculates the probability density of each phoneme. Next, for each dictionary entry, the word recognition unit 3 segments the speech into phonemes according to the dictionary phoneme sequence that constitutes the entry, calculates from the phoneme type and the interval segmented for that phoneme the likelihood l according to the likelihood formula, and obtains the similarity as the average of the likelihoods of the phonemes in that entry. Here, let the phoneme be X, let N_s and N_e be the frame numbers at the start and end of the interval segmented for X, and let C_n be the values of the parameters in the n-th frame. The likelihood l_X of phoneme X is then defined by the following formula.

[Formula for l_X: not reproduced in this text.]

φ_i(C_n) denotes the probability density of a phoneme i and is defined as follows.

$$\phi_i(C_n) = \frac{1}{(2\pi)^{N/2}\,|\Sigma_i|^{1/2}} \exp\left[-\frac{1}{2}\,(C_n-\mu_i)^{T}\,\Sigma_i^{-1}\,(C_n-\mu_i)\right]$$

C_n: the N parameters in the n-th frame (vector)
μ_i: the mean of the parameters of phoneme i (vector)
Σ_i: the covariance matrix

In the likelihood formula, the range of i in the summation in the denominator of the probability density division depends on what the phoneme X is. For example, when X is the phoneme A (ア), i ranges over the five vowels A, E, I, O, and U. The word similarity L_M obtained in this way was calculated for each dictionary entry according to the word similarity formula, and the dictionary entry with the largest L_M was taken as the recognized word.
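The probability density formula above is the standard multivariate normal density, so it can be sketched directly in code. The following is a minimal Python illustration under stated assumptions, not the implementation used in the apparatus: the function names `solve_and_det` and `phoneme_density` are hypothetical, and the linear system is solved by plain Gaussian elimination only to keep the example free of external dependencies.

```python
import math


def solve_and_det(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination and return (x, det(matrix))."""
    n = len(rhs)
    # Work on an augmented copy so the caller's data is left untouched.
    a = [[float(v) for v in row] + [float(b)] for row, b in zip(matrix, rhs)]
    det = 1.0
    for col in range(n):
        # Partial pivoting: pick the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if a[pivot][col] == 0.0:
            raise ValueError("covariance matrix is singular")
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign of the determinant
        det *= a[col][col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n + 1):
                a[row][k] -= factor * a[col][k]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(a[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (a[row][n] - tail) / a[row][row]
    return x, det


def phoneme_density(c_n, mu_i, sigma_i):
    """Gaussian density phi_i(C_n) for a frame vector c_n, mean mu_i, covariance sigma_i."""
    n_params = len(c_n)
    diff = [c - m for c, m in zip(c_n, mu_i)]
    # solved = Sigma_i^{-1} (C_n - mu_i), obtained without forming the inverse.
    solved, det = solve_and_det(sigma_i, diff)
    mahalanobis = sum(d * s for d, s in zip(diff, solved))
    norm = (2.0 * math.pi) ** (n_params / 2.0) * math.sqrt(det)
    return math.exp(-0.5 * mahalanobis) / norm
```

In practice one such density would be evaluated per frame and per phoneme standard pattern, giving the time-varying curves that the segmentation step works on.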

$$L_M = \frac{1}{NP}\sum_{j=1}^{NP} l_j$$

L_M: similarity of the M-th word in the dictionary
l_j: likelihood of the j-th phoneme in the dictionary phoneme sequence
NP: number of phonemes in the dictionary phoneme sequence

FIG. 2 shows the change over time of the probability density of each phoneme in the /AN=NA/ portion when /KAN=NAI/ (Kannai, 関内) is uttered. In this case, the segmentation and likelihood calculation for the /AN=NA/ portion are performed according to the changes over time of the probability density values φ_A, φ_N=, φ_N, and φ_A of the phonemes /A/, /N=/, /N/, and /A/. For /AN=NA/, the segmented interval (a-b) is assigned to the first /A/, and l_A is calculated from φ_A according to the likelihood formula. l_N=, l_N, and l_A are calculated in the same way for /N=/, /N/, and /A/.
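The word similarity L_M is the mean of the phoneme likelihoods of a dictionary entry, and recognition selects the entry with the largest L_M. A minimal Python sketch of those two steps follows. The function names are hypothetical, and the phoneme likelihoods are taken as already computed, since the likelihood formula itself is not reproduced in this text.

```python
def word_similarity(phoneme_likelihoods):
    """Return L_M, the mean of the likelihoods l_j over the NP phonemes of one entry."""
    if not phoneme_likelihoods:
        raise ValueError("a dictionary entry needs at least one phoneme")
    return sum(phoneme_likelihoods) / len(phoneme_likelihoods)


def recognize(likelihoods_by_entry):
    """Return the dictionary entry whose word similarity L_M is largest.

    likelihoods_by_entry maps an entry such as "KAN=NAI" to the list of
    likelihoods obtained for its phonemes after segmentation.
    """
    return max(
        likelihoods_by_entry,
        key=lambda entry: word_similarity(likelihoods_by_entry[entry]),
    )
```

For instance, calling `recognize` with made-up likelihood lists for "KAN=NAI" and "SAQPORO" returns whichever entry has the higher average. Because L_M is an average, a single poorly segmented phoneme pulls the whole word score down, which is why the segmentation errors described for FIG. 3 matter.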

FIG. 3 shows the change over time of the probability density of each phoneme when the same word /KAN=NAI/ is uttered by a different speaker. In FIG. 3, the segmentation and likelihood calculation for the /AN=NA/ portion are again performed according to the changes over time of φ_A, φ_N=, φ_N, and φ_A. When /N=/ is segmented, however, the probability density φ_N of the following phoneme /N/ does not become sufficiently large in the /N/ interval. Instead, φ_N= keeps a large value throughout the /N/ interval, up to the start of the interval of the next phoneme /A/. The segmented interval for /N=/ therefore becomes (g-h), which includes the /N/ interval. As a result, the segmentation of /N/, the phoneme following /N=/, goes wrong and its likelihood l_N is low, so the conventional method had the drawback that words containing a moraic nasal (撥音) immediately followed by a nasal consonant were easily misrecognized.

(Object of the Invention) The present invention eliminates the drawback of the conventional example described above. Its object is to improve the accuracy of the likelihood calculation and thereby improve the word recognition rate.

(Structure of the Invention) To achieve the above object, when segmenting and calculating the likelihood of a phoneme sequence in which a moraic nasal (撥音) is followed by a nasal consonant, the present invention segments the two consecutive phonemes together and calculates a single likelihood for them, thereby improving the accuracy of both the segmentation and the likelihood calculation.

(Description of the Embodiment) An embodiment of the present invention is described below with reference to FIGS. 1 and 3. In FIG. 1, the phoneme standard patterns are the same as in the conventional example. The word dictionary writes each word to be recognized as a string of phoneme symbols. The parameter time series obtained by parameter extraction is also the same as in the conventional example.

The operation of this embodiment is as follows. First, the parameter extraction unit 1 obtains the parameters for each frame from the input speech, and the probability density calculation unit 2 calculates the probability densities from those parameter values and each phoneme standard pattern. Next, for each dictionary entry in the word dictionary unit 5, the word recognition unit 3 segments each phoneme X according to the dictionary phoneme sequence that constitutes the entry and calculates the likelihood l_X from the phoneme X and the interval segmented for it. When the dictionary phoneme sequence contains a moraic nasal followed by a nasal consonant, however, the probability density of the first phoneme, the moraic nasal, remains dominant until the end of the following nasal consonant. The two consecutive phonemes are therefore segmented together, and the likelihood is calculated for the resulting interval.

Looking at the probability densities φ_A, φ_N=, φ_N, and φ_A of the phonemes /A/, /N=/, /N/, and /A/ in the /AN=NA/ portion of FIG. 3, φ_N= is larger than φ_N in the /N/ portion and stays large until the start (h) of /A/. The two consecutive phonemes /N=N/ are therefore segmented together from g to h using the values of φ_N=, and the likelihood l_N=N for the two phonemes is obtained for the segmented interval (g-h) from the values of φ_N= according to the corresponding formula. In contrast, for ordinary phonemes the likelihood is calculated with the same formula as in the conventional method.
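The merging step described above can be sketched as a preprocessing pass over a dictionary phoneme sequence: a moraic nasal /N=/ (撥音) directly followed by a nasal consonant becomes one unit, and that unit is scored with the density of the moraic nasal. This is only an illustration under assumptions. The names `merge_nasal_pairs` and `scoring_phoneme` are hypothetical, and the set of nasal consonants is assumed to be /N/ and /M/, whereas the source only shows /N/.

```python
MORAIC_NASAL = "N="   # symbol used in the dictionary notation, e.g. KAN=NAI
NASALS = {"N", "M"}   # assumption: the source only gives /N/ as an example


def merge_nasal_pairs(phonemes):
    """Group a dictionary phoneme sequence into segmentation units.

    Each unit is a tuple of phoneme symbols. A moraic nasal immediately
    followed by a nasal consonant forms one two-phoneme unit; every other
    phoneme forms a unit of its own.
    """
    units = []
    i = 0
    while i < len(phonemes):
        is_pair = (
            phonemes[i] == MORAIC_NASAL
            and i + 1 < len(phonemes)
            and phonemes[i + 1] in NASALS
        )
        if is_pair:
            units.append((phonemes[i], phonemes[i + 1]))
            i += 2
        else:
            units.append((phonemes[i],))
            i += 1
    return units


def scoring_phoneme(unit):
    """Return the phoneme whose density is used to segment and score a unit.

    For a merged pair this is the first phoneme, the moraic nasal, because
    its density stays dominant through the following nasal consonant.
    """
    return unit[0]
```

With this grouping, the sequence for /KAN=NAI/ yields the units K, A, N=N, A, I, so the interval (g-h) of FIG. 3 is treated as a single segment instead of being split between /N=/ and /N/.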

In this embodiment, a moraic nasal and the nasal consonant that follows it are treated as a single unit for segmentation and likelihood calculation, which has the advantage of improving the recognition rate for words that contain such a pair of consecutive phonemes.

The symbols are used in the same way as in the preceding formulas.

(Effects of the Invention) As described above, by segmenting a moraic nasal and the following nasal consonant together and calculating the likelihood for the pair, the present invention performs segmentation and likelihood calculation more accurately than the conventional method.

[Brief explanation of drawings]

FIG. 1 is a diagram for explaining the word speech recognition method of the conventional example and of an embodiment of the present invention. FIG. 2 is a diagram showing the changes over time of the probability densities φ_A, φ_N=, φ_N, and φ_A of the phonemes /A/, /N=/, /N/, and /A/ in the /AN=NA/ portion when /KAN=NAI/ (Kannai) is uttered. FIG. 3 is a diagram showing the changes over time of φ_A, φ_N=, φ_N, and φ_A when a speaker different from the one in FIG. 2 utters /KAN=NAI/.

1: parameter extraction unit; 2: probability density calculation unit; 3: word recognition unit; 4: phoneme standard pattern unit; 5: word dictionary unit.

Claims

1. A word speech recognition method that recognizes words in input speech using a word dictionary, in which the words to be recognized are written as symbol strings in phoneme units, and a standard pattern for each phoneme, expressed as the distribution of acoustic parameters for that phoneme, wherein: the input speech is matched against each dictionary entry of the word dictionary; the input speech is segmented phoneme by phoneme according to the dictionary phoneme sequence that constitutes each dictionary entry; the probability density that a segmented speech interval was generated from a phoneme is calculated using the standard pattern of that phoneme; the similarity between each dictionary entry and the input speech is calculated using the probability density values for the segmented speech intervals; and, in recognizing the word, for a phoneme sequence in a dictionary word in which a moraic nasal is followed by a nasal consonant, the two consecutive phonemes are segmented together and the likelihood is calculated for them.