JPH0412480B2

Info

Publication number: JPH0412480B2
Application number: JP59003589A
Authority: JP
Inventors: Hisanori Kanezashi; Kunio Akiba; Takao Irumano
Original assignee: Matsushita Communication Industrial Co Ltd
Current assignee: Panasonic Mobile Communications Co Ltd
Priority date: 1984-01-13
Filing date: 1984-01-13
Publication date: 1992-03-04
Also published as: JPS60149095A

Description

DETAILED DESCRIPTION OF THE INVENTION

(Field of Industrial Application) The present invention relates to a word speech recognition method that recognizes words by matching input speech against a word dictionary written in phoneme notation.

(Configuration of the Conventional Example and Its Problems) FIG. 1 is a functional block diagram of an apparatus for carrying out an example of a conventional word speech recognition method. The conventional example is described with reference to FIG. 1. In FIG. 1, 1 is a parameter extraction unit that creates a time series of parameters from the input speech; 2 is a probability density calculation unit that matches the parameters against the phoneme standard patterns and calculates the probability density of each phoneme; and 3 is a word recognition unit that performs segmentation for each phoneme, likelihood calculation, word similarity calculation, and so on. Further, 4 is a phoneme standard pattern unit that stores the phoneme standard patterns, prepared in advance through preliminary experiments or the like, in which the distribution of the parameters for each phoneme is expressed as a mean value (μ_i) for that phoneme and a covariance matrix (Σ_i) between the parameters; and 5 is a word dictionary unit that stores a word dictionary in which every word to be recognized is written as a symbol string in units of phonemes. In the word dictionary, for example, the words "Sapporo" and "Asahikawa" are written as "SAQPORO" and "ASAHIKAWA", respectively.
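The two stored resources described above, the phoneme standard patterns and the word dictionary, can be pictured as simple data structures. The following is a minimal sketch only: the names `StandardPattern`, `WORD_DICTIONARY`, and `phoneme_sequence` are hypothetical, and treating each character of the dictionary string as one phoneme symbol is an assumption made for illustration, not something the text states.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class StandardPattern:
    """Reference pattern of one phoneme: mean vector and covariance matrix."""
    mean: List[float]
    cov: List[List[float]]


# Word dictionary: each word to be recognized is stored as a phoneme string.
# The two entries are the examples given in the text.
WORD_DICTIONARY: Dict[str, str] = {
    "Sapporo": "SAQPORO",
    "Asahikawa": "ASAHIKAWA",
}


def phoneme_sequence(word: str) -> List[str]:
    """Return the dictionary phoneme sequence of a word.

    Assumption: one character of the dictionary string is one phoneme symbol.
    """
    return list(WORD_DICTIONARY[word])


# A toy pattern with two parameters per frame (invented numbers).
toy_pattern = StandardPattern(mean=[0.0, 0.0], cov=[[1.0, 0.0], [0.0, 1.0]])
print(phoneme_sequence("Sapporo"))
```

Keeping the dictionary as plain phoneme strings mirrors the notation in the text; a real system would likely need multi-character phoneme symbols.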

Next, the operation of the conventional example is described. The parameter extraction unit 1 analyzes the input speech in 10 ms frames, extracts parameters, and creates a parameter time series. The probability density calculation unit 2 matches the parameters obtained for each frame against the phoneme standard patterns and calculates, from the parameter values, the probability density of each phoneme. Next, for each dictionary item, the word recognition unit 3 segments the input into phonemes according to the dictionary phoneme sequence that makes up that item, calculates the likelihood l from the type of each phoneme and the interval segmented for it, and obtains the similarity for that dictionary item as the average of the likelihoods of its phonemes. Here, let the phoneme be X, let N_s and N_e be the frame numbers at the start and end of the interval segmented for X, and let C_n be the parameter values in the n-th frame. The likelihood l_X of phoneme X is then defined by the following formula.

[Formula for l_X not reproduced in the source text.]

Here φ_i(C_n) denotes the probability density of a phoneme i and is defined as follows.

$$\phi_i(C_n) = \frac{1}{(2\pi)^{j/2}\,|\Sigma_i|^{1/2}} \exp\left[-\frac{1}{2}\,(C_n-\mu_i)^{T}\,\Sigma_i^{-1}\,(C_n-\mu_i)\right]$$

where
C_n: the j parameters of one frame (vector)
μ_i: mean of the parameters of phoneme i (vector)
Σ_i: covariance matrix

In the formula for l_X, the range of the summation over i in the denominator of the probability density ratio depends on what the phoneme X is. For example, when X is the phoneme A, i ranges over the five vowels A, E, I, O, and U. The word similarity L_M was computed from these likelihoods for each dictionary item, and the dictionary item with the largest L_M was taken as the recognized word.
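The probability density above is the standard multivariate normal density, so it can be evaluated directly from a mean vector and a covariance matrix. The sketch below does this in plain Python with a Cholesky factorization, and also shows a per-frame ratio whose denominator sums over a restricted group of phonemes, as the text describes for the vowels. The function names, the `patterns` layout, and the toy numbers are assumptions for illustration. How the per-frame values are combined over the frames N_s to N_e is not given in the text, so that step is deliberately left out.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def cholesky(a: Matrix) -> List[List[float]]:
    """Lower-triangular L with L L^T = a (a symmetric positive definite)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for r in range(n):
        for c in range(r + 1):
            s = a[r][c] - sum(low[r][k] * low[c][k] for k in range(c))
            low[r][c] = math.sqrt(s) if r == c else s / low[c][c]
    return low


def gaussian_density(x: Vector, mean: Vector, cov: Matrix) -> float:
    """Multivariate normal density phi(x) for the given mean and covariance."""
    j = len(x)
    low = cholesky(cov)
    # Forward substitution: solve low * y = (x - mean).
    y: List[float] = []
    for r in range(j):
        acc = x[r] - mean[r] - sum(low[r][k] * y[k] for k in range(r))
        y.append(acc / low[r][r])
    quad = sum(v * v for v in y)  # (x - mean)^T cov^-1 (x - mean)
    sqrt_det = math.prod(low[r][r] for r in range(j))  # |cov|^(1/2)
    return math.exp(-0.5 * quad) / ((2.0 * math.pi) ** (j / 2) * sqrt_det)


def frame_ratio(x: Vector, target: str,
                patterns: Dict[str, Tuple[Vector, Matrix]]) -> float:
    """Density of `target` divided by the sum over the phonemes in `patterns`.

    `patterns` holds only the group used in the denominator, for example
    the five vowels when the target is a vowel.
    """
    dens = {p: gaussian_density(x, m, c) for p, (m, c) in patterns.items()}
    return dens[target] / sum(dens.values())


# Toy two-parameter patterns for two vowels (invented numbers).
toy = {
    "A": ([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]),
    "I": ([2.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]),
}
print(frame_ratio([0.0, 0.0], "A", toy))
```

The Cholesky route avoids forming the inverse covariance explicitly and yields the square root of the determinant as the product of the diagonal of the factor.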

The word similarity L_M is given by

$$L_M = \frac{1}{\mathit{NP}} \sum_{i=1}^{\mathit{NP}} l_i$$

where
L_M: similarity of the M-th word in the dictionary
l_i: likelihood of phoneme i in the dictionary phoneme sequence
NP: number of phonemes in the dictionary entry

Now, in a segmented interval, let α be the true phoneme of the word to be recognized; the probability density of phoneme α is φ_α. Let φ_β be the probability density obtained when a different phoneme β is assumed instead. If φ_α and φ_β are close in value, or if φ_α is smaller than φ_β, the similarity of a word that should not be recognized can exceed the similarity of the word that should be recognized, and the word is misrecognized. This was the drawback of the conventional method.
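The word similarity and the choice of the recognized word described above reduce to an average followed by a maximum. The following is a minimal sketch; the function names are hypothetical and the likelihood values are invented solely to illustrate the drawback, in which a competing phoneme scores marginally higher than the true one.

```python
from typing import Dict, List


def word_similarity(phoneme_likelihoods: List[float]) -> float:
    """L_M: mean of the likelihoods l_i over the NP phonemes of the entry."""
    return sum(phoneme_likelihoods) / len(phoneme_likelihoods)


def recognize(likelihoods_by_word: Dict[str, List[float]]) -> str:
    """Return the dictionary item with the largest similarity L_M."""
    return max(likelihoods_by_word,
               key=lambda word: word_similarity(likelihoods_by_word[word]))


# Invented likelihoods for an utterance of /ZiKU/. The first phoneme is
# scored almost identically under /Z/ and /S/, and /S/ happens to be
# slightly higher, so the wrong entry wins.
scores = {
    "ZIKU": [0.50, 0.90, 0.80, 0.90],
    "SIKU": [0.52, 0.90, 0.80, 0.90],
}
print(recognize(scores))
```

Because the remaining phonemes are shared, the decision rests entirely on the first phoneme, which is why a small error in its likelihood flips the result.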

(Object of the Invention) The present invention eliminates the drawback of the conventional example described above. Its object is to improve the accuracy of the likelihood calculation and thereby improve the word recognition rate.

(Structure of the Invention) The present invention is a word speech recognition method that recognizes words in input speech using a word dictionary in which the words to be recognized are written as symbol strings in units of phonemes, together with standard patterns of each phoneme and of voiced and unvoiced sounds, expressed as distributions of acoustic parameters. The input speech is matched against each dictionary item of the word dictionary and segmented phoneme by phoneme according to the dictionary phoneme sequence that makes up each item, and a likelihood is calculated for each segmented interval using the standard pattern of the corresponding phoneme. The method is characterized in that, within each segmented interval, the likelihood is obtained using not only the phoneme standard pattern but also the voiced and unvoiced standard patterns, and the word is recognized by using this likelihood value to obtain the similarity between each dictionary item and the input speech.

(Description of the Embodiment) An embodiment of the present invention is described below with reference to FIGS. 2 and 3. In FIG. 2, the parameter extraction unit 1, the probability density calculation unit 2, the phoneme standard pattern unit 4, and the word dictionary unit 5 are the same as in the conventional example shown in FIG. 1.

The voiced/unvoiced standard pattern unit 6 and the voiced/unvoiced distance calculation unit 7 are not present in the conventional example of FIG. 1.

The patterns stored in the voiced/unvoiced standard pattern unit 6 express the distribution of the parameters of voiced and unvoiced sounds as the mean values for voiced and unvoiced sounds and the covariance matrices between the parameters, and are prepared in advance. The parameter time series obtained by the parameter extraction unit 1 is the same as in the conventional example.

The operation of this embodiment is now described. First, the parameter extraction unit 1 obtains the parameters for each frame from the input speech, and the probability density calculation unit 2 uses those parameter values to calculate the probability density for each phoneme standard pattern in the same way as in the conventional example. In addition, the voiced/unvoiced distance calculation unit 7 matches the parameter values obtained for each frame against the voiced and unvoiced standard patterns and calculates, frame by frame, the voiced and unvoiced distances L_V(n) and L_U(n) according to the following formulas.

$$L_V(n) = (C_n-\mu_V)^{T}\,\Sigma_V^{-1}\,(C_n-\mu_V)$$

$$L_U(n) = (C_n-\mu_U)^{T}\,\Sigma_U^{-1}\,(C_n-\mu_U)$$

where
C_n: parameters in the n-th frame (vector)
μ_V, μ_U: means of the parameters of voiced and unvoiced sounds (vectors)
Σ_V, Σ_U: covariance matrices of the parameters of voiced and unvoiced sounds

Next, for each dictionary item, phoneme X is segmented according to the dictionary phoneme sequence that makes up the item, and the likelihood l_X of the phoneme X and the interval segmented for it is calculated. In obtaining l_X, however, a term is introduced that expresses how plausible it is that the assumed phoneme is voiced or unvoiced, and the resulting likelihood l′_X is used in place of l_X.

[Formula for l′_X not reproduced in the source text.]

where Ψ_V + Ψ_U = 1.

FIG. 3 shows the change over time of φ_X and Ψ_U when /ZiKU/ (axis) and /SiKU/ (to lay) are uttered. Consider the first phoneme of each word, the voiced /Z/ and the unvoiced /S/. The value of φ_X is almost the same for /Z/ and /S/, so it is difficult to distinguish /ZiKU/ from /SiKU/ on that basis. The value of Ψ_U, however, is lower for /Z/ than for /S/, so /ZiKU/ and /SiKU/ are easily distinguished using l′_X.
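The per-frame distances L_V(n) and L_U(n) defined earlier are squared Mahalanobis distances from the frame parameters to the voiced and unvoiced standard patterns. The sketch below computes them in plain Python by solving a linear system instead of inverting the covariance matrix. The function names and all numeric values are invented for illustration. The conversion of these distances into the terms Ψ_V and Ψ_U, and the formula for l′_X, are not reproduced in the text and are therefore not implemented here.

```python
from typing import List, Sequence

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def solve(a: Matrix, b: Vector) -> List[float]:
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [b[r]] for r, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    z = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (m[r][n] - tail) / m[r][r]
    return z


def mahalanobis_sq(x: Vector, mean: Vector, cov: Matrix) -> float:
    """(x - mean)^T cov^-1 (x - mean), the form of L_V(n) and L_U(n)."""
    d = [xi - mi for xi, mi in zip(x, mean)]
    z = solve(cov, d)  # z = cov^-1 d
    return sum(di * zi for di, zi in zip(d, z))


# Toy voiced and unvoiced standard patterns with two parameters per frame.
mu_v, sigma_v = [1.0, 0.2], [[0.5, 0.1], [0.1, 0.5]]
mu_u, sigma_u = [0.1, 0.9], [[0.5, 0.1], [0.1, 0.5]]

frame = [0.9, 0.3]  # a frame that lies close to the voiced pattern
l_v = mahalanobis_sq(frame, mu_v, sigma_v)
l_u = mahalanobis_sq(frame, mu_u, sigma_u)
print(l_v < l_u)
```

A smaller distance means the frame lies closer to that standard pattern, so a frame of a voiced phoneme such as /Z/ would be expected to give L_V(n) smaller than L_U(n).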

This embodiment has the advantage that a highly accurate likelihood is obtained, because the likelihood calculation within each segmented interval makes use of the distances to the voiced and unvoiced standard patterns.

(Effects of the Invention) With the configuration described above, the present invention provides the following effect. A more accurate likelihood can be obtained by performing the likelihood calculation within each segmented interval using the distances to the voiced and unvoiced standard patterns.

[Brief Description of the Drawings]

FIG. 1 is a schematic functional block diagram of an apparatus used to carry out a conventional word speech recognition method. FIG. 2 is a schematic functional block diagram of an apparatus used to carry out a word speech recognition method according to an embodiment of the present invention. FIG. 3 is a diagram showing the change over time of the probability density φ_X and the unvoiced term Ψ_U when /ZiKU/ and /SiKU/ are uttered.

6: voiced/unvoiced standard pattern unit; 7: voiced/unvoiced distance calculation unit.

Claims

[Claims]

1. A word speech recognition method in which word recognition of input speech is performed using a word dictionary in which the words to be recognized are written as symbol strings in units of phonemes, and standard patterns of each phoneme and of voiced and unvoiced sounds expressed as distributions of acoustic parameters; the input speech is matched against each dictionary item of the word dictionary; the input speech is segmented for each phoneme according to the dictionary phoneme sequence constituting each dictionary item; and a likelihood is calculated for each segmented speech interval using the standard pattern of that phoneme; the method being characterized in that, within the segmented speech interval, the likelihood is obtained using not only the standard pattern of the phoneme but also the standard patterns of voiced and unvoiced sounds, and the word is recognized by obtaining the similarity between the dictionary item and the input speech using that likelihood value.