JPS60202496A

JPS60202496A - Word voice recognition

Info

Publication number: JPS60202496A
Application number: JP59058177A
Authority: JP
Inventors: 金指　久則; 入間野　孝雄; 秋場　国夫
Original assignee: Computer Basic Technology Research Association Corp
Current assignee: Computer Basic Technology Research Association Corp
Priority date: 1984-03-28
Filing date: 1984-03-28
Publication date: 1985-10-12
Also published as: JPH045395B2

Abstract

(57) [Abstract] This publication contains application data that predates electronic filing, so no abstract data is recorded.

Description

DETAILED DESCRIPTION OF THE INVENTION (Field of Industrial Application) The present invention relates to a word speech recognition method that recognizes words by matching input speech against a word dictionary written in phoneme notation.

(Configuration of the Conventional Example and Its Problems) A conventional word speech recognition method is described with reference to Figs. 1, 2, and 3. In Fig. 1, the word dictionary of word dictionary section 3 lists every word to be recognized as a phoneme sequence; for example, the words "Sapporo" and "Fussa" are written as /SAQPORO/ and /HUQSA/.

Parameter extraction section 1 analyzes the input speech in 10 ms frames, extracts parameters, and builds a parameter time series. Word recognition section 2 then uses these parameters, for each dictionary item, to segment the speech one phoneme at a time according to the dictionary phoneme sequence that makes up the item. For each phoneme, the likelihood L of the segmented interval is computed from the phoneme type and the parameters, and the similarity of the dictionary item is obtained as the average of its phoneme likelihoods according to equation (1).

The word similarity LM obtained in this way is computed for every dictionary item, and the dictionary item with the largest LM is taken as the recognized word.
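The averaging and selection steps described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names and the likelihood values are hypothetical, and the per-phoneme likelihoods are assumed to have been computed already.

```python
from statistics import mean


def word_similarity(phoneme_likelihoods):
    """Similarity LM of one dictionary item: the mean of its phoneme likelihoods."""
    return mean(phoneme_likelihoods)


def recognize(likelihoods_by_word):
    """Return the dictionary item whose similarity LM is largest."""
    return max(
        likelihoods_by_word,
        key=lambda word: word_similarity(likelihoods_by_word[word]),
    )


# Hypothetical per-phoneme likelihoods, chosen only for illustration.
example = {
    "SAQPORO": [0.9, 0.8, 0.7, 0.9, 0.8, 0.9, 0.8],
    "HUQSA": [0.4, 0.5, 0.3, 0.6, 0.5],
}
best = recognize(example)
```

Averaging over the number of phonemes keeps long and short dictionary items comparable, which is why the similarity is a mean rather than a sum.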

$$LM = \frac{1}{NP}\sum_{i=1}^{NP} L_i \qquad (1)$$

Here L_i is the likelihood of the i-th phoneme and NP is the number of phonemes in the dictionary item. In the above method, a geminate consonant (sokuon) is segmented on the basis of the log-normalized power P(N) of the speech (N is the frame number) and the cepstrum distance between adjacent frames CD(N), each defined by a separate equation. The likelihood of the geminate is then obtained from the duration LNG of the segmented geminate according to a further equation.

Fig. 2 shows the time variation of the log-normalized power P(N) of the speech when /SAQPORO/ (Sapporo) is uttered.

In this case, segmentation and likelihood calculation for the geminate /Q/ proceed as follows. Starting from the rear-end frame (a) of /A/, the frames in which P(N) is at or below the threshold TP are searched. The frame (b) at which P(N) becomes TP or more, or at which the cepstrum distance between adjacent frames CD(N) exceeds the threshold TCD, is taken as the rear-end frame of the geminate /Q/. The likelihood is then computed from the duration LNG of the segmented interval (a-b) of /Q/ according to the corresponding equation.
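The termination rule of this conventional search can be sketched as below. The function name, the frame values, and the thresholds are hypothetical, and the sketch covers only the boundary search; the equation that maps the duration LNG to a likelihood is not given in this text and is therefore not reproduced.

```python
def find_geminate_end(power, cep_dist, start, tp, tcd):
    """Scan forward from frame `start` (rear end of the preceding vowel).

    Returns the first later frame at which the power is TP or more, or the
    inter-frame cepstrum distance exceeds TCD. If no frame qualifies, the
    last frame index is returned (an assumption made for this sketch).
    """
    for n in range(start + 1, len(power)):
        if power[n] >= tp or cep_dist[n] > tcd:
            return n
    return len(power) - 1


# Hypothetical frame values: a low-power closure between two louder segments.
power = [0.9, 0.8, 0.1, 0.1, 0.1, 0.1, 0.7, 0.9]
cep_dist = [0.1, 0.2, 0.1, 0.1, 0.1, 0.1, 0.3, 0.2]
a = 1
b = find_geminate_end(power, cep_dist, a, tp=0.5, tcd=0.5)
lng = b - a  # duration of the segmented interval (a-b), in frames
```

Because the rule depends on a clear power dip or spectral jump, it only works when the geminate is realized as a near-silent closure, as in /SAQPORO/.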

Fig. 3 shows the time variation of the speech power P(N) and the cepstrum distance between adjacent frames CD(N) when /HUQSA/ (Fussa) is uttered. Consider the geminate /Q/ portion of Fig. 3. When the rear-end frame (d) of /Q/ is searched for starting from the rear-end frame (c) of /U/, then, unlike the /Q/ of Fig. 2, P(N) never falls to or below the threshold TP within the Q interval, and CD(N) never reaches TCD or more. As a result, when the geminate Q is segmented, the search runs past the true rear end (d) of the /Q/ interval, the rear-end frame (d) is located incorrectly, and the likelihood is low. The conventional method therefore has the drawback that words containing a phoneme sequence in which a geminate consonant is followed by a voiceless fricative are misrecognized.

(Object of the Invention) The object of the present invention is to eliminate the above drawback of the prior art, improve the accuracy of segmentation and likelihood calculation, and thereby improve the word recognition rate.

(Constitution of the Invention) To achieve this object, when segmenting and computing the likelihood of a phoneme sequence in which a geminate consonant is followed by a voiceless fricative, the present invention segments the two consecutive phonemes together. It then uses standard patterns, expressed as distributions of the phonemes' acoustic parameters, to compute the probability density that the segmented speech interval was generated by each phoneme, and uses that probability density value to calculate the likelihood of the segmented speech interval.

(Description of the Embodiment) An embodiment of the present invention is described below with reference to Figs. 3 and 4. Fig. 4 is a functional block diagram of an apparatus for carrying out the method of this embodiment. It comprises parameter extraction section 1, phoneme probability density calculation section 2, word recognition section 3, phoneme standard pattern section 6, word dictionary 7, and so on. It differs from the conventional example of Fig. 1 in that it holds phoneme standard patterns expressed as distributions of the acoustic parameters. In addition, the word dictionary writes each word to be recognized as a string of phoneme symbols, and every two-phoneme sequence of a geminate consonant followed by a voiceless fricative is given a code in advance that identifies it. The parameter time series obtained by parameter extraction is the same as in the conventional example.
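One way to attach the identifying code to the dictionary is sketched below. The text does not say what the code looks like or which phoneme symbols count as voiceless fricatives, so the `Q+` prefix, the function name, and the set `VOICELESS_FRICATIVES` are all assumptions made for illustration.

```python
# Assumption: the text does not list the voiceless fricative symbols.
VOICELESS_FRICATIVES = {"S", "H"}


def mark_geminate_fricative(phonemes):
    """Merge each 'Q' that precedes a voiceless fricative into one marked unit."""
    marked = []
    i = 0
    while i < len(phonemes):
        nxt = phonemes[i + 1] if i + 1 < len(phonemes) else None
        if phonemes[i] == "Q" and nxt in VOICELESS_FRICATIVES:
            marked.append("Q+" + nxt)  # hypothetical identifying code
            i += 2
        else:
            marked.append(phonemes[i])
            i += 1
    return marked


fussa = mark_geminate_fricative(list("HUQSA"))
sapporo = mark_geminate_fricative(list("SAQPORO"))
```

With this marking, /HUQSA/ carries a single combined unit for the geminate and fricative, while the geminate before the plosive in /SAQPORO/ is left as an ordinary phoneme.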

The operation of this embodiment is as follows. First, parameter extraction section 1 obtains the parameters of each frame from the input speech, and probability density calculation section 2 uses those parameter values to compute the probability density with respect to the standard pattern of each phoneme. Next, for each dictionary item, word recognition section 3 segments each phoneme X according to the dictionary phoneme sequence that makes up the item and computes the likelihood L_X of the interval segmented for phoneme X. However, the geminate in a sequence of geminate consonant plus voiceless fricative behaves differently from the geminate in a sequence of geminate consonant plus plosive: its properties are close to those of the voiceless fricative. The two consecutive phonemes, geminate and voiceless fricative, are therefore segmented together, and their likelihood is calculated using the probability density value of the voiceless fricative.

Fig. 3 shows the time variation of the speech power P(N), the cepstrum distance between adjacent frames CD(N), and the probability densities φ_H, φ_U, φ_S, φ_A of the phonemes /H/, /U/, /S/, /A/ when /HUQSA/ is uttered. In Fig. 3, the power P(N) in the geminate /Q/ portion does not fall to or below the threshold TP and is about the same as the power P(N) of the following phoneme /S/. The cepstrum distance between adjacent frames CD(d) at the boundary (d) with /S/ also stays below the threshold TCD and shows no large change.

In the /Q/ interval, the probability density φ_S of /S/ is dominant, and it remains dominant up to the rear end (f) of /S/. Therefore, a two-phoneme sequence of geminate consonant plus voiceless fricative is regarded as a single voiceless fricative of long duration. The rear end (f) of /S/ is found using the probability density of the voiceless consonant, segmentation is performed, and the likelihood L_QS of the two-phoneme sequence is obtained from the segmented interval length LQS and the voiceless fricative probability density value φ_S according to the corresponding equation.
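The combined segmentation can be sketched as a scan that extends the interval while the voiceless fricative's density stays the largest. This dominance test is an assumption made for illustration (the text gives thresholds TLGQ, TL, TH without their equations), and the function name and density values are hypothetical. The likelihood equation itself is not reproduced here.

```python
def find_fricative_end(densities, start, target="S"):
    """Return the last frame, scanning from `start`, at which the target
    phoneme's density is still the largest among all phonemes.

    `densities` holds one dict per frame, mapping phoneme symbol to density.
    """
    end = start
    for n in range(start, len(densities)):
        frame = densities[n]
        if max(frame, key=frame.get) != target:
            break
        end = n
    return end


# Hypothetical per-frame densities for the /U/, /Q/+/S/, /A/ part of /HUQSA/.
frames = [
    {"U": 0.7, "S": 0.2, "A": 0.1},
    {"U": 0.2, "S": 0.7, "A": 0.1},
    {"U": 0.1, "S": 0.8, "A": 0.1},
    {"U": 0.1, "S": 0.8, "A": 0.1},
    {"U": 0.1, "S": 0.7, "A": 0.2},
    {"U": 0.1, "S": 0.2, "A": 0.7},
]
c = 1  # first frame after the rear end of /U/
f = find_fricative_end(frames, c)
lqs = f - c  # length of the combined /Q/ + /S/ interval, in frames
```

Unlike the power and cepstrum-distance rule, this scan does not need a boundary between /Q/ and /S/, which is the point of treating the pair as one long fricative.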

In this embodiment, a two-phoneme sequence of geminate consonant plus voiceless fricative is regarded as a single voiceless fricative of long duration, and segmentation and likelihood calculation are performed using the probability density of the voiceless fricative. This has the advantage of improving the recognition rate for words that contain such a two-phoneme sequence.

C, D: constants (TLGQ, TL, and TH are determined in advance by preliminary experiments and the like). φ_i(C_N) denotes the probability density of a given phoneme i and is defined by the corresponding equation.

C_N: the vector of K parameters in the N-th frame; μ_i: the mean vector of the parameters of phoneme i; Σ_i: the covariance matrix. In the equation, the index i in the denominator of the probability density ratio ranges over 15 phoneme groups in total, made up of the five vowels, the nasals, the voiced consonants, and the voiceless consonants.
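A density of this kind can be sketched as follows. The defining equation is not reproduced in this text, so the Gaussian form is an assumption suggested by the mean vector and covariance matrix, and the diagonal covariance is a further simplification made only to keep the sketch short. The function names, the two-group `patterns` table, and all numbers are hypothetical; the text uses 15 phoneme groups.

```python
import math


def gaussian_density(c, mu, var):
    """Gaussian density with diagonal covariance (assumed form, see lead-in)."""
    log_p = 0.0
    for x, m, v in zip(c, mu, var):
        log_p += -0.5 * math.log(2.0 * math.pi * v) - (x - m) ** 2 / (2.0 * v)
    return math.exp(log_p)


def normalized_densities(c, patterns):
    """Divide each phoneme's density by the sum over all phoneme groups."""
    raw = {
        name: gaussian_density(c, mu, var)
        for name, (mu, var) in patterns.items()
    }
    total = sum(raw.values())
    return {name: p / total for name, p in raw.items()}


# Hypothetical standard patterns: (mean vector, variance vector) per phoneme.
patterns = {
    "S": ([0.0, 0.0], [1.0, 1.0]),
    "A": ([3.0, 3.0], [1.0, 1.0]),
}
phi = normalized_densities([0.1, -0.2], patterns)
```

Dividing by the sum over all groups makes the values comparable across frames, so a single threshold or dominance test can be applied along the whole utterance.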

(Effects of the Invention) When segmenting and computing the likelihood of a phoneme sequence in which a geminate consonant is followed by a voiceless fricative, the present invention segments the two consecutive phonemes together and calculates their likelihood using the probability density of the voiceless fricative. This has the advantage that segmentation and likelihood calculation are more accurate than with the conventional method.

[Brief Description of the Drawings]

Fig. 1 is a diagram for explaining the word speech recognition method of the conventional example. Fig. 2 shows the time variation of the normalized log power P(N) of the speech and the cepstrum distance between adjacent frames CD(N) when /SAQPORO/ (Sapporo) is uttered. Fig. 3 shows the time variation of P(N), CD(N), and the probability densities φ_H, φ_U, φ_S, φ_A of the phonemes /H/, /U/, /S/, /A/ when /HUQSA/ (Fussa) is uttered. Fig. 4 is a diagram for explaining the word speech recognition method of an embodiment of the present invention. 11: parameter extraction section; 12: phoneme probability density calculation section; 13: word recognition section; 14: phoneme standard pattern section; 15: word dictionary section.

Claims

[Claims] A word speech recognition method in which input speech is matched against the dictionary items of a word dictionary in which the words to be recognized are written as symbol strings in phoneme units, the input speech is segmented, for each dictionary item, according to the dictionary phoneme sequence that makes up the item, using acoustic parameters analyzed per unit time, and the similarity between each dictionary item and the input speech is obtained from the acoustic parameters of the segmented intervals to recognize the word, characterized in that, for a phoneme sequence in a dictionary word in which a geminate consonant and a voiceless fricative are consecutive, the two consecutive phonemes are segmented together, the probability density that the segmented speech interval is generated from each phoneme is calculated using standard patterns expressed as distributions of the acoustic parameters of the phonemes, and the likelihood is calculated for the segmented speech interval using that probability density value.