JPH0316039B2

Info

Publication number: JPH0316039B2
Application number: JP58177345A
Authority: JP
Inventors: Katsuyuki Futayada
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1983-09-26
Filing date: 1983-09-26
Publication date: 1991-03-04
Also published as: JPS6068393A

Description

DETAILED DESCRIPTION OF THE INVENTION

Field of Industrial Application

The present invention relates to a method for recognizing phonemes in a speech recognition method that is based on phoneme recognition.

Configuration of the Conventional Example and Its Problems

In recent years, research and development on speech recognition for unspecified speakers and large vocabularies has become active.

A speech recognition method based on phoneme recognition is well suited to recognizing many words from unspecified speakers, for several reasons: it is less susceptible to speaker-dependent variation such as differences in accent; it converts the speech signal into a phoneme sequence, a symbol string that carries little information yet corresponds to linguistic units, so the word dictionary can be small; and the contents of the word dictionary can easily be created and changed. The key point of this approach is accurate phoneme recognition. Leaving vowel recognition aside, consonant recognition has long been regarded as technically difficult. The discussion below therefore concentrates mainly on consonants.

Many phonetic studies of the characteristics of individual consonants have been carried out. However, there are few prior examples of so-called automatic recognition, in which the speech signal is segmented, consonant sections are detected, and the phonemes are recognized. Here, a technique previously filed by a group including the present applicant is taken as the conventional example, and its problems are pointed out.

The method of detecting phoneme sections in the conventional example is explained with reference to FIG. 1. The following three methods are used in combination.

(a) A voiced/unvoiced decision is made, and unvoiced sections are taken as consonant sections. As shown in FIG. 1a, the speech is divided into frames of about 10 msec, and a voiced/unvoiced decision is made for each frame. A section in which unvoiced frames continue is taken as a consonant section. This method is effective for detecting unvoiced consonants.

(b) Each frame of the speech section is compared with standard patterns of the five vowels and the nasals (/m/, /n/, and the moraic nasal), and the phoneme with the highest similarity is taken as the recognition result for that frame. A section in which frames recognized as nasal continue is taken as a consonant section (FIG. 1b). This segmentation method exploits the fact that the spectrum of a voiced consonant is closer to that of a nasal than to that of a vowel.

(c) Power dips are detected from the temporal movement of the speech power, and each dip section is taken as a consonant section (FIG. 1c). This segmentation method exploits the fact that the power of a consonant is lower than that of a vowel portion.
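
As a rough illustration of methods (a) and (b), the sketch below groups consecutive flagged frames into candidate consonant sections. It assumes that the per-frame decisions (unvoiced, or recognized as nasal) are already available as a list of booleans. The function name `consonant_sections` and the `min_frames` parameter are hypothetical and do not come from the patent.

```python
from typing import List, Tuple


def consonant_sections(flags: List[bool], min_frames: int = 1) -> List[Tuple[int, int]]:
    """Return (start, end) frame index pairs, end exclusive, for each run of
    consecutive flagged frames that is at least min_frames long."""
    sections: List[Tuple[int, int]] = []
    start = None  # index where the current run began, or None outside a run
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i                      # a new run begins
        elif not flag and start is not None:
            if i - start >= min_frames:    # the run just ended; keep it if long enough
                sections.append((start, i))
            start = None
    # Close a run that extends to the last frame.
    if start is not None and len(flags) - start >= min_frames:
        sections.append((start, len(flags)))
    return sections
```

With frames of about 10 msec, a `min_frames` of 2 would discard isolated single-frame decisions; whether such a threshold was used in the conventional example is not stated in the text.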

For a consonant section obtained in this way, the conventional method recognizes the consonant as follows, explained with reference to FIG. 2. For every frame in the consonant section, the similarity to the standard pattern of each consonant is calculated. In the figure, the similarity to phoneme A is denoted $l_a$, the similarity to phoneme B is denoted $l_b$, and the similarity to phoneme C is denoted $l_c$. The similarity is obtained from (Equation 1).

$$ l_j = -(\mathbf{x} - \boldsymbol{\mu}_j)^{T} \, \Sigma_j^{-1} \, (\mathbf{x} - \boldsymbol{\mu}_j) - K_j \qquad \text{(Equation 1)} $$

Here $j$ is the phoneme name (a, b, c, ...), and $K_j$ is a constant that depends on phoneme $j$. $\mathbf{x}$ is the input feature parameter vector (LPC cepstral coefficients), $\boldsymbol{\mu}_j$ is the mean vector (standard pattern), and $\Sigma_j$ is the covariance matrix (standard pattern).

If the sums of the similarities to phonemes A, B, C, ... over all frames are denoted $L_a$, $L_b$, $L_c$, ..., then

$$ L_a = \sum_{k=1}^{K} l_{a,k}, \qquad L_b = \sum_{k=1}^{K} l_{b,k}, \qquad L_c = \sum_{k=1}^{K} l_{c,k}, \quad \ldots $$

where $k$ is the frame index and $K$ is the number of frames in the consonant section. The phoneme with the largest similarity sum is taken as the recognized phoneme.
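
The conventional decision rule can be sketched as follows. This is only an illustration: it assumes the per-frame similarities of (Equation 1) have already been computed, and the function name and the dictionary layout are hypothetical.

```python
from typing import Dict, List


def recognize_by_similarity_sum(frame_similarities: Dict[str, List[float]]) -> str:
    """Sum each phoneme's per-frame similarity over the whole consonant
    section and return the phoneme with the largest sum."""
    totals = {phoneme: sum(values) for phoneme, values in frame_similarities.items()}
    # Every frame contributes with equal weight, which is the weakness
    # the patent goes on to criticize.
    return max(totals, key=lambda phoneme: totals[phoneme])
```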

The problem with the conventional example described above lies in the latter phoneme discrimination stage: once a section has been fixed by segmentation, the similarity is calculated frame by frame over the entire section.

In other words, the whole consonant section is assumed to be static in time, and every part of the section is treated equally.

However, whatever may be true of vowels, the feature parameters of consonants and semivowels change over time within the section, and the characteristics of each phoneme are found in the way they change. Moreover, the portion that carries the characteristics (the feature part) differs with the type of consonant or semivowel. For example, for voiced and unvoiced plosives the cues for discriminating the phoneme are concentrated around the burst; for nasals the feature part lies in the transition to the following vowel; and for liquids and semivowels the characteristic is the movement of the parameters over the whole phoneme section.

Therefore, an effective way to discriminate consonants and semivowels is to extract accurately the feature part that distinguishes each phoneme, and to perform phoneme discrimination with attention to the temporal movement of the parameters in that feature part. The conventional example gives no such consideration.

Purpose of the Invention

The present invention eliminates the above drawbacks of the prior art. Its object is to provide a means of discriminating phonemes with high accuracy by accurately extracting the feature part of a phoneme and matching it against phoneme standard patterns in a way that includes the temporal movement of the parameters in the feature part.

Structure of the Invention

To achieve the above object, the present invention provides a phoneme discrimination method characterized as follows. A standard pattern is prepared for each phoneme from the part of the phoneme section that expresses the characteristics of the phoneme well (hereinafter, the feature part); this is called the phoneme standard pattern. A standard pattern is also prepared from the information surrounding the feature part for the group of phonemes to be discriminated; this is called the surrounding information standard pattern. The input speech is segmented to obtain a feature part candidate section within the phoneme section. Pattern matching is performed by applying the phoneme standard patterns and the surrounding information standard pattern at each time point in the candidate section, so that the similarity to each phoneme is obtained over the whole candidate section in a form from which the influence of the surroundings of the feature part has been removed. The phoneme is then discriminated by comparing the similarities within the candidate section.

Description of the Embodiment

An embodiment of the present invention is described below with reference to the drawings.

In this embodiment, the feature part of each phoneme is located accurately by visual inspection, and phoneme standard patterns are created in advance from a large amount of data. A standard pattern of the information surrounding the feature part is also created, covering all the phonemes whose similarities are compared at one time (the phonemes belonging to one phoneme group). To discriminate a phoneme, a candidate section for the feature part is first set somewhat generously using segmentation parameters. The standard pattern of each phoneme and the standard pattern of the surrounding information are then applied over the whole candidate section, shifted one frame at a time, and the similarity is calculated. In this way the feature part is detected accurately within the candidate section and the phoneme is discriminated at the same time.

First, the method of creating the standard patterns is explained. A phoneme standard pattern is created by extracting the feature parameters from feature parts cut out by visual inspection for each phoneme, and computing the mean and the covariance matrix over a large amount of data. The feature parameters are LPC cepstral coefficients obtained by linear predictive analysis (LPC analysis).

Let $d$ be the number of feature parameters per frame and $j$ the number of frames in the feature part. The parameter sequence $\mathbf{x}$ of the feature part is then

$$ \mathbf{x} = \left( x_1^{(1)}, x_2^{(1)}, \ldots, x_d^{(1)},\; x_1^{(2)}, x_2^{(2)}, \ldots, x_d^{(2)},\; \ldots,\; x_1^{(j)}, x_2^{(j)}, \ldots, x_d^{(j)} \right) \qquad \text{(Equation 2)} $$

where $x_i^{(k)}$ is the $i$-th LPC cepstral coefficient in the $k$-th frame of the feature part. Parameter sequences are extracted from a large amount of data, and the mean vector $\boldsymbol{\mu}$ of the elements and the covariance matrix $\Sigma$ between the elements are computed and used as the standard pattern.

$$ \boldsymbol{\mu} = \left( \mu_1^{(1)}, \mu_2^{(1)}, \ldots, \mu_d^{(1)},\; \mu_1^{(2)}, \mu_2^{(2)}, \ldots,\; \mu_1^{(j)}, \mu_2^{(j)}, \ldots, \mu_d^{(j)} \right) \qquad \text{(Equation 3)} $$

where

$$ \mu_i^{(k)} = \frac{1}{M} \sum_{l=1}^{M} x_{i,l}^{(k)} $$

and $M$ is the number of data items. The covariance matrix is complicated and is not written out here. The method of this embodiment is thus characterized in that the standard pattern is created from the feature parameters of several frames, that is, with the temporal movement of the parameters taken into account.
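
A minimal sketch of building a standard pattern in the form of (Equation 2) and (Equation 3) is shown below. The function names are hypothetical. The covariance estimator, which divides by M, is an assumption, since the patent does not write the covariance formula out.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def stack_frames(frames: List[Vector]) -> Vector:
    """Concatenate j frames of d parameters each into one vector of
    length d * j, in the order of (Equation 2)."""
    return [value for frame in frames for value in frame]


def mean_vector(samples: List[Vector]) -> Vector:
    """Element-wise mean over M stacked samples, as in (Equation 3)."""
    m = len(samples)
    return [sum(sample[i] for sample in samples) / m for i in range(len(samples[0]))]


def covariance_matrix(samples: List[Vector], mean: Vector) -> Matrix:
    """Covariance between all elements of the stacked vector.
    Assumption: the 1/M estimator; the patent omits the formula."""
    m = len(samples)
    n = len(mean)
    return [
        [sum((s[r] - mean[r]) * (s[c] - mean[c]) for s in samples) / m for c in range(n)]
        for r in range(n)
    ]
```

Because the frames are stacked before the statistics are taken, the covariance matrix also holds the correlations between different frames, which is how the temporal movement of the parameters enters the pattern.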

Next, the method of creating the standard pattern of the surrounding information is explained. One surrounding information standard pattern is created for each group of phonemes whose similarities are compared with one another in phoneme discrimination: for example, one for the voiced plosive group (/b/, /d/, /g/), one for the nasal group (/m/, /n/, /ŋ/), and so on.

The surrounding information standard pattern captures, as a pattern, the nature of the information around the feature part. Each phoneme group has properties common to its members. For example, the feature part of a voiced plosive is around the burst, but the burst is always preceded by a buzz section of several frames and is followed by a rapid transition into the vowel. For an unvoiced plosive, a silent section precedes the feature part (around the burst).

For a nasal, stationary nasal frames precede the feature part (the transition to the following vowel). The surrounding information standard pattern turns such properties, common to a phoneme group, into a standard pattern.

FIG. 3 shows an example of a concrete procedure. Sections of sufficient length are set before and after the feature part (the hatched portion in the figure) to define the surrounding information section. The length of the section is set for each phoneme group. Over the surrounding information section L, parameter sequences of d × l dimensions, each spanning l frames, are taken across the whole section while shifting one frame at a time, as shown in the figure. This operation is applied to all the data belonging to the phoneme group, and the standard pattern for the surrounding information is obtained in the same way as the phoneme standard patterns. Data from the feature part is also mixed in by this procedure, but this causes no problem because the stationary surrounding information carries far more weight.
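
The one-frame shift described for FIG. 3 can be sketched as a sliding window over the surrounding information section. The function name `sliding_windows` is hypothetical; `window` plays the role of l in the text, and each frame is assumed to be a list of d parameters.

```python
from typing import List

Vector = List[float]


def sliding_windows(frames: List[Vector], window: int) -> List[Vector]:
    """Cut a section into overlapping stacked vectors of `window` frames,
    shifting one frame at a time. Each result has d * window elements."""
    return [
        [value for frame in frames[start:start + window] for value in frame]
        for start in range(len(frames) - window + 1)
    ]
```

Pooling these vectors over all the data of a phoneme group and then taking the mean and covariance would give the surrounding information standard pattern.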

Next, the phoneme recognition method is explained. First, the consonant section is determined from the segmentation parameters, and a somewhat generous candidate section for the feature part is set with the rear end of the consonant section as the reference point. Any segmentation parameters may be used; this embodiment mainly uses power dips in the high and low frequency bands, with the rising edge of the dip as the reference point. When a power dip is hard to detect, the voiced/unvoiced decision and nasality are used as well.

Next, the method of detecting the feature part within the feature part candidate section and discriminating the phoneme is explained. For simplicity, the following explanation assumes that the phoneme group consists of two phonemes (phoneme 1 and phoneme 2). The idea is the same when there are more phonemes.

Let the feature part candidate section be $t_1$ to $t_2$. Let $\mathbf{x}_t$ be the unknown input vector (the data to be discriminated) at time $t$ ($t_1 \le t \le t_2$); $\mathbf{x}_t$ has the same form as (Equation 2). Let $\boldsymbol{\mu}_1$ be the standard pattern (mean) of phoneme 1, $\boldsymbol{\mu}_2$ the standard pattern (mean) of phoneme 2, and $\boldsymbol{\mu}_e$ the standard pattern (mean) of the surrounding information, and let $\Sigma$ be a covariance matrix common to phoneme 1, phoneme 2, and the surrounding information. $\Sigma$ is created by averaging the individual covariance matrices.

Let $L_{1,t}$ be the similarity (distance) between the unknown input and phoneme 1 at time $t$. Then

$$ L_{1,t} = (\mathbf{x}_t - \boldsymbol{\mu}_1)^{T} \Sigma^{-1} (\mathbf{x}_t - \boldsymbol{\mu}_1) - (\mathbf{x}_t - \boldsymbol{\mu}_e)^{T} \Sigma^{-1} (\mathbf{x}_t - \boldsymbol{\mu}_e) \qquad \text{(Equation 4)} $$

Similarly, the distance $L_{2,t}$ to phoneme 2 is

$$ L_{2,t} = (\mathbf{x}_t - \boldsymbol{\mu}_2)^{T} \Sigma^{-1} (\mathbf{x}_t - \boldsymbol{\mu}_2) - (\mathbf{x}_t - \boldsymbol{\mu}_e)^{T} \Sigma^{-1} (\mathbf{x}_t - \boldsymbol{\mu}_e) \qquad \text{(Equation 5)} $$

The first term of each equation is the Mahalanobis distance to the phoneme, and the second term is the Mahalanobis distance to the surrounding information. The equations therefore mean that the distance to the surrounding information is subtracted from the similarity (distance) between the unknown input and the phoneme standard pattern at time $t$, and the result is taken as the new distance to the phoneme. (Equation 4) and (Equation 5) are evaluated over the period $t_1$ to $t_2$, and whichever phoneme attains the smallest value of $L_{1,t}$ or $L_{2,t}$ during this period is taken as the recognized phoneme.
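
The following sketch evaluates distances of the form of (Equation 4) and (Equation 5) over a candidate section and picks the phoneme with the smallest value. It is an illustration under stated assumptions, not the patent's implementation: the inverse of the shared covariance matrix is assumed to be precomputed and passed in as `inv_cov`, and all function and parameter names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Vector = List[float]
Matrix = List[List[float]]


def quad_form(v: Vector, p: Matrix) -> float:
    """Compute v^T P v for a square matrix P."""
    n = len(v)
    return sum(v[r] * p[r][c] * v[c] for r in range(n) for c in range(n))


def corrected_distance(x: Vector, mu_phoneme: Vector, mu_env: Vector, inv_cov: Matrix) -> float:
    """Mahalanobis distance to the phoneme pattern minus the Mahalanobis
    distance to the surrounding information pattern."""
    to_phoneme = quad_form([a - b for a, b in zip(x, mu_phoneme)], inv_cov)
    to_env = quad_form([a - b for a, b in zip(x, mu_env)], inv_cov)
    return to_phoneme - to_env


def discriminate(
    inputs: List[Vector],
    phoneme_means: Dict[str, Vector],
    mu_env: Vector,
    inv_cov: Matrix,
) -> Optional[Tuple[str, int, float]]:
    """Scan every time point of the candidate section and return
    (phoneme, time index, distance) for the overall minimum, or None
    if the section is empty."""
    best: Optional[Tuple[str, int, float]] = None
    for t, x in enumerate(inputs):
        for name, mu in phoneme_means.items():
            dist = corrected_distance(x, mu, mu_env, inv_cov)
            if best is None or dist < best[2]:
                best = (name, t, dist)
    return best
```

The returned time index marks where the minimum occurred, which corresponds to the detected feature part in the description above.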

In practice, (Equation 4) and (Equation 5) can be expanded into the following simple expressions (the derivation is omitted because it is not essential).

$$ L_{1,t} = \boldsymbol{\alpha}_1 \cdot \mathbf{x}_t - B_1 \qquad \text{(Equation 4')} $$

$$ L_{2,t} = \boldsymbol{\alpha}_2 \cdot \mathbf{x}_t - B_2 \qquad \text{(Equation 5')} $$

$\boldsymbol{\alpha}_1$, $\boldsymbol{\alpha}_2$, $B_1$, and $B_2$ are taken as new standard patterns that include the surrounding information.
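
The patent omits the derivation. One way it could go, assuming that $\Sigma$ is symmetric and shared by both terms as stated earlier, is sketched below for phoneme 1. The closed forms for $\boldsymbol{\alpha}_1$ and $B_1$ follow from this expansion and are not given in the source.

```latex
\begin{aligned}
L_{1,t}
&= (\mathbf{x}_t - \boldsymbol{\mu}_1)^{T} \Sigma^{-1} (\mathbf{x}_t - \boldsymbol{\mu}_1)
 - (\mathbf{x}_t - \boldsymbol{\mu}_e)^{T} \Sigma^{-1} (\mathbf{x}_t - \boldsymbol{\mu}_e) \\
&= \left( \mathbf{x}_t^{T} \Sigma^{-1} \mathbf{x}_t
        - 2 \boldsymbol{\mu}_1^{T} \Sigma^{-1} \mathbf{x}_t
        + \boldsymbol{\mu}_1^{T} \Sigma^{-1} \boldsymbol{\mu}_1 \right)
 - \left( \mathbf{x}_t^{T} \Sigma^{-1} \mathbf{x}_t
        - 2 \boldsymbol{\mu}_e^{T} \Sigma^{-1} \mathbf{x}_t
        + \boldsymbol{\mu}_e^{T} \Sigma^{-1} \boldsymbol{\mu}_e \right) \\
&= 2 (\boldsymbol{\mu}_e - \boldsymbol{\mu}_1)^{T} \Sigma^{-1} \mathbf{x}_t
 - \left( \boldsymbol{\mu}_e^{T} \Sigma^{-1} \boldsymbol{\mu}_e
        - \boldsymbol{\mu}_1^{T} \Sigma^{-1} \boldsymbol{\mu}_1 \right) \\
&= \boldsymbol{\alpha}_1 \cdot \mathbf{x}_t - B_1,
\qquad
\boldsymbol{\alpha}_1 = 2 \Sigma^{-1} (\boldsymbol{\mu}_e - \boldsymbol{\mu}_1),
\quad
B_1 = \boldsymbol{\mu}_e^{T} \Sigma^{-1} \boldsymbol{\mu}_e
    - \boldsymbol{\mu}_1^{T} \Sigma^{-1} \boldsymbol{\mu}_1 .
\end{aligned}
```

The quadratic term in $\mathbf{x}_t$ cancels only because the two distances use the same $\Sigma$, which is presumably why a common covariance matrix is used. The result for phoneme 2 has the same form with $\boldsymbol{\mu}_2$ in place of $\boldsymbol{\mu}_1$.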

The above method is explained conceptually with reference to FIG. 4.

Consider discriminating a consonant in the situation shown in FIG. 4a. Suppose that, for the true feature part of this consonant (the hatched portion), the feature part candidate section T has been obtained as $t_1$ to $t_2$. FIG. 4b shows the temporal variation of the distances to phoneme 1 and phoneme 2 as a solid line and a broken line, respectively. A, B, and C mark the positions where the distance reaches a local minimum. At the true feature part (point B), the distance to phoneme 1 is smaller than the distance to phoneme 2, so this consonant should be discriminated as phoneme 1. However, within the candidate section obtained automatically from the segmentation parameters, phoneme 2 takes the overall minimum at point A, so as things stand the consonant would be misclassified as phoneme 2. FIG. 4c shows the distance between the unknown input and the surrounding information standard pattern; the value becomes large near the true feature part, because that standard pattern is built mainly from the surrounding information. FIG. 4d shows the distance to the phoneme standard patterns that include the surrounding information, which is equivalent to subtracting c from b. In d the value at point B is smaller than at point A, so the consonant is correctly discriminated as phoneme 1.
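
The effect described for FIG. 4 can be reproduced with a toy example. The numbers below are invented purely for illustration and are not read from the figure; the function names are hypothetical. Index 0 plays the role of point A and index 2 the role of point B.

```python
from typing import Dict, List, Tuple


def subtract_surrounding(curves: Dict[str, List[float]], surrounding: List[float]) -> Dict[str, List[float]]:
    """Subtract the surrounding information distance (FIG. 4c) from each
    phoneme distance curve (FIG. 4b), giving the curves of FIG. 4d."""
    return {name: [v - s for v, s in zip(values, surrounding)] for name, values in curves.items()}


def best_phoneme(curves: Dict[str, List[float]]) -> Tuple[str, int]:
    """Return the phoneme whose curve reaches the overall minimum, and the
    time index at which that minimum occurs."""
    name = min(curves, key=lambda n: min(curves[n]))
    values = curves[name]
    return name, values.index(min(values))


# Invented distance curves over a five-frame candidate section.
raw = {
    "phoneme1": [5.0, 4.0, 2.0, 4.0, 5.0],   # minimum at index 2 (point B)
    "phoneme2": [1.5, 4.0, 3.0, 4.0, 5.0],   # spurious minimum at index 0 (point A)
}
surrounding = [0.5, 1.0, 4.0, 1.0, 0.5]      # large near the true feature part
corrected = subtract_surrounding(raw, surrounding)

print(best_phoneme(raw))        # ('phoneme2', 0): misclassified
print(best_phoneme(corrected))  # ('phoneme1', 2): correct after the subtraction
```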

Thus, by using the method of the present invention, the true feature part can be extracted accurately from the rough feature part candidate section obtained from the segmentation parameters, and the phoneme can be discriminated.

Although the above explanation used the Mahalanobis distance, the idea is the same for other measures. For example, when (Equation 1) is used, the likelihood is used in place of the distance and the maximum in place of the minimum. Also, although the explanation dealt with consonants, the same method can be applied to other phonemes that vary in time, such as semivowels.

FIG. 5 is a block diagram for implementing the method of the present invention. Reference numeral 1 denotes a feature parameter extraction unit, which analyzes the input speech and extracts the feature parameters.

Here, LPC analysis is performed to obtain LPC cepstral coefficients, which are sent to the similarity calculation unit 2. The feature parameter extraction unit 1 also obtains the high-band power and low-band power of the input speech and sends them to the segmentation unit 4. The similarity calculation unit 2 calculates, frame by frame, the distance between the input parameters and each standard pattern stored in the standard pattern storage unit 3. In addition to the phoneme standard patterns and the surrounding information standard patterns, the standard pattern storage unit 3 stores the voiced/unvoiced standard patterns and the standard patterns of the five vowels and the nasals, which are used for segmentation. The segmentation unit 4 determines the phoneme sections from the similarity information sent from the similarity calculation unit 2 and the power information sent from the feature parameter extraction unit 1. The phoneme discrimination unit 5 extracts the feature part using the phoneme section, the similarity to each phoneme, and the similarity to the surrounding information, compares the phoneme similarities at the feature part, discriminates the phoneme, and outputs the result.

With this embodiment, an average recognition rate of 76.1% was obtained for word-medial consonants. The same evaluation with the conventional method gave 72.5%. Considering that the conventional method recognizes some consonants only at the level of consonant groups, the effect of this embodiment is clearly significant. The data used were 212 words uttered by a total of 20 male and female speakers, so the above results are sufficiently reliable.

To examine the effect of using the surrounding information standard pattern, an experiment was carried out on voiced plosives and nasals. Without the surrounding information standard pattern, the recognition rate was 72.7% for voiced plosives and 64.1% for nasals; with it, the rates improved to 74.7% and 75.2%, respectively. The effect is especially marked for nasals. This is because the power dip of a nasal is indistinct, so the phoneme section cannot be extracted accurately. With the surrounding information standard pattern, the feature part can be detected accurately and the recognition rate improves even when the phoneme section is unclear, which confirms the effect of this embodiment.

Effects of the Invention

In summary, the present invention provides a phoneme discrimination method in which a standard pattern created for each phoneme from the part of the phoneme section that expresses the characteristics of the phoneme well (hereinafter, the feature part) and a standard pattern created from the information surrounding the feature part for the group of phonemes to be discriminated are prepared; the input speech is segmented to obtain a feature part candidate section within the phoneme section; pattern matching is performed by applying the phoneme standard patterns and the surrounding information standard pattern at each time point in the candidate section; the similarity to each phoneme is obtained over the whole candidate section in a form from which the influence of the surroundings of the feature part has been removed; and the phoneme is discriminated by comparing the similarities within the candidate section. The method has advantages such as the following.

(a) Speech can be segmented automatically and phonemes discriminated with high accuracy.

(b) By using the phoneme standard patterns and the surrounding information standard pattern, the part that is effective for phoneme discrimination (the feature part) can be extracted automatically and accurately, and matching can be performed.

[Brief Description of the Drawings]

FIG. 1 is a diagram explaining the conventional phoneme recognition method. FIG. 2 is a diagram explaining the conventional similarity calculation method. FIG. 3 is a diagram explaining how the surrounding information standard pattern is created in the phoneme discrimination method of an embodiment of the present invention. FIG. 4 is a diagram explaining how the feature part is detected and the phoneme is discriminated in the same method. FIG. 5 is a block diagram for implementing the same method.

1: feature parameter extraction unit; 2: similarity calculation unit; 3: standard pattern storage unit; 4: segmentation unit; 5: phoneme discrimination unit.

Claims

[Scope of Claims]

1. A phoneme discrimination method characterized in that: a standard pattern created for each phoneme from the part of the phoneme section that expresses the characteristics of the phoneme well (hereinafter referred to as the feature part), referred to as a phoneme standard pattern, and a standard pattern created from the information surrounding the feature part for the group of phonemes to be discriminated, referred to as a surrounding information standard pattern, are prepared; the input speech is segmented to obtain a feature part candidate section within the phoneme section; pattern matching is performed by applying the phoneme standard pattern and the surrounding information standard pattern at each time point in the feature part candidate section; the similarity to each phoneme is obtained over the whole feature part candidate section in a form from which the influence of the surroundings of the feature part has been removed; and the phoneme is discriminated by comparing the similarities within the feature part candidate section.

2. The phoneme discrimination method according to claim 1, characterized in that a statistical distance measure is used as the distance measure in pattern matching.

3. The phoneme discrimination method according to claim 2, wherein the statistical distance measure is the Mahalanobis distance.