JPH0455520B2


Info

Publication number: JPH0455520B2
Application number: JP59174325A
Authority: JP
Inventors: Masakatsu Hoshimi; Katsuyuki Futayada
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1984-08-22
Filing date: 1984-08-22
Publication date: 1992-09-03
Also published as: JPS6152700A

Description

DETAILED DESCRIPTION OF THE INVENTION

Field of Industrial Application

The present invention relates to a phoneme recognition method for use in speech recognition methods that are based on phoneme recognition.

Configuration of the Conventional Example and Its Problems

FIG. 1 is a block diagram of a conventional word recognition system. The system divides the input speech into phoneme units, recognizes it as a combination of phonemes (referred to as phoneme recognition), and outputs a recognition result by computing the similarity to a word dictionary written in phoneme units.

First, speech from many speakers is analyzed in advance. The acoustic analysis section 1 analyzes it with a filter bank in 10 ms analysis intervals, and the feature extraction section 2 derives feature parameters from the resulting spectral information. From these feature parameters, a standard pattern is created for each phoneme or phoneme group, such as the vowels /a/ and /o/ and the consonants /m/ and /b/, and is registered in the standard pattern registration section 5. Next, the input speech of an unspecified speaker is analyzed in the same way by the acoustic analysis section 1 for each analysis interval, and the feature extraction section 2 derives its feature parameters. Using these feature parameters and the standard patterns in the standard pattern registration section 5, the segmentation section 3 separates vowels from consonants (hereinafter referred to as segmentation). Based on this result, the phoneme discrimination section 4 compares each segment with the standard patterns in the standard pattern registration section 5 and assigns to that segment the phoneme whose standard pattern has the highest similarity. Finally, the resulting time series of phonemes (hereinafter referred to as the phoneme sequence) is sent to the word recognition section 6, which outputs as the recognition result the word whose entry in the word dictionary 7, likewise expressed as a phoneme sequence, has the highest similarity.

In this configuration, the phoneme discrimination section 4 has conventionally worked as follows. For a section judged to be a consonant section, feature parameters that characterize the phoneme are computed for each frame and compared with the prepared standard patterns of each phoneme or phoneme group, so that the consonant is classified frame by frame. The results are applied to a consonant classification tree, and the consonant whose conditions are satisfied is taken as the recognized consonant. In this scheme, however, word-initial consonants are not identified individually; the decision stops at the phoneme group. For example, /b/, /d/, and /g/ are treated together as the voiced plosive group.

Discrimination within the voiced plosive group has been reported, for example, in "Analysis of Japanese Voiced Plosives" by Hosoya and Fujisaki, Acoustical Society of Japan, Speech Study Group (S80-67). However, because of their analysis time and algorithmic complexity, no use of these methods in an actual word recognition system has been reported.

As described above, the conventional method stops at phoneme-group discrimination for word-initial consonants, which causes problems for some vocabularies. Methods for discriminating within a phoneme group have also been reported, but problems such as analysis time and algorithmic complexity remain, and they have not been used in actual systems.

Object of the Invention

The present invention was made to solve the above problems of the prior art. Its object is to provide a phoneme recognition method that recognizes word-initial consonants with an analysis time and an algorithm suitable for use in an actual system.

Configuration of the Invention

To achieve this object, the present invention provides a word-initial consonant recognition method in which the word-initial consonant of the input speech is segmented by applying, in any order, four methods: a method based on voiced/unvoiced determination, a method based on vowel/nasal determination, a method based on power change, and a method based on cepstral distance. Depending on which method produced the segmentation, the word-initial consonant is recognized as one of several phoneme groups, such as an unvoiced consonant group, a voiced consonant group, a group of consonants characterized by power change, and a group of consonants with short duration. A characteristic part (a part effective for phoneme discrimination) is then detected automatically within the phoneme section, and the phoneme is determined by computing, for that characteristic part, the similarity to the standard pattern of each phoneme belonging to the recognized phoneme group.

Description of the Embodiment

The embodiment is outlined as follows.

(a) Using the results of four segmentation methods, word-initial consonants are roughly classified into four groups: an unvoiced consonant group, a voiced consonant group, a group of consonants characterized by power change, and a group of consonants with short duration.

(b) A characteristic part is defined for each phoneme group, and a standard pattern of each phoneme is created in advance for that characteristic part. The phoneme standard patterns are created from a large amount of data that has been accurately labeled by visual inspection. In addition to the phoneme standard patterns, one standard pattern of the information surrounding the characteristic part is created for each phoneme group.

(c) Phoneme discrimination

Word-initial consonant segmentation is performed on the input speech to obtain the consonant section, and a part of that section (for example, an end point) is set as a reference point. In parallel, it is decided which of the phoneme groups in the rough classification of (a) the consonant section belongs to. The standard patterns belonging to that phoneme group are then applied to the characteristic part of the phoneme section to discriminate the phoneme. Because it is generally difficult to locate the characteristic part both automatically and accurately, the following procedure is used. A candidate section for the characteristic part is obtained with some margin around the reference point, and the similarity to each phoneme is computed over the whole candidate section with the surrounding-information standard pattern applied. In computing the similarity to each phoneme, the similarity to the surrounding-information standard pattern of the phoneme group described above is subtracted from the similarity between the phoneme standard pattern and the unknown input. This removes the information from those parts of the candidate section that do not correspond to the characteristic part (that is, the parts surrounding it), so that the true characteristic part is captured and the phoneme can be discriminated.

An embodiment of the present invention is described in detail below with reference to the drawings, taking consonant recognition as an example.

This embodiment uses the results of the following four segmentation methods to roughly classify word-initial consonants into four groups: (1) an unvoiced consonant group, (2) a voiced consonant group, (3) a group of consonants characterized by power change, and (4) a group of consonants with short duration.

Method 1: frame-by-frame voiced/unvoiced determination (one frame is 10 ms in this embodiment).
Method 2: frame-by-frame vowel/nasal determination.
Method 3: capturing the temporal change in power.
Method 4: cepstral distance.

Methods 1 to 4 are used together. Once one method detects a word-initial consonant, the remaining methods are not applied, and segmentation is performed based on that detection result. Methods 1 to 4 are described below.

Method 1, which uses voiced/unvoiced determination, is described first.

Word-initial unvoiced consonants can be segmented accurately by using the voiced/unvoiced determination made for each frame.

Voiced/unvoiced determination can be based on the zero-crossing count, the spectral slope, the value of the first-order autocorrelation coefficient, and so on; any of these may be used. In this embodiment, the determination is made by comparison with standard patterns for voiced and unvoiced frames.

When frames judged unvoiced continue from the word beginning for at least a certain number of frames (for example, four frames or more), that section is judged to be a consonant section. This method is effective for all unvoiced consonants.
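The run-length rule above can be sketched as follows. This is a minimal illustration, assuming the per-frame voiced/unvoiced decisions are already available as a list of booleans; the function name `leading_unvoiced_run` and its parameters are hypothetical and do not come from the patent.

```python
def leading_unvoiced_run(voiced_flags, min_frames=4):
    """Length of the unvoiced run at the word beginning, or None.

    voiced_flags: per-frame decisions, True = voiced, False = unvoiced
                  (assumed to come from some voiced/unvoiced classifier).
    min_frames:   minimum run length; 4 is the example value in the text.
    """
    run = 0
    for is_voiced in voiced_flags:
        if is_voiced:
            break  # the run must be contiguous from the word beginning
        run += 1
    return run if run >= min_frames else None
```

A return value of `None` means this method did not detect a consonant, so the next segmentation method would be tried.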

Next, method 2, which uses vowel/nasal determination, is described.

In this embodiment, frame-by-frame phoneme recognition is performed with LPC cepstral coefficients by comparison with prepared standard patterns of each phoneme. The standard patterns used are the five vowels (/a/, /i/, /u/, /e/, /o/), the nasals (represented by /N/), and the unvoiced consonants (represented by /s/). For each frame, the phoneme with the highest similarity (the first candidate phoneme) and the phoneme with the second highest similarity (the second candidate phoneme) are obtained. The first and second candidate phonemes of each frame, arranged in frame order, form the first candidate phoneme sequence and the second candidate phoneme sequence. When these sequences are examined in order from the word beginning and /N/ appears in either the first or the second candidate sequence for at least a certain number of consecutive frames (for example, four frames or more), that section is judged to be a consonant section.
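The nasal-run rule can be sketched in the same way. This is an illustrative sketch, assuming the first and second candidate phoneme sequences are given as lists of labels and that the run is counted from the first frame of the word; the function name and defaults are hypothetical.

```python
def leading_nasal_run(first_candidates, second_candidates,
                      nasal="N", min_frames=4):
    """Length of the leading run of frames in which the nasal label appears
    as the first or second candidate, or None if the run is too short."""
    run = 0
    for first, second in zip(first_candidates, second_candidates):
        if first != nasal and second != nasal:
            break  # neither candidate is nasal: the run ends here
        run += 1
    return run if run >= min_frames else None
```

Accepting the second candidate as well makes the rule tolerant of frames where a vowel narrowly outscores the nasal pattern.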

This method is effective for voiced consonants, nasals in particular.

Next, method 3, which uses the temporal change in power, is described.

When a word begins with a consonant that is mainly plosive, plotting the power over time gives a curve like a in FIG. 2. Because of the burst, the power rises sharply and then dips, as in a, at the transition to the following vowel.

Curve b is the time derivative of the power in curve a. P1 to P3 denote the frame numbers of the inflection points of a. Here the frame at which the voiced section begins is numbered 1. When, as in a and b, the derivative is positive at P1 and P3 and negative at P2, and P3 < m (where m is a threshold expressed as a frame number), the span from the word beginning to P3 is judged to be the word-initial consonant section.
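One possible reading of this rule is sketched below. It is an interpretation, not the patent's implementation: the derivative is approximated by a first difference, P1, P2, and P3 are taken as the strongest points of the first rise, the following fall, and the second rise, and frames are indexed from 0. The function name and the default `max_frame` are hypothetical, since the text does not give a value for m in this method.

```python
def power_dip_boundary(power, max_frame=10):
    """Return P3 (0-based index into the difference sequence) if the power
    shows a rise / fall / rise pattern with P3 < max_frame, else None."""
    diff = [b - a for a, b in zip(power, power[1:])]
    # Group the non-zero differences into runs of equal sign.
    runs = []  # list of (sign, [indices into diff])
    for k, d in enumerate(diff):
        if d == 0:
            continue
        sign = 1 if d > 0 else -1
        if runs and runs[-1][0] == sign:
            runs[-1][1].append(k)
        else:
            runs.append((sign, [k]))
    # Look for the first positive / negative / positive triple.
    for (s1, _), (s2, _), (s3, idx3) in zip(runs, runs[1:], runs[2:]):
        if (s1, s2, s3) == (1, -1, 1):
            p3 = max(idx3, key=lambda k: diff[k])  # steepest point of 2nd rise
            return p3 if p3 < max_frame else None
    return None
```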

Finally, method 4, which uses cepstral distance, is described.

The cepstral distance is used to compare the spectral pattern of a reference frame with the spectral pattern of each frame from the word beginning up to the reference frame.

In this embodiment, the LPC cepstral coefficients $C_1$ to $C_n$ (where $n$ is a positive integer) are used as the parameters that represent the spectral pattern. The reference frame, in which the spectrum appears stably, is fixed at the $m$-th frame from the word beginning ($m = 7$ in this embodiment). This is because consonants of relatively long duration ($m$ frames or more) can be detected by methods 1 to 3.

Equation 1 is used to compare the spectral patterns of two frames.

$$D(i, j) = \sum_{l=1}^{n} \bigl(C_l(i) - C_l(j)\bigr)^2 \qquad \text{(Equation 1)}$$

In Equation 1, $C_l(i)$ is the $l$-th LPC cepstral coefficient of the $i$-th frame from the word beginning, and $C_l(j)$ is likewise the $l$-th LPC cepstral coefficient of the $j$-th frame. The larger $D(i, j)$ is, the more the spectral patterns of the two frames differ.

Using Equation 1, the distance $D(i, m)$ (where $1 \le i \le m-1$) between the reference frame and each frame from the word beginning up to the reference frame is computed, and its maximum is denoted $D_{max}$. The presence or absence of a word-initial consonant is decided by whether $D_{max}$ is larger or smaller than a threshold. When a consonant is detected in this way, the word-initial consonant section extends up to the frame at which $D(i, m)$ changes the most. This method is effective for detecting consonants of short duration.
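The cepstral-distance test can be sketched as follows. This is a sketch under stated assumptions: a consonant is taken to be present when the maximum distance exceeds the threshold, the "largest change" is measured between consecutive frames, and the threshold value is arbitrary. The names `cepstral_distance` and `short_consonant_boundary` are hypothetical.

```python
def cepstral_distance(ci, cj):
    """Squared Euclidean distance between two cepstral vectors (Equation 1)."""
    return sum((a - b) ** 2 for a, b in zip(ci, cj))


def short_consonant_boundary(cepstra, ref_frame=7, threshold=1.0):
    """Return the 1-based number of the last consonant frame, or None.

    cepstra:   list of per-frame LPC cepstral vectors, word beginning first.
    ref_frame: 1-based reference frame m (7 in the embodiment).
    threshold: detection threshold on the maximum distance (assumed value).
    """
    if len(cepstra) < ref_frame:
        raise ValueError("need at least ref_frame frames")
    ref = cepstra[ref_frame - 1]
    dists = [cepstral_distance(c, ref) for c in cepstra[:ref_frame - 1]]
    if max(dists) <= threshold:
        return None  # spectrum barely changes: no short consonant detected
    dists.append(0.0)  # distance of the reference frame to itself
    changes = [abs(b - a) for a, b in zip(dists, dists[1:])]
    k = max(range(len(changes)), key=changes.__getitem__)
    return k + 1  # the largest change lies between frames k+1 and k+2
```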

Word-initial consonants are segmented by methods 1 to 4 described above and are roughly classified into four consonant groups according to the method that detected them:

Method 1: unvoiced consonant group (/z/, /h/, /s/, /c/, /p/, /t/, /k/)
Method 2: voiced consonant group (/m/, /n/, /b/, /d/, /g/, /r/, /z/)
Method 3: group of consonants characterized by power change (/b/, /d/, /g/, /z/, /p/, /t/, /k/)
Method 4: group of consonants with short duration (/m/, /n/, /b/, /d/, /g/, /r/, /z/, /h/, /p/, /t/, /k/)
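The rough classification amounts to a cascade: the segmentation methods are tried in turn, and the first one that detects a consonant fixes the candidate group. A minimal sketch follows, assuming each detector is a callable that returns a boundary or `None`; the names `CONSONANT_GROUPS` and `coarse_classify` are hypothetical, while the group memberships are copied from the text.

```python
# Consonant groups as listed in the text, keyed by the detecting method.
CONSONANT_GROUPS = {
    1: ("unvoiced", ["z", "h", "s", "c", "p", "t", "k"]),
    2: ("voiced", ["m", "n", "b", "d", "g", "r", "z"]),
    3: ("power change", ["b", "d", "g", "z", "p", "t", "k"]),
    4: ("short duration",
        ["m", "n", "b", "d", "g", "r", "z", "h", "p", "t", "k"]),
}


def coarse_classify(frames, detectors):
    """Apply the detectors in order and stop at the first detection.

    detectors: four callables, one per segmentation method; each takes the
               frame data and returns a boundary or None (assumed interface).
    """
    for method, detect in enumerate(detectors, start=1):
        boundary = detect(frames)
        if boundary is not None:
            group, candidates = CONSONANT_GROUPS[method]
            return {"method": method, "group": group,
                    "candidates": candidates, "boundary": boundary}
    return None  # no word-initial consonant detected
```

Only the phonemes in `candidates` then need to be compared in the fine classification, which is what keeps the algorithm simple.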

After the candidates have been narrowed down by this rough classification, fine classification is performed within each consonant group. In the fine classification, the similarity to each phoneme standard pattern is computed, and the consonant is determined by comparing the similarities across phonemes.

Unvoiced and voiced plosives are characterized by the transition from the burst point to the following vowel. Fine classification within the unvoiced plosive group or the voiced plosive group therefore requires a similarity calculation that takes the temporal movement around the burst point into account. Nasals are characterized by the transition into the vowel, and the similarity calculation must take the temporal movement of that part into account. The liquid /r/ is characterized by the spectral change over the whole section and by its duration. /z/ is characterized by a buzz part followed by a fricative part.

Thus the characteristic part differs between consonant groups, but in every case the temporal movement relative to a characteristic point is important information. The characteristic point is detected automatically as follows: for the unvoiced consonant group, the word-initial frame, which is the start of the phoneme; for the voiced consonant group, the frame at which the decision changes from nasal to vowel; for the group characterized by power change, the frame at which the power rises; and for the short-duration group, the end of the phoneme. However, it is not easy to detect the characteristic frame accurately by automatic means. To reduce recognition errors caused by detection errors, the similarity is computed over several frames before and after the automatically detected characteristic frame, and the value at the frame where the similarity is highest is taken as the similarity to that phoneme.
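The search around the detected characteristic frame can be sketched as below. This is a minimal sketch, assuming a scoring function that returns a similarity (larger is better) for a given frame index; `best_score_near`, `score_at`, and the default `margin` are hypothetical names and values.

```python
def best_score_near(feature_frame, num_frames, score_at, margin=3):
    """Evaluate the similarity at every frame within `margin` frames of the
    detected characteristic frame and return (best_frame, best_score)."""
    lo = max(0, feature_frame - margin)
    hi = min(num_frames - 1, feature_frame + margin)
    best = max(range(lo, hi + 1), key=score_at)
    return best, score_at(best)
```

For a distance measure, where smaller is better, the same search would use `min` instead of `max`.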

The similarity is computed with Equation 2 or Equation 3 below, in a way that takes temporal movement into account.

Distance based on Bayes decision:

$$P_i = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\left\{-\frac{1}{2}(\boldsymbol{x}-\boldsymbol{\mu}_i)^T\,\Sigma^{-1}\,(\boldsymbol{x}-\boldsymbol{\mu}_i)\right\} \qquad \text{(Equation 2)}$$

Mahalanobis distance:

$$L_i = (\boldsymbol{x}-\boldsymbol{\mu}_i)^T\,\Sigma^{-1}\,(\boldsymbol{x}-\boldsymbol{\mu}_i) \qquad \text{(Equation 3)}$$

That is, the data used for the similarity calculation are not the feature parameters of a single frame but those of several frames (here, $l$ frames). In Equation 2 or Equation 3, the input feature parameters

$$\boldsymbol{x} = \bigl(x_1^{(1)}, x_2^{(1)}, \dots, x_d^{(1)},\; x_1^{(2)}, x_2^{(2)}, \dots, x_d^{(2)},\; \dots,\; x_1^{(l)}, x_2^{(l)}, \dots, x_d^{(l)}\bigr)$$

and the mean of the standard pattern

$$\boldsymbol{\mu} = \bigl(\mu_1^{(1)}, \mu_2^{(1)}, \dots, \mu_d^{(1)},\; \mu_1^{(2)}, \mu_2^{(2)}, \dots, \mu_d^{(2)},\; \dots,\; \mu_1^{(l)}, \mu_2^{(l)}, \dots, \mu_d^{(l)}\bigr)$$

are $d \times l$-dimensional. The covariance matrix $\Sigma$ likewise has $d \times l$ dimensions (it is not written out because it would be complicated). By using data from several frames in this way, both the spectral characteristics carried by the parameters and the characteristics of their temporal variation are compared with the phoneme standard pattern at the same time.
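The multi-frame feature vector and the Mahalanobis distance of Equation 3 can be sketched in plain Python. This is an illustrative sketch, assuming the inverse covariance matrix has already been computed and is passed in as a nested list; the function names `stack_frames` and `mahalanobis` are hypothetical.

```python
def stack_frames(frames, start, length):
    """Concatenate `length` consecutive d-dimensional frames, beginning at
    index `start`, into one (d * length)-dimensional vector."""
    window = frames[start:start + length]
    if len(window) < length:
        raise ValueError("not enough frames for the requested window")
    return [value for frame in window for value in frame]


def mahalanobis(x, mean, cov_inv):
    """Quadratic form (x - mean)^T * cov_inv * (x - mean), as in Equation 3.

    cov_inv: inverse covariance matrix as a list of rows (assumed to be
             precomputed elsewhere).
    """
    diff = [a - b for a, b in zip(x, mean)]
    n = len(diff)
    return sum(diff[r] * cov_inv[r][c] * diff[c]
               for r in range(n) for c in range(n))
```

Because the stacked vector keeps the frames in time order, a pattern with the right spectra in the wrong order is penalized, which is how temporal movement enters the distance.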

Next, the method of creating the standard patterns is described. The standard patterns are created from a large amount of data cut out accurately from speech by visual inspection.

A phoneme standard pattern is created for each phoneme as follows: from many tokens of the same phoneme, the $l$ frames of data corresponding to the characteristic part are cut out to form a $d \times l$-dimensional feature vector, and the mean and covariance matrix over the tokens are computed.

One surrounding-information standard pattern is created for each phoneme group, because within a phoneme group the surrounding information is common to all phonemes. The surrounding-information standard pattern is thus a standard pattern of the surrounding information that is universal for that phoneme group. FIG. 3 shows how it is created. Around the characteristic part (the hatched part in the figure), a section that is sufficiently long in time compared with the characteristic part is set as the surrounding information section L. Over this section, as shown in the figure, the feature parameters of $l$ frames ($d \times l$ dimensions) are extracted across the whole section while shifting one frame at a time. This procedure is applied to a large amount of data belonging to the same phoneme group, and the mean vector and covariance matrix are computed to form the surrounding-information standard pattern. The surrounding-information standard pattern therefore also contains data from the characteristic part, but data from the neighborhood of the characteristic part carry far more weight.
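The sliding-window procedure for the surrounding-information pattern can be sketched as follows. Only the mean vector is computed here to keep the sketch short; the covariance matrix would be accumulated over the same windows. The function name `surrounding_mean` and the data layout (one list of frames per training token) are assumptions.

```python
def surrounding_mean(tokens, length):
    """Mean of all `length`-frame windows taken, one frame at a time, over
    the surrounding-information section of every training token.

    tokens: list of tokens; each token is a list of d-dimensional frames
            covering its surrounding-information section.
    """
    total = None
    count = 0
    for frames in tokens:
        for start in range(len(frames) - length + 1):
            window = frames[start:start + length]
            vec = [value for frame in window for value in frame]
            if total is None:
                total = [0.0] * len(vec)
            total = [t + v for t, v in zip(total, vec)]
            count += 1
    if count == 0:
        raise ValueError("no window of the requested length was found")
    return [t / count for t in total]
```

Since the section is much longer than the characteristic part, most windows fall outside it, which is why the resulting pattern mainly reflects the neighborhood.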

Next, a concrete method is described for finely classifying the roughly classified data using the standard patterns created as above.

For simplicity, the following explanation uses the Mahalanobis distance of Equation 3 as the distance measure and considers the case in which one phoneme group consists of two phonemes (phoneme 1 and phoneme 2). The idea is the same when there are more phonemes.

To locate the characteristic part, the characteristic frame is detected by the method described earlier, and a rough candidate section is obtained around that frame. Let this section span times $t_1$ to $t_2$. Let $\boldsymbol{x}_t$ ($t = t_1$ to $t_2$) be the unknown input vector (the data to be finely classified) at time $t$, $\boldsymbol{\mu}_1$ the standard pattern (mean) of phoneme 1, $\boldsymbol{\mu}_2$ the standard pattern (mean) of phoneme 2, and $\boldsymbol{\mu}_e$ the standard pattern (mean) of the surrounding information, and let $\Sigma$ be a covariance matrix common to phoneme 1, phoneme 2, and the surrounding information. $\Sigma$ is obtained by averaging the individual covariance matrices.

Let $L_{1,t}$ be the similarity (distance) of the unknown input to phoneme 1 at time $t$:

$$L_{1,t} = (\boldsymbol{x}_t-\boldsymbol{\mu}_1)^T\,\Sigma^{-1}\,(\boldsymbol{x}_t-\boldsymbol{\mu}_1) - (\boldsymbol{x}_t-\boldsymbol{\mu}_e)^T\,\Sigma^{-1}\,(\boldsymbol{x}_t-\boldsymbol{\mu}_e) \qquad \text{(Equation 4)}$$

Similarly, let $L_{2,t}$ be the distance to phoneme 2:

$$L_{2,t} = (\boldsymbol{x}_t-\boldsymbol{\mu}_2)^T\,\Sigma^{-1}\,(\boldsymbol{x}_t-\boldsymbol{\mu}_2) - (\boldsymbol{x}_t-\boldsymbol{\mu}_e)^T\,\Sigma^{-1}\,(\boldsymbol{x}_t-\boldsymbol{\mu}_e) \qquad \text{(Equation 5)}$$

These equations mean that the similarity between the unknown input at time $t$ and the phoneme standard pattern, minus the similarity to the surrounding information, is taken as the new similarity to the phoneme. Equations 4 and 5 are evaluated over the period $t_1$ to $t_2$, and the phoneme whose value, $L_{1,t}$ or $L_{2,t}$, becomes the smallest during this period is taken as the recognized phoneme.
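The decision rule of Equations 4 and 5 can be sketched as follows. This is a minimal sketch, assuming the candidate-section inputs are already stacked feature vectors and that the inverse of the shared covariance matrix is given as a nested list; the names `quad_form` and `discriminate` are hypothetical, and the loop handles any number of phonemes in the group.

```python
def quad_form(x, mean, cov_inv):
    """(x - mean)^T * cov_inv * (x - mean)."""
    diff = [a - b for a, b in zip(x, mean)]
    n = len(diff)
    return sum(diff[r] * cov_inv[r][c] * diff[c]
               for r in range(n) for c in range(n))


def discriminate(inputs, phoneme_means, surround_mean, cov_inv):
    """Return (score, phoneme, t) with the smallest corrected distance.

    inputs:        stacked input vectors x_t over the candidate section.
    phoneme_means: dict mapping each phoneme label to its mean vector.
    surround_mean: mean vector of the surrounding-information pattern.
    """
    best = None
    for t, x in enumerate(inputs):
        d_surround = quad_form(x, surround_mean, cov_inv)
        for name, mean in phoneme_means.items():
            # Phoneme distance minus surrounding distance (Equations 4, 5).
            score = quad_form(x, mean, cov_inv) - d_surround
            if best is None or score < best[0]:
                best = (score, name, t)
    return best
```

The returned index `t` is relative to the start of the candidate section and marks where the characteristic part was effectively found.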

In practice, Equations 4 and 5 can be expanded into the following simple forms (the derivation is omitted).

$$L_{1,t} = \boldsymbol{A}_1 \cdot \boldsymbol{x}_t - B_1 \qquad \text{(Equation 4')}$$

$$L_{2,t} = \boldsymbol{A}_2 \cdot \boldsymbol{x}_t - B_2 \qquad \text{(Equation 5')}$$

Here $\boldsymbol{A}_1$, $\boldsymbol{A}_2$, $B_1$, and $B_2$ are the standard patterns that incorporate the surrounding information.
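The omitted expansion can be checked directly. The following is a sketch for phoneme 1, assuming $\Sigma$ is symmetric and writing the unknown input as $\boldsymbol{x}_t$, the phoneme mean as $\boldsymbol{\mu}_1$, and the surrounding-information mean as $\boldsymbol{\mu}_e$; the symbol $\boldsymbol{A}_1$ for the coefficient vector is a notational assumption.

$$
\begin{aligned}
L_{1,t} &= (\boldsymbol{x}_t-\boldsymbol{\mu}_1)^T\Sigma^{-1}(\boldsymbol{x}_t-\boldsymbol{\mu}_1) - (\boldsymbol{x}_t-\boldsymbol{\mu}_e)^T\Sigma^{-1}(\boldsymbol{x}_t-\boldsymbol{\mu}_e) \\
&= -2\,\boldsymbol{\mu}_1^T\Sigma^{-1}\boldsymbol{x}_t + \boldsymbol{\mu}_1^T\Sigma^{-1}\boldsymbol{\mu}_1 + 2\,\boldsymbol{\mu}_e^T\Sigma^{-1}\boldsymbol{x}_t - \boldsymbol{\mu}_e^T\Sigma^{-1}\boldsymbol{\mu}_e \\
&= \underbrace{2\,\Sigma^{-1}(\boldsymbol{\mu}_e-\boldsymbol{\mu}_1)}_{\boldsymbol{A}_1} \cdot \boldsymbol{x}_t \;-\; \underbrace{\bigl(\boldsymbol{\mu}_e^T\Sigma^{-1}\boldsymbol{\mu}_e - \boldsymbol{\mu}_1^T\Sigma^{-1}\boldsymbol{\mu}_1\bigr)}_{B_1}
\end{aligned}
$$

The quadratic term $\boldsymbol{x}_t^T\Sigma^{-1}\boldsymbol{x}_t$ cancels because the covariance matrix is shared, so each phoneme costs only one inner product per frame. The expansion for phoneme 2 is identical with $\boldsymbol{\mu}_2$ in place of $\boldsymbol{\mu}_1$.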

The meaning of the above method is explained conceptually with reference to FIG. 4.

Consider discriminating a consonant when the phoneme section is as shown in FIG. 4a. Suppose that, for the true characteristic part of this consonant (the hatched part), the characteristic part candidate section T has been obtained as times $t_1$ to $t_2$. FIG. 4b shows the temporal variation of the similarity (distance), obtained with Equation 3, to phoneme 1 (solid line) and phoneme 2 (hatched line). A, B, and C mark the positions where the distance reaches a local minimum. At the true characteristic part (point B), the distance to phoneme 1 is smaller than that to phoneme 2, so this consonant should be discriminated as phoneme 1. However, within the characteristic part candidate section obtained automatically from the segmentation parameters, phoneme 2 has the overall minimum at point A, so as things stand the consonant would be misclassified as phoneme 2. FIG. 4c shows the distance between the unknown input and the surrounding-information standard pattern; the value becomes large near the true characteristic part, because that standard pattern is built mainly from the surrounding information. FIG. 4d shows the distance to the phoneme standard patterns that incorporate the surrounding information, which is equivalent to b minus c. In d, the value at point B is smaller than at point A, so the consonant is correctly discriminated as phoneme 1.

Thus, with the method of this embodiment, the true characteristic part can be extracted automatically and accurately from the rough characteristic part candidate section obtained with the segmentation parameters, and the phoneme can be discriminated.

The above explanation used the Mahalanobis distance based on Equation 3, but the same approach can be used with other distance measures.

Although the explanation above dealt with consonants, the same approach can be applied to other phonemes that vary over time, such as semivowels.

In this way, the candidates are narrowed down by rough classification, and in the fine classification the phoneme is discriminated with a statistical distance measure that takes temporal movement into account, based on an automatically extracted characteristic part. This is a rational recognition method that exploits the phonetic properties of phonemes, especially consonants and semivowels.

With this embodiment, an average recognition rate of about 70.3% was obtained over all word-initial consonants (/p/, /t/, /k/, /c/, /b/, /d/, /g/, /m/, /n/, /r/, /z/, /s/, /h/) that could be segmented by methods 1 to 4. The data consist of a 212-word set uttered by each of 20 speakers, male and female, and are sufficiently reliable. Considering that the conventional method performs no fine classification within consonant groups, the effect of the embodiment of the present invention is substantial.

Effects of the Invention

In summary, the present invention provides a word-initial consonant recognition method in which the word-initial consonant of an input word is segmented by using four methods together; the word-initial consonant is recognized as one of several phoneme groups, such as an unvoiced consonant group, a voiced consonant group, a group of consonants characterized by power change, and a group of consonants with short duration, according to which of the four methods produced the segmentation; a characteristic part (a part effective for phoneme discrimination) is then detected automatically within the phoneme section; and the phoneme is determined by computing, for that characteristic part, the similarity to the standard pattern of each phoneme belonging to the recognized phoneme group. The method has the following advantages, among others.

(a) Word-initial consonants are segmented automatically, and phonemes can be recognized with high accuracy.

(b) The part effective for phoneme discrimination (the characteristic part) can be extracted automatically and accurately, and matching can be performed on it.

(c) Fine classification within the voiced plosive group, the unvoiced plosive group, and the nasal group, which has conventionally been considered difficult, can be performed in combination with automatic segmentation.

(d) Because the rough classification of consonants uses the results of the four word-initial consonant segmentation methods, the algorithm can be kept simple.


[Brief explanation of drawings]

FIG. 1 is a functional block diagram of a conventional speech recognition system. FIG. 2 is an explanatory diagram of the method of detecting a word-initial consonant from the power change in an embodiment of the present invention. FIG. 3 is a diagram explaining the method of creating the surrounding-information standard pattern in the same embodiment. FIG. 4 is a diagram explaining the method of detecting the characteristic part and discriminating the phoneme in the same embodiment.

1: acoustic analysis section, 2: feature extraction section, 3: segmentation section, 4: phoneme discrimination section, 5: standard pattern registration section, 6: word recognition section, 7: word dictionary.

Claims

[Claims]

1. A phoneme recognition method characterized in that input speech is segmented by applying, in any order, a first method of detecting a word-initial consonant by voiced/unvoiced determination, a second method of detecting a word-initial consonant by vowel/nasal determination, a third method of detecting a word-initial consonant by capturing the temporal change in power, and a fourth method of detecting a word-initial consonant by cepstral distance; the word-initial consonant is thereby recognized as one of four phoneme groups, namely an unvoiced consonant group, a voiced consonant group, a group of consonants characterized by power change, and a group of consonants with short duration; a characteristic part (a part effective for phoneme discrimination) is automatically extracted within the phoneme section; and the phoneme is determined by computing, for the characteristic part, the similarity to the standard pattern of each phoneme belonging to the previously recognized phoneme group.

2. The phoneme recognition method according to claim 1, characterized in that candidate sections for the characteristic part are first determined using segmentation parameters, and then extraction of the characteristic part and phoneme discrimination are performed.

3. The phoneme recognition method according to claim 1, wherein the similarity to the standard patterns is computed using a statistical distance measure and standard patterns that include the temporal movement of phonemes.