JPH0558557B2

Info

Publication number: JPH0558557B2
Application number: JP61196273A
Authority: JP
Inventors: Yoichi Yamada; Keiko Takahashi
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 1986-08-21
Filing date: 1986-08-21
Publication date: 1993-08-26
Also published as: JPS6350900A

Description

DETAILED DESCRIPTION OF THE INVENTION (Field of Industrial Application) The present invention relates to a speech recognition device, and in particular to a speech recognition device that uses pattern matching.

(Prior Art) In speech recognition, stable and accurate extraction of the features of the steady-state vowel portions of the input speech is very important for improving recognition performance. There are two reasons why a recognition method that relies mainly on these features is effective. First, steady-state vowel portions occupy a larger share of the duration of human speech than the transitional (non-stationary) portions, that is, the portions where the sound moves from a consonant or vowel to a vowel, from a vowel to a consonant, and so on. Second, because a steady-state vowel portion lasts comparatively long, it is little affected by variations in utterance timing, so its features can be extracted stably.

Local peak extraction has been proposed as an effective technique for extracting the features of steady-state vowel portions in conventional devices. The technique aims to detect the formant frequency bands of the steady-state vowel portion.

FIGS. 3A to 3C illustrate this technique. The A/D-converted input speech signal is passed through bandpass filters with different center frequencies, each center frequency being assigned a channel number k (k is a positive integer). Frequency analysis and logarithmic conversion are performed at every predetermined time interval (hereinafter called a frame) to obtain a frequency spectrum (FIG. 3A). The spectrum is normalized by subtracting its least-squares approximation line from it (FIG. 3B). Among the channels whose normalized spectrum value is greater than 0, the channels at which the output value is a local maximum are given a local peak value of 1 and all remaining channels are given a local peak value of 0, so that the local peak pattern is extracted as a 1-bit feature (FIG. 3C).

The similarity between the extracted local peak pattern and standard patterns prepared in advance is then computed for each recognition target category, and the name of the category that gives the maximum similarity among all recognition target categories is output as the recognition result.

(Problems to be Solved by the Invention) The local peak is a feature that stably captures the formant bands of steady-state vowel portions, and it enables highly stable recognition.

However, it is difficult to extract features stably for consonant portions, of which fricatives such as /s/ and /ch/ are examples. Local peak extraction picks out the bands in which the normalized spectrum is a local maximum. In a steady-state vowel portion, the normalized spectrum has its maxima in the channels corresponding to the formant frequency bands, so the formant frequencies, which are the principal feature of such portions, can be extracted stably. Consonant portions such as fricatives, by contrast, do not have normalized spectra that peak in specific channels (frequency bands) the way vowel formants do, so the positions at which local peaks appear in a consonant portion are unstable and hard to determine uniquely.

Consequently, when recognizing utterances whose steady-state vowel portions are equivalent, such as "ichi" and "shichi", local peaks alone cannot stably capture the features of the consonant portion. This makes it difficult to distinguish the two accurately and has degraded recognition performance.

The object of the present invention is to eliminate the above problem and to provide a speech recognition device with excellent recognition performance. To this end, the local peak features and the consonantal features of the input speech are each extracted as time-series patterns and used to compute similarities with their respective standard patterns, so that a more accurate similarity is obtained.

(Means for Solving the Problems) To achieve this object, the present invention provides a speech recognition device that extracts a time-series pattern of feature values of the input speech from the speech start time to the speech end time (the speech interval), computes the similarity between this time-series pattern and standard patterns prepared in advance to obtain a similarity for each recognition target category, and takes as the recognition result the name of the category having the maximum similarity among all recognition target categories. The device is characterized by comprising:
a) a spectrum normalization unit that performs frequency analysis over a plurality of channels (numbered center frequencies) and logarithmic conversion to extract a frequency spectrum, and then computes a normalized spectrum pattern by normalizing the frequency spectrum for the characteristics of the vocal cord sound source;
b) a consonantal pattern extraction unit that, within the speech interval and based on the magnitude relationship between the normalized spectrum values in the high and low channel regions, extracts a consonantal pattern in frames determined to have consonantal properties and does not extract one (sets the value to 0 for all channel components) in frames determined not to have consonantal properties, performing this processing successively to create a consonantal pattern;
c) a local peak pattern extraction unit that sets to 1 the channel components at which the value of the normalized spectrum pattern is positive and a local maximum and sets all other channel components to 0, performing this processing for all frames in the speech interval to create a local peak pattern;
d) a consonantal similarity calculation unit that computes the similarity between the consonantal pattern obtained in item b) and consonantal standard patterns prepared in advance, and obtains a consonantal similarity for each recognition target category;
e) a storage unit for the consonantal standard patterns;
f) a local peak similarity calculation unit that computes the similarity between the local peak pattern obtained in item c) and local peak standard patterns prepared in advance, and obtains a local peak similarity for each recognition target category; and
g) a determination unit that refers to both the consonantal similarity and the local peak similarity to compute a total similarity for each recognition target category, and takes as the recognition result the name of the category whose total similarity is the largest among all recognition target categories.

Furthermore, in carrying out the present invention, it is preferable that the consonantal pattern extraction unit, in order to create the consonantal pattern by processing all frames in the speech interval, comprises:
a) frame determination means that determines a frame to have consonantal properties when the normalized spectrum values in its high channel region are larger than the normalized spectrum values in its low channel region;
b) first consonantal pattern value determination means that, when the frame is determined to have consonantal properties, sets the consonantal pattern value of each channel component whose normalized spectrum value in the frame exceeds a predetermined threshold to the normalized spectrum value of that channel, and sets the consonantal pattern value of all other channel components to 0; and
c) second consonantal pattern value determination means that, when the frame is not determined to have consonantal properties, sets the consonantal pattern value of the frame to 0 for all channel components.

(Operation) In the speech recognition device of the present invention, the local peak similarity is supplied to the determination unit. In addition, the normalized spectrum obtained from the spectrum normalization unit is processed in turn by the consonantal pattern extraction unit and the consonantal similarity calculation unit to obtain the consonantal similarity of the consonantal frames, which are identified from the magnitude relationship between the normalized spectrum values in the high and low channel regions. This consonantal similarity is also supplied to the determination unit, which performs recognition using the total similarity obtained by adding the local peak similarity and the consonantal similarity. Accurate and stable speech recognition is therefore possible.

(Embodiments) An embodiment of the present invention is described below with reference to the drawings.

FIG. 1 is a block diagram showing an embodiment of the present invention, FIG. 2A is a functional block diagram of the consonantal pattern extraction unit, and FIG. 2B is a flowchart explaining the operation of the consonantal pattern extraction unit. The configuration of the embodiment of the speech recognition device is described together with its operation with reference to FIG. 1 and FIGS. 2A and 2B.

The input speech D1 is input to the frequency analysis unit 10.

The frequency analysis unit 10 performs bandpass filter analysis over a predetermined number of bands (the band numbers are hereinafter called channels), computes the resulting frequency spectrum D2 at every predetermined time interval (frame), and outputs it to the spectrum normalization unit 11 and the speech interval detection unit 12.

The speech interval detection unit 12 determines the start time and end time of the input speech from, for example, the magnitude of the values of the frequency spectrum D2, generates a start time signal D3 and an end time signal D4, and outputs both to the local peak pattern extraction unit 13 and the consonantal pattern extraction unit 14. The spectrum normalization unit 11 computes the normalized spectrum D5 by subtracting from the frequency spectrum its least-squares approximation line, and outputs it to the local peak pattern extraction unit 13 and the consonantal pattern extraction unit 14.
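The normalization performed by the spectrum normalization unit 11 can be illustrated with a short sketch. The Python below is not taken from the patent: it assumes one frame of log-spectrum values indexed by channel number 1 to N, and the function name `normalize_spectrum` is hypothetical. It fits a straight line to the frame by ordinary least squares over the channel number and returns the residual.

```python
def normalize_spectrum(spectrum):
    """Return one frame of log-spectrum values minus its least-squares line.

    `spectrum[k - 1]` is assumed to hold the value for channel number k.
    """
    n = len(spectrum)
    if n < 2:
        raise ValueError("need at least two channels to fit a line")
    xs = range(1, n + 1)  # channel numbers 1..n
    mean_x = sum(xs) / n
    mean_y = sum(spectrum) / n
    # Ordinary least-squares slope and intercept of value against channel number.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, spectrum))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # The normalized spectrum is the residual about the fitted line.
    return [y - (slope * x + intercept) for x, y in zip(xs, spectrum)]
```

A frame that is already a straight line normalizes to zeros, and the residual of any frame sums to zero, which is why a threshold near 0 separates the channels above the spectral trend from those below it.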

For each frame, the local peak pattern extraction unit 13 sets the local peak pattern value to 1 for the channels at which the normalized spectrum value is a local maximum among the channels whose normalized spectrum value is positive, and to 0 for all other channels. It performs this processing successively for all frames from the start frame to the end frame and outputs the result as the local peak pattern D6 to the local peak similarity calculation unit 15.
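As an illustration of the rule applied by the local peak pattern extraction unit 13, the following Python sketch marks the positive local maxima of one normalized-spectrum frame. The function name is hypothetical, and the handling of edge channels and flat tops (a channel counts as a maximum only if it is strictly greater than each neighbour that exists) is an assumption, since the text does not specify it.

```python
def local_peak_pattern(nsp_frame):
    """Return a 0/1 list marking positive local maxima of one frame."""
    n = len(nsp_frame)
    pattern = [0] * n
    for i, value in enumerate(nsp_frame):
        if value <= 0:
            continue  # only positive normalized values can be local peaks
        # Assumption: an edge channel is compared with its single neighbour.
        left_ok = i == 0 or value > nsp_frame[i - 1]
        right_ok = i == n - 1 or value > nsp_frame[i + 1]
        if left_ok and right_ok:
            pattern[i] = 1
    return pattern
```

Applying the function to every frame from the start frame to the end frame would give a time series of 1-bit rows corresponding to the local peak pattern D6.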

The local peak similarity calculation unit 15 computes the similarity between the local peak pattern D6 and every local peak standard pattern stored in advance in the local peak standard pattern storage unit 16, and outputs the local peak similarity D8 for each recognition target category to the determination unit 19.

The frequency analysis unit 10, spectrum normalization unit 11, speech interval detection unit 12, local peak pattern extraction unit 13, local peak similarity calculation unit 15, local peak standard pattern storage unit 16 and determination unit 19 described above are components of the previously proposed speech recognition device based on local peak extraction, so a detailed description of them is omitted except where they have special functions.

In the speech recognition device of the present invention, the determination unit 19 judges similarity comprehensively by adding a consonantal similarity to the local peak similarity D8 described above. The consonantal pattern extraction unit 14, the consonantal similarity calculation unit 17 and the determination unit 19 are therefore configured to operate as described below.

The consonantal pattern extraction unit 14 creates the consonantal pattern D7 by the method described later with reference to FIG. 2 and outputs it to the consonantal similarity calculation unit 17.

The consonantal similarity calculation unit 17 computes the similarity between the consonantal pattern D7 and every consonantal standard pattern stored in advance in the consonantal standard pattern storage unit 18, and outputs the consonantal similarity D9 for each recognition target category to the determination unit 19.
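The text does not state which similarity measure the calculation units 15 and 17 use. Purely as an illustration, the Python sketch below substitutes a normalized inner product (cosine similarity) and assumes that the input pattern and each standard pattern have already been brought to the same number of frames and channels; the time alignment, the measure itself and all function names are assumptions, not the method of the patent.

```python
import math

def pattern_similarity(pattern, template):
    """Cosine similarity between two equally shaped frame-by-channel patterns."""
    same_shape = len(pattern) == len(template) and all(
        len(p) == len(t) for p, t in zip(pattern, template)
    )
    if not same_shape:
        raise ValueError("pattern and template must have the same shape")
    dot = sum(a * b for p, t in zip(pattern, template) for a, b in zip(p, t))
    norm_p = math.sqrt(sum(a * a for p in pattern for a in p))
    norm_t = math.sqrt(sum(b * b for t in template for b in t))
    if norm_p == 0.0 or norm_t == 0.0:
        return 0.0  # an all-zero pattern is treated as matching nothing
    return dot / (norm_p * norm_t)

def similarities_per_category(pattern, templates):
    """Map each category name to the similarity with its stored standard pattern."""
    return {name: pattern_similarity(pattern, tpl) for name, tpl in templates.items()}
```

The same function could be applied to the 1-bit local peak pattern and to the real-valued consonantal pattern, each against its own set of standard patterns.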

The determination unit 19 computes, for each recognition target category, the sum of the local peak similarity and the consonantal similarity, and outputs as the recognition result D10 the name of the category for which this sum is the largest among all recognition target categories.
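The decision rule of the determination unit 19 can be sketched in a few lines of Python. The function name `decide` and the dictionary representation of the similarities D8 and D9 are assumptions made for illustration only.

```python
def decide(local_peak_sim, consonantal_sim):
    """Return the category with the largest summed similarity and all totals.

    Both arguments map category names to similarity values (D8 and D9).
    """
    if set(local_peak_sim) != set(consonantal_sim):
        raise ValueError("both similarity tables must cover the same categories")
    # Total similarity per category is the plain sum of the two similarities.
    totals = {name: local_peak_sim[name] + consonantal_sim[name]
              for name in local_peak_sim}
    best = max(totals, key=totals.get)  # the first maximum wins on a tie
    return best, totals
```

With hypothetical values such as local peak similarities of 0.80 for "ichi" and 0.78 for "shichi" and consonantal similarities of 0.05 and 0.30, the local peak similarity alone would favour "ichi", while the total favours "shichi".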

As shown in FIG. 2A, for example, the consonantal pattern extraction unit 14 comprises frame determination means 20 that determines whether a frame is consonantal, first consonantal pattern value determination means 21 used when the frame is consonantal, and second consonantal pattern value determination means 22 used when the frame is not consonantal.

The frame determination means 20 determines that a frame has consonantal properties when the normalized spectrum values in its high channel region are larger than the normalized spectrum values in its low channel region.

When a frame is determined to have consonantal properties, the first consonantal pattern value determination means 21 sets the consonantal pattern value of each channel component whose normalized spectrum value in the frame exceeds a predetermined threshold to the normalized spectrum value of that channel, and sets the consonantal pattern value of all other channel components to 0.

When a frame is not determined to have consonantal properties, the second consonantal pattern value determination means 22 sets the consonantal pattern value of the frame to 0 for all channel components.

Next, the operation of the consonantal pattern extraction unit 14, a principal part of this embodiment, is described in detail with reference to the flowchart of FIG. 2B. In the following description, processing steps are denoted by S. The procedure described here is merely one preferred example, and any other suitable procedure may be used.

Let SFR be the start frame number, EFR the end frame number, NSP(i, j) the normalized spectrum and CMP(i, j) the consonantal pattern (i: channel number, j: frame number in both cases), and CHNNO the number of frequency analysis channels. Let j also denote the number of the frame currently undergoing consonantal pattern extraction.

First, the frame determination means 20 initializes j = SFR (S1).

Next, the relative magnitude of the normalized spectrum output of the high channel components with respect to the low channel components in the frame (frame number j) is computed by the following equation (S2).

$$\mathrm{SUB}(j) = \frac{1}{\mathrm{CHNNO} - \mathrm{HS} + 1} \sum_{i=\mathrm{HS}}^{\mathrm{CHNNO}} \mathrm{NSP}(i, j) \; - \; \frac{1}{\mathrm{LE}} \sum_{i=1}^{\mathrm{LE}} \mathrm{NSP}(i, j) \qquad (1)$$

Here HS and LE are set empirically to about LE = CHNNO/3 and HS = 2·CHNNO/3. It is then determined (S3) whether the following condition is satisfied, where THL1 is a predetermined threshold set to about 0:

SUB(j) > THL1 ... condition (A)
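Equation (1) and condition (A) amount to comparing the mean normalized value of the top third of the channels with that of the bottom third. The Python sketch below illustrates this for one frame; the function names and the rounding of CHNNO/3 and 2·CHNNO/3 by integer division are assumptions, since the text only says "about".

```python
def high_low_difference(nsp_frame, le=None, hs=None):
    """Compute SUB(j) of equation (1) for one frame.

    `nsp_frame[i - 1]` is assumed to hold NSP(i, j) for channel number i.
    """
    chnno = len(nsp_frame)
    if le is None:
        le = max(1, chnno // 3)        # LE, about CHNNO/3 (rounding assumed)
    if hs is None:
        hs = max(1, (2 * chnno) // 3)  # HS, about 2*CHNNO/3 (rounding assumed)
    high = nsp_frame[hs - 1:]          # channels HS..CHNNO
    low = nsp_frame[:le]               # channels 1..LE
    return sum(high) / (chnno - hs + 1) - sum(low) / le

def is_consonantal(nsp_frame, thl1=0.0):
    """Condition (A): the frame is consonantal when SUB(j) > THL1."""
    return high_low_difference(nsp_frame) > thl1
```

For a six-channel frame this gives LE = 2 and HS = 4, so the means are taken over channels 1 to 2 and channels 4 to 6.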

Next, when condition (A) is satisfied, that is, when the frame is determined to have consonantal properties, the first consonantal pattern value determination means 21 determines for each channel whether the normalized spectrum output satisfies NSP(i, j) > THL2 ... condition (B), where THL2 is a predetermined threshold set to about 0 (S4). For channels that satisfy condition (B), the consonantal pattern value is set to CMP(i, j) = NSP(i, j) (S5).

For channels that do not satisfy condition (B), the consonantal pattern value is set to CMP(i, j) = 0 (S6).

On the other hand, when condition (A) is not satisfied, that is, when the frame is not determined to have consonantal properties, the second consonantal pattern value determination means 22 sets the consonantal pattern value to CMP(i, j) = 0 for all channels in the frame (S6).

After the first and second consonantal pattern value determination means 21 and 22 have finished extracting the consonantal pattern for the frame, 1 is added to the frame number j (S7).

Whether the above processing has been completed for all frames is then checked with the condition j ≤ EFR. While this condition is satisfied, the processing from step S2 is repeated; when it is no longer satisfied, consonantal pattern extraction for the input speech ends.
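Steps S1 to S7 can be put together in a single loop. The Python sketch below is one possible rendering under stated assumptions: frames are stored as a list of per-frame channel lists with 0-based frame indices, SFR and EFR are inclusive, LE and HS are rounded by integer division, and the function name is hypothetical.

```python
def extract_consonantal_pattern(nsp, sfr, efr, thl1=0.0, thl2=0.0):
    """Return CMP rows for frames sfr..efr (inclusive) of the spectrum `nsp`.

    `nsp[j][i - 1]` is assumed to hold NSP(i, j).
    """
    chnno = len(nsp[sfr])
    le = max(1, chnno // 3)            # LE, about CHNNO/3 (rounding assumed)
    hs = max(1, (2 * chnno) // 3)      # HS, about 2*CHNNO/3 (rounding assumed)
    cmp_pattern = []
    j = sfr                                            # S1: j = SFR
    while j <= efr:                                    # repeat while j <= EFR
        frame = nsp[j]
        high_mean = sum(frame[hs - 1:]) / (chnno - hs + 1)
        low_mean = sum(frame[:le]) / le
        sub = high_mean - low_mean                     # S2: equation (1)
        if sub > thl1:                                 # S3: condition (A)
            # S4 to S6: keep values above THL2, zero the remaining channels.
            row = [v if v > thl2 else 0.0 for v in frame]
        else:
            row = [0.0] * chnno                        # S6: non-consonantal frame
        cmp_pattern.append(row)
        j += 1                                         # S7: next frame
    return cmp_pattern
```

A frame whose energy is concentrated in the low channels produces an all-zero row, while a frame dominated by the high channels keeps only its values above the threshold, which matches the behaviour described for the vowel and fricative portions.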

FIGS. 4 and 5 mainly illustrate the local peak patterns and consonantal patterns extracted when the utterances are "ichi" and "shichi", respectively.

FIGS. 4A and 5A are speech power diagrams for "ichi" and "shichi", respectively, with frame number on the horizontal axis and speech power on the vertical axis. FIGS. 4B and 4C and FIGS. 5B and 5C are the local peak pattern diagrams and consonantal pattern diagrams for "ichi" and "shichi", respectively, each with frame number on the horizontal axis and channel number on the vertical axis. FIGS. 5D and 5E are the frequency spectrum and the normalized spectrum corresponding to the consonant portion of "shichi", with channel number on the horizontal axis.

For the utterance "ichi" shown in FIG. 4A, the region in which the local peak pattern obtained by the local peak pattern extraction unit (13 in FIG. 1) is 1 appears as the black portions in FIG. 4B. The region in which the consonantal pattern obtained by the consonantal pattern extraction unit (14 in FIG. 1) equals the normalized spectrum value at the corresponding channel number and frame number is shown as the black portion CO1 in FIG. 4C.

For the utterance "shichi" shown in FIG. 5A, on the other hand, the region in which the local peak pattern is 1 and the region in which the consonantal pattern equals the normalized spectrum value at the corresponding channel number and frame number appear as the black portions CO2 and CO3 shown in FIGS. 5B and 5C, respectively. In this case, the spectrum output for the word-initial portion of "shichi" in FIG. 5A appears as shown in FIG. 5D, and its normalized spectrum output is as shown in FIG. 5E. Condition (A) is satisfied, so the frames are determined to be consonantal. Furthermore, with the threshold THL2 of condition (B) set to 0, the region that satisfies NSP(i, j) > THL2, and in which the consonantal pattern therefore equals the normalized spectrum value at the corresponding channel number and frame number, is the one denoted CO2.

The difference in the word-initial consonantal patterns thus allows "ichi" and "shichi" to be distinguished accurately.

FIGS. 6A and 6B illustrate the total similarity and show how the consonantal pattern contributes to recognition. FIG. 6A shows the total similarity of the utterance "ichi", which has the speech pattern of FIG. 4A, to the category names "ichi" and "shichi". FIG. 6B shows the total similarity of the utterance "shichi", which has the speech pattern of FIG. 5A, to the category names "ichi" and "shichi". In these figures, one hatching pattern indicates the local peak similarity and the other indicates the consonantal similarity. As FIG. 6A shows, the total similarity obtained for the utterance "ichi" against the "ichi" category and that obtained against the standard pattern of "shichi" differ greatly. In either case, therefore, the two can be distinguished accurately owing to the difference in their consonantal patterns.

(Effects of the Invention) As is clear from the above description, the speech recognition device of the present invention performs recognition using both a time-series local peak pattern, which results from stable extraction of the features of the steady-state vowel portions, and a time-series consonantal pattern, which results from stable extraction of the features of the consonant portions. A more accurate similarity is obtained for each pattern, and as a result accurate and stable speech recognition that also takes consonant features into account can be performed.

[Brief Description of the Drawings]

FIG. 1 is a block diagram showing an embodiment of the speech recognition device of the present invention. FIG. 2A is a functional block diagram of the consonantal pattern extraction unit of the present invention. FIG. 2B is a flowchart showing the processing procedure of the consonantal pattern extraction unit of the present invention. FIG. 3 shows conventional local peak pattern extraction, for use in explaining the present invention. FIG. 4 illustrates the local peak pattern and consonantal pattern against the speech power of the utterance "ichi". FIG. 5 illustrates the local peak pattern and consonantal pattern against the speech power of the utterance "shichi". FIG. 6 illustrates the total similarity, showing the contribution of the consonantal pattern to recognition.

10: frequency analysis unit; 11: spectrum normalization unit; 12: speech interval detection unit; 13: local peak pattern extraction unit; 14: consonantal pattern extraction unit; 15: local peak similarity calculation unit; 16: local peak standard pattern storage unit; 18: consonantal standard pattern storage unit; 19: determination unit; 20: frame determination means; 21: first consonantal pattern value determination means; 22: second consonantal pattern value determination means.

Claims

[Claims]

1. A speech recognition device that extracts a time-series pattern of feature values of an input speech from the speech start time to the speech end time (the speech interval), computes the similarity between this time-series pattern and standard patterns prepared in advance to obtain a similarity for each recognition target category, and takes as the recognition result the name of the category having the maximum similarity among all recognition target categories, the device being characterized by comprising:
a) a spectrum normalization unit that performs frequency analysis over a plurality of channels (numbered center frequencies) and logarithmic conversion to extract a frequency spectrum, and then computes a normalized spectrum pattern by normalizing the frequency spectrum for the characteristics of the vocal cord sound source;
b) a consonantal pattern extraction unit that, within the speech interval and based on the magnitude relationship between the normalized spectrum values in the high and low channel regions, extracts a consonantal pattern in frames determined to have consonantal properties and does not extract one (sets the value to 0 for all channel components) in frames determined not to have consonantal properties, performing this processing successively to create a consonantal pattern;
c) a local peak pattern extraction unit that sets to 1 the channel components at which the value of the normalized spectrum pattern is positive and a local maximum and sets all other channel components to 0, performing this processing for all frames in the speech interval to create a local peak pattern;
d) a consonantal similarity calculation unit that computes the similarity between the consonantal pattern obtained in item b) and consonantal standard patterns prepared in advance, and obtains a consonantal similarity for each recognition target category;
e) a storage unit for the consonantal standard patterns;
f) a local peak similarity calculation unit that computes the similarity between the local peak pattern obtained in item c) and local peak standard patterns prepared in advance, and obtains a local peak similarity for each recognition target category; and
g) a determination unit that refers to both the consonantal similarity and the local peak similarity to compute a total similarity for each recognition target category, and takes as the recognition result the name of the category whose total similarity is the largest among all recognition target categories.

2. The speech recognition device according to claim 1, characterized in that the consonantal pattern extraction unit, in order to create the consonantal pattern by processing all frames in the speech interval, comprises:
a) frame determination means that determines a frame to have consonantal properties when the normalized spectrum values in its high channel region are larger than the normalized spectrum values in its low channel region;
b) first consonantal pattern value determination means that, when the frame is determined to have consonantal properties, sets the consonantal pattern value of each channel component whose normalized spectrum value in the frame exceeds a predetermined threshold to the normalized spectrum value of that channel, and sets the consonantal pattern value of all other channel components to 0; and
c) second consonantal pattern value determination means that, when the frame is not determined to have consonantal properties, sets the consonantal pattern value of the frame to 0 for all channel components.