JPH0682275B2

JPH0682275B2 - Voice recognizer

Info

Publication number: JPH0682275B2
Application number: JP1236471A
Authority: JP
Inventors: 康弘小森
Original assignee: ATR JIDO HONYAKU DENWA
Current assignee: ATR JIDO HONYAKU DENWA
Priority date: 1989-09-11
Filing date: 1989-09-11
Publication date: 1994-10-19
Anticipated expiration: 2009-10-19
Also published as: JPH0398098A

Description

[Detailed Description of the Invention]

[Field of Industrial Application]

The present invention relates to a speech recognition apparatus, and more particularly to a speech recognition apparatus that segments input speech into phoneme groups and recognizes phonemes by combining this segmentation method with phoneme identification neural networks applied to its results.

[Prior Art and Problems to Be Solved by the Invention]

Two kinds of conventional speech recognition methods have been proposed: a method that first segments a continuous speech waveform by inserting time divisions and then performs phoneme recognition, and the so-called phoneme spotting method, which determines the time divisions of the continuous speech waveform and recognizes the phonemes of those portions simultaneously.

However, in the former method, segmentation is performed with a uniform combination of simple parameters, such as changes in power and spectrum, regardless of the phonemic environment in which each phoneme occurs, so a high phoneme recognition rate cannot be obtained. Furthermore, the segmentation method is used only to identify phoneme sections; no method has been proposed that uses the phoneme groups obtained by segmentation to determine the final phoneme recognition result and thereby improve the phoneme recognition rate. The latter method has the drawback that phoneme misrecognitions and insertion errors are frequent near the boundaries between consecutive phonemes, so that a high phoneme recognition rate cannot be obtained.

Therefore, the main object of the present invention is to provide a speech recognition apparatus capable of high-accuracy phoneme recognition, which resolves phoneme misrecognition caused by segmentation errors as well as the phoneme misrecognition and insertion errors that the phoneme spotting method produces at phoneme boundaries, and which further uses the phoneme groups obtained by the segmentation method to determine the final phoneme recognition result.

[Means for Solving the Problems]

The invention according to claim 1 is a speech recognition apparatus for recognizing input speech, comprising: segmentation means for segmenting the input speech by determining a section and a certainty factor for each phoneme group, based on the acoustic features of the magnitude of power of the input speech in a certain frequency band, the amount of change in power in a certain frequency band, the amount of change in spectrum in a certain frequency band, and the ratio of power between a certain frequency band and another frequency band; a plurality of time-delay neural networks, each provided for a corresponding phoneme group, for recognizing the phonemes of the phoneme group given the maximum certainty factor among the segmented phoneme groups; and recognition means for recognizing speech based on the section and certainty factor of the determined phoneme group and on the phoneme recognized by the time-delay neural network.

The invention according to claim 2 comprises the same segmentation means as in claim 1, a time-delay neural network for recognizing the phonemes of the segmented phoneme group, function determining means for determining a function representing the validity between the determined phoneme group and the recognition result of the time-delay neural network, and recognition means for recognizing speech based on the section and certainty factor of the determined phoneme group and on the determined function.

The invention according to claim 3 comprises the segmentation means and the plurality of time-delay neural networks of claim 1, together with function determining means for determining a function representing the validity between the phoneme group determined by the segmentation means and the recognition result of the time-delay neural network, and recognition means for recognizing speech based on the section and certainty factor of the determined phoneme group and on the function.

[Operation]

The speech recognition apparatus according to the present invention segments the input speech by determining a section and a certainty factor for each phoneme group, recognizes the phonemes of the phoneme group given the maximum certainty factor by means of a time-delay neural network, and recognizes speech based on the section and certainty factor of the determined phoneme group and on the phoneme recognized by the time-delay neural network.

[Embodiments of the Invention]

FIG. 1 is a schematic block diagram of a speech recognition apparatus to which the present invention is applied. Referring to FIG. 1, the speech recognition apparatus includes an amplifier 1, a low-pass filter 2, an A/D converter 3, and a processing device 4. The amplifier 1 amplifies the input speech signal, and the low-pass filter 2 removes aliasing noise from the amplified speech signal. The A/D converter 3 samples the speech signal and converts it into a digital signal. The processing device 4 includes a computer 5, a magnetic disk 6, terminals 7, and a printer 8. The computer 5 performs speech recognition on the sampled digital speech signal received from the A/D converter 3, using the methods shown in FIGS. 2 to 5 and described later.

FIGS. 2 to 5 are diagrams showing the various methods of the present invention for identifying phonemes and recognizing speech.

First, the configuration common to the methods shown in FIGS. 2 to 5 will be described. Each of the methods shown in FIGS. 2 to 5 consists of three parts: a phoneme segmentation unit, a phoneme identification unit, and a phoneme determination unit. These are described in detail in a patent application previously filed by the present inventor (Japanese Patent Application No. Hei 1-61928), so only a brief description is given here. The phoneme segmentation unit is rule-based. It first detects phoneme candidates: for each phoneme class, global acoustic features on the spectrogram are used to detect the rough positions where a phoneme may exist. The phoneme classes here are, for example, unvoiced fricatives and voiced fricatives.

Next, the phonemic environment is hypothesized. That is, for each detected phoneme candidate, the types of the phonemes preceding and following it are hypothesized. Phoneme boundaries are then detected and the hypotheses verified under each hypothesized phonemic environment. Under a correct hypothesis, a high certainty factor is obtained and, as a result, the phonemic environment is detected. Under an incorrect hypothesis, conversely, the certainty factor is low and no phonemic environment is obtained. Whether a hypothesis is correct is judged from acoustic features on the spectrogram, namely the magnitude of power in a certain frequency band of the input speech, the amount of change in power, the amount of change in spectrum, the ratio of power to that of another frequency band, and the like. Finally, for each hypothesized phoneme class, the phoneme boundaries that give the maximum certainty factor are taken as the segmentation result, and the start and end of the phoneme and its phoneme class are output together with the certainty factor.
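The final selection step of the segmentation unit can be sketched as follows. This is a minimal illustration, not the rule base of the referenced application: the `Hypothesis` record, its field names, and the example numbers are assumptions introduced here, and the certainty factors are taken as already computed by the verification step.

```python
from typing import NamedTuple


class Hypothesis(NamedTuple):
    """One verified boundary hypothesis (hypothetical record layout)."""
    phoneme_class: str
    start_ms: int
    end_ms: int
    certainty: float


def best_per_class(hypotheses):
    """Keep, for each phoneme class, the hypothesis with the highest certainty."""
    best = {}
    for hyp in hypotheses:
        current = best.get(hyp.phoneme_class)
        if current is None or hyp.certainty > current.certainty:
            best[hyp.phoneme_class] = hyp
    return best


# Illustrative values only; they are not taken from the patent's experiments.
hypotheses = [
    Hypothesis("unvoiced fricative", 100, 180, 0.40),
    Hypothesis("unvoiced fricative", 110, 190, 0.62),
    Hypothesis("voiced fricative", 105, 185, 0.51),
]
segmentation_result = best_per_class(hypotheses)
```

Each entry of `segmentation_result` carries the start, end, phoneme class, and certainty factor that the later units consume.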

FIG. 6 is a diagram showing an example of a time-delay neural network (TDNN) for identifying phonemes. The method of identifying the phonemes of the segments detected as described above will now be explained with reference to FIG. 6. The time-delay neural network shown in FIG. 6 groups 18 consonants into six classes (voiced plosives, unvoiced plosives, nasals, voiced fricatives, unvoiced fricatives, and liquids), and each group is input to the input layer 11. The input layer 11 identifies the segmented phonemes through learning by the conventionally known back-propagation method. Identification of each class is performed by the intermediate layer 12. In this embodiment, the time-delay neural network is trained with the end position of every consonant aligned to the point two-thirds of the way from the front of the input layer 11. Likewise, in phoneme identification, the end of the segmentation result is applied to the same position of the input layer 11, and the phoneme for which the output layer 13 of the time-delay neural network gives the maximum certainty is taken as the identification result.
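The two-thirds alignment can be illustrated with a small helper that places the input window over the feature frames. This is a sketch under stated assumptions: the description does not give the length of the input layer, so the default of 15 frames and the function name `tdnn_window` are hypothetical.

```python
def tdnn_window(end_frame, n_input_frames=15):
    """Return (start, stop) frame indices of the network input window.

    The window is placed so that the consonant end frame lies two-thirds of
    the way from the front of the input layer. n_input_frames is an assumed
    value, not one stated in the description.
    """
    offset = (2 * n_input_frames) // 3  # frames between window start and consonant end
    start = end_frame - offset
    return start, start + n_input_frames


start, stop = tdnn_window(end_frame=40)
```

The same helper would be used for training (with labeled consonant ends) and for identification (with the segment end supplied by the segmentation unit). A caller has to pad the signal when `start` comes out negative.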

The phoneme determination unit shown in FIGS. 2 to 5 determines the phoneme that gives the maximum certainty factor, together with its section, using the segmentation result for each phoneme class and the phoneme identification result output by the time-delay neural network applied to that section.

The method shown in FIG. 2 identifies phonemes and recognizes speech by the simplest combination of the segmentation method and the phoneme identification method. The input speech is analyzed and its features are extracted, after which the segmentation unit determines, for example, that the certainty factor of an unvoiced fricative is 0.62 and that of a voiced fricative is 0.51. The unvoiced fricative, which has the larger certainty factor, is then selected and input to the time-delay neural network shown in FIG. 6, where phoneme identification is performed using the method disclosed in the aforementioned Japanese Patent Application No. Hei 1-61928, and the phoneme is thereby recognized.
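The flow of FIG. 2 can be sketched in a few lines, using the 0.62 and 0.51 certainty factors from the example above. The dictionary layout, the function names, and the `toy_tdnn` stand-in with its output scores are assumptions for illustration; a real system would run the trained network of FIG. 6 on the selected section.

```python
def pick_segment(segments):
    """Return the segmentation result with the highest certainty factor."""
    return max(segments, key=lambda seg: seg["certainty"])


def identify_phoneme(tdnn_outputs):
    """Return the phoneme whose output unit has the largest value."""
    return max(tdnn_outputs, key=tdnn_outputs.get)


def recognize(segments, tdnn):
    """Select the most certain segment, then identify the phoneme inside it."""
    best = pick_segment(segments)
    outputs = tdnn(best["start_ms"], best["end_ms"])
    return best["phoneme_class"], identify_phoneme(outputs)


def toy_tdnn(start_ms, end_ms):
    """Stand-in for the 18-consonant network; the scores are made up."""
    return {"s": 0.81, "sh": 0.12, "z": 0.07}


segments = [
    {"phoneme_class": "unvoiced fricative", "start_ms": 110, "end_ms": 190, "certainty": 0.62},
    {"phoneme_class": "voiced fricative", "start_ms": 105, "end_ms": 185, "certainty": 0.51},
]
result = recognize(segments, toy_tdnn)
```

In this arrangement the phoneme class from segmentation only decides which section is passed on; it does not constrain the phoneme that the network may return.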

The example shown in FIG. 3 identifies phonemes and recognizes speech by using the segmentation method to narrow down the phoneme group. In this example, the input speech is analyzed and its features are extracted, and the segmentation unit determines the result that gives the maximum certainty factor. Depending on whether that phoneme group is a voiced group or an unvoiced group, either the time-delay neural network for voiced consonant identification or the time-delay neural network for unvoiced consonant identification is selectively applied to identify the phoneme within the section.

In general, the fewer the kinds of phonemes to be distinguished, the higher the discrimination ability of a time-delay neural network. Therefore, provided that the phoneme classes of the segmentation results are not confused with one another, the identification rate can be expected to improve when a separate time-delay neural network performs phoneme identification for each class. In other words, the segmentation unit narrows down the phoneme class, and phoneme identification is then performed within that class.

FIG. 7 is a diagram showing an example of the time-delay neural network for voiced consonant identification and the time-delay neural network for unvoiced consonant identification described with reference to FIG. 3. The neural network for unvoiced consonant identification shown in FIG. 7(a) distinguishes the eight unvoiced consonants (p, t, k, ch, ts, s, sh, h) and includes an input layer 21, an intermediate layer 22, and an output layer 23. The time-delay neural network for voiced consonant identification shown in FIG. 7(b) distinguishes the seven voiced consonants (b, d, g, m, n, r, z) and includes an input layer 31, an intermediate layer 32, and an output layer 33.
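The network selection of FIG. 3 can be sketched as a dispatch on the phoneme class reported by the segmentation unit. The consonant sets are those listed for FIG. 7. Treating nasals and liquids as belonging to the voiced network follows from m, n, and r appearing in the voiced set, but the class names, the function names, and the toy networks are assumptions made for this illustration.

```python
UNVOICED_CONSONANTS = ("p", "t", "k", "ch", "ts", "s", "sh", "h")  # FIG. 7(a)
VOICED_CONSONANTS = ("b", "d", "g", "m", "n", "r", "z")            # FIG. 7(b)

# Assumed mapping: every other class (voiced plosive, nasal, voiced fricative,
# liquid) is routed to the voiced-consonant network.
UNVOICED_CLASSES = frozenset({"unvoiced plosive", "unvoiced fricative"})


def select_group(phoneme_class):
    """Decide which of the two networks handles the given phoneme class."""
    return "unvoiced" if phoneme_class in UNVOICED_CLASSES else "voiced"


def identify(phoneme_class, networks, segment):
    """Run only the network chosen by the segmentation result."""
    outputs = networks[select_group(phoneme_class)](segment)
    return max(outputs, key=outputs.get)


# Toy networks with made-up output values, one score per consonant in the set.
networks = {
    "unvoiced": lambda seg: {c: (0.9 if c == "sh" else 0.01) for c in UNVOICED_CONSONANTS},
    "voiced": lambda seg: {c: (0.9 if c == "z" else 0.01) for c in VOICED_CONSONANTS},
}
```

Because each network has fewer output classes than the 18-consonant network, a wrong voiced/unvoiced decision in segmentation cannot be corrected afterwards; the phoneme is always drawn from the selected set.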

The example shown in FIG. 4 identifies phonemes and recognizes speech using a function that represents the validity between the phoneme group from the segmentation method and the result of the phoneme identification method. As in the embodiments described with reference to FIGS. 2 and 3, the segmentation unit determines the certainty factors of the unvoiced fricative and the voiced fricative, after which the time-delay neural network shown in FIG. 6 identifies the phoneme within the section and the phoneme is recognized. That is, in the example shown in FIG. 4, the candidate phoneme sections and their phoneme groups are output, so that the validity between the phoneme identified by the time-delay neural network and the phoneme class of the segmentation result can be taken into account. Both phoneme segmentation and phoneme identification performance can therefore be expected to improve.

As an example of the function representing this validity, the following Equations (1) and (2) are used, and the phoneme that gives the maximum certainty factor is taken as the recognition result.

CFrec = combine(CFseg, CFnn) ... (1)

CFnn = k · Wnn · f(arg(seg), arg(nn)) ... (2)

where

CFrec: certainty factor of the final phoneme recognition

CFseg: certainty factor of the segmentation result

CFnn: certainty factor of the phoneme identification result

Wnn: output value of the time-delay neural network for the identified phoneme

arg(seg): phoneme class of the segmentation result

arg(nn): phoneme identified by the time-delay neural network

k: coefficient (the reliability of the time-delay neural network; the larger k is, the more the output of the time-delay neural network is trusted)

f(): function indicating the validity between the identified phoneme and the phoneme class. It gives 1.0 if the phoneme identified by the time-delay neural network belongs to the phoneme class of the segmentation result, 1.0 if it does not belong, and 0.5 if the voiced/unvoiced distinction matches.

combine(): the certainty factor calculation model of MYCIN

The example shown in FIG. 5 selects the phoneme identification means by using the segmentation method to narrow down the phoneme group, and then identifies phonemes and recognizes speech using the function that represents the validity between the phoneme group from the segmentation method and the result of the phoneme identification method.
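Equations (1) and (2) can be sketched as follows. The description names only "the certainty factor calculation model of MYCIN" for combine(), so the sketch assumes the commonly cited MYCIN combining function for two certainty factors in the range -1 to 1, with the revised rule for mixed signs. The value of f is passed in by the caller rather than computed, and the function names are hypothetical.

```python
def cf_nn(k, w_nn, f_value):
    """Equation (2): CFnn = k * Wnn * f(arg(seg), arg(nn)).

    f_value is the validity value chosen by the caller for the pair of
    segmentation class and identified phoneme.
    """
    return k * w_nn * f_value


def mycin_combine(a, b):
    """Assumed form of combine(): merge two certainty factors in [-1, 1]."""
    if a >= 0 and b >= 0:
        return a + b * (1 - a)
    if a < 0 and b < 0:
        return a + b * (1 + a)
    denominator = 1 - min(abs(a), abs(b))
    if denominator == 0:
        raise ValueError("cannot combine certainty factors of +1 and -1")
    return (a + b) / denominator


def cf_rec(cf_seg, k, w_nn, f_value):
    """Equation (1): CFrec = combine(CFseg, CFnn)."""
    return mycin_combine(cf_seg, cf_nn(k, w_nn, f_value))


# Illustrative inputs: CFseg = 0.62 from the FIG. 2 example, k = 0.4 from the
# experiments; Wnn = 0.9 and f = 1.0 are made-up values.
example = cf_rec(cf_seg=0.62, k=0.4, w_nn=0.9, f_value=1.0)
```

Under this combining rule two supporting certainty factors reinforce each other without ever exceeding 1, and a larger k lets the network output move the final certainty factor further from the segmentation certainty alone.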

FIG. 8 is a table showing the consonant recognition results obtained with each method of the present invention. Experiments were conducted under varied conditions: using the 18-consonant time-delay neural network versus the two voiced/unvoiced time-delay neural networks; considering or not considering the validity between the phoneme identified by the time-delay neural network and the phoneme class of the segmentation result; and, when validity is considered, how far the output of the time-delay neural network is trusted. In FIG. 8, 18-CONS-TDNN denotes the case where the 18-consonant time-delay neural network is used, V/UV-TDNN denotes the case where the two voiced/unvoiced time-delay neural networks are used, NO COMB denotes the case where the validity between the phoneme identified by the time-delay neural network and the phoneme class of the segmentation result is not considered, and with COMB denotes the case where it is considered.

Two values, k = 0.4 and k = 0.8, were used for the degree of dependence on the time-delay neural network in Equations (1) and (2) above. The larger k is, the more the output of the time-delay neural network is trusted. Recognition Rate is the rate at which both phoneme segmentation and phoneme identification were performed correctly. Insertion Error Rate is the rate of insertion errors. Segmentation Rate is the proportion of phonemes judged to be correctly segmented, meaning that the start and end boundaries were detected with an error within 50 msec. Boundary Alignment Error is the deviation of the correctly detected boundaries from the visually inspected labels. within Correct Segmentation Rate is the phoneme identification rate within the sections correctly segmented by the present invention. The table in FIG. 8 demonstrates the effectiveness of applying the time-delay neural network after narrowing down the phoneme class, and the effectiveness of considering the validity between the phoneme identified by the time-delay neural network and the phoneme class of the segmentation result.
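The Segmentation Rate criterion can be made concrete with a short sketch. It assumes that a segment counts as correct when both its start and its end lie within 50 msec of the reference label; the tuple layout, the function names, and the sample boundaries are illustrative and are not the data behind FIG. 8.

```python
def is_correct_segmentation(detected, reference, tolerance_ms=50):
    """True when both the start and the end boundary are within the tolerance."""
    start_error = abs(detected[0] - reference[0])
    end_error = abs(detected[1] - reference[1])
    return start_error <= tolerance_ms and end_error <= tolerance_ms


def segmentation_rate(pairs, tolerance_ms=50):
    """Fraction of (detected, reference) pairs that are correctly segmented."""
    if not pairs:
        return 0.0
    correct = sum(is_correct_segmentation(d, r, tolerance_ms) for d, r in pairs)
    return correct / len(pairs)


# (start_ms, end_ms) of the detected segment and of the reference label.
pairs = [
    ((100, 200), (120, 240)),  # errors of 20 and 40 msec: correct
    ((0, 90), (10, 150)),      # end boundary off by 60 msec: incorrect
]
rate = segmentation_rate(pairs)
```

The identification rate within correctly segmented sections would then be computed only over the pairs for which `is_correct_segmentation` returns true.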

The narrowing down of phoneme groups is not limited to a division into voiced and unvoiced sounds. Divisions such as fricatives, nasals, and plosives are also possible, and a phoneme identification method suited to the chosen division may be applied.

Although the phoneme identification in the above embodiments uses time-delay neural networks, other general statistical methods of identifying phonemes within a phoneme group may be used instead. For example, phoneme identification by an ordinary neural network, by HMM, by Bayes' rule, by linear discrimination, or by standard patterns designed with a method such as LVQ is applicable.

[Effects of the Invention]

As described above, according to the present invention, the input speech is segmented by determining a section and a certainty factor for each phoneme group, the phonemes of the phoneme group given the maximum certainty factor are recognized, and speech is recognized based on the section and certainty factor of the determined phoneme group and on the phoneme recognized by the time-delay neural network. High-performance phoneme recognition is thereby made possible.

[Brief description of drawings]

FIG. 1 is a schematic block diagram of the entire speech recognition apparatus to which an embodiment of the present invention is applied. FIG. 2 is a diagram showing an example, in an embodiment of the present invention, of identifying phonemes and recognizing speech by the simplest combination of the segmentation method and the phoneme identification method. FIG. 3 is a diagram showing an example of identifying phonemes and recognizing speech by using the segmentation method to narrow down the phoneme group. FIG. 4 is a diagram showing an example of identifying phonemes and recognizing speech by using a function that represents the validity between the phoneme group from the segmentation method and the result of the phoneme identification method. FIG. 5 is a diagram showing an example of identifying phonemes and recognizing speech by using the segmentation method to narrow down the phoneme group and by using the function that represents the validity between the phoneme group from the segmentation method and the result of the phoneme identification method. FIG. 6 is a diagram showing an example of the time-delay neural network for 18-consonant identification used in FIGS. 2 and 4. FIG. 7 is a diagram showing an example of the separate voiced and unvoiced time-delay neural networks for consonant identification used in the embodiments of FIGS. 3 and 5. FIG. 8 is a table showing the phoneme recognition results obtained with each method of the present invention.

In the figures, 1 denotes an amplifier, 2 a low-pass filter, 3 an A/D converter, 4 a processing device, 5 a computer, 6 a magnetic disk, 7 terminals, 8 a printer, 11, 21, and 31 input layers, 12, 22, and 32 intermediate layers, and 13, 23, and 33 output layers.

Claims

[Claims]

1. A speech recognition apparatus for recognizing input speech, comprising: segmentation means for segmenting the input speech by determining a section and a certainty factor for each phoneme group, based on the acoustic features of the magnitude of power of the input speech in a certain frequency band, the amount of change in power in a certain frequency band, the amount of change in spectrum in a certain frequency band, and the ratio of power between a certain frequency band and another frequency band; a plurality of time-delay neural networks, each provided for a corresponding phoneme group, for recognizing the phonemes of the phoneme group given the maximum certainty factor among the phoneme groups segmented by the segmentation means; and recognition means for recognizing speech based on the section and certainty factor of the phoneme group determined by the segmentation means and on the phoneme recognized by the time-delay neural network.

2. A speech recognition apparatus for recognizing input speech, comprising: segmentation means for segmenting the input speech by determining a section and a certainty factor for each phoneme group, based on the acoustic features of the magnitude of power of the input speech in a certain frequency band, the amount of change in power in a certain frequency band, the amount of change in spectrum in a certain frequency band, and the ratio of power between a certain frequency band and another frequency band; a time-delay neural network for recognizing the phonemes of the phoneme group segmented by the segmentation means; function determining means for determining a function representing the validity between the phoneme group determined by the segmentation means and the recognition result of the time-delay neural network; and recognition means for recognizing speech based on the section and certainty factor of the phoneme group determined by the segmentation means and on the function determined by the function determining means.

3. A speech recognition apparatus for recognizing input speech, comprising: segmentation means for segmenting the input speech by determining a section and a certainty factor for each phoneme group, based on the acoustic features of the magnitude of power of the input speech in a certain frequency band, the amount of change in power in a certain frequency band, the amount of change in spectrum in a certain frequency band, and the ratio of power between a certain frequency band and another frequency band; a plurality of time-delay neural networks, each provided for a corresponding phoneme group, for recognizing the phonemes of the phoneme group given the maximum certainty factor among the phoneme groups segmented by the segmentation means; function determining means for determining a function representing the validity between the phoneme group determined by the segmentation means and the recognition result of the time-delay neural network; and recognition means for recognizing speech based on the section and certainty factor of the phoneme group determined by the segmentation means and on the function determined by the function determining means.