JPH0398098A

JPH0398098A - Voice recognition device

Info

Publication number: JPH0398098A
Application number: JP1236471A
Authority: JP
Inventors: Yasuhiro Komori; 康弘小森
Original assignee: A T R JIDO HONYAKU DENWA KENKYUSHO KK
Current assignee: A T R JIDO HONYAKU DENWA KENKYUSHO KK
Priority date: 1989-09-11
Filing date: 1989-09-11
Publication date: 1991-04-23
Anticipated expiration: 2009-10-19
Also published as: JPH0682275B2

Abstract

PURPOSE: To enable high-performance phoneme recognition by taking the phoneme group result obtained by a segmentation method and combining it with a phoneme identification method. CONSTITUTION: The device consists of an amplifier 1, a low-pass filter 2, an A/D converter 3, and a processing device 4. The amplifier 1 amplifies an input speech signal, and the low-pass filter 2 removes aliasing noise from the speech signal. The A/D converter 3 samples the speech signal and converts it into a digital signal. The processing device 4 has a computer 5, a magnetic disk 6, terminals 7, and a printer 8. Based on the digital signal of the sampled speech input from the A/D converter 3, the computer 5 detects phoneme intervals and their phoneme groups by the segmentation method and combines the phoneme groups with the phoneme identification result. Consequently, high-performance phoneme identification becomes possible.

Description

[Detailed Description of the Invention] [Industrial Application Field] The present invention relates to a speech recognition device, and in particular to a speech recognition device that segments input speech by phoneme group and recognizes phonemes by combining this segmentation method with the phoneme identification neural networks applied to its results.

[Prior Art and Problems to Be Solved by the Invention] Two kinds of speech recognition methods have conventionally been proposed: methods that first segment a continuous speech waveform into time intervals and then perform phoneme recognition, and so-called phoneme spotting methods that determine the time intervals of a continuous speech waveform and recognize the phonemes in those intervals simultaneously.

However, the former methods segment speech using uniform combinations of simple parameters, such as power and changes in the spectrum, regardless of the phonological environment in which each phoneme occurs, and therefore cannot achieve a high phoneme recognition rate.

Furthermore, segmentation methods have been used only to identify phoneme intervals; no method has been proposed that uses the phoneme groups obtained by segmentation to determine the final phoneme recognition result and thereby improve the phoneme recognition rate. The latter methods, in turn, suffer from many phoneme misrecognitions and insertion errors near the boundaries between consecutive phonemes, and as a result cannot achieve a high phoneme recognition rate.

Therefore, the main object of the present invention is to provide a speech recognition device capable of highly accurate phoneme recognition that resolves the phoneme misrecognition caused by segmentation errors, as well as the phoneme misrecognition and insertion errors at phoneme boundaries caused by the phoneme spotting method, and that further uses the phoneme groups obtained by the segmentation method to determine the final phoneme recognition result.

[Means for Solving the Problems] The present invention is a speech recognition device that recognizes input speech, comprising detection means for detecting, from the input speech, the position or interval of each preset phoneme group, and identification means for identifying, from the input speech, the phonemes within a preset phoneme group using a phoneme identification neural network. Speech recognition is performed based on the position or interval of each phoneme group detected by the detection means and the phonemes identified by the identification means.

More preferably, the detection means determines the segmentation result and its phoneme group based on acoustic features such as the magnitude of power in a certain frequency band of the input speech, the amount of change in power in a certain frequency band, the amount of change in the spectrum in a certain frequency band, and the ratio of power between one frequency band and another. The phoneme group of the determined segmentation result is used to narrow down the phoneme groups of the phoneme identification neural networks to be applied, and phonemes are recognized by applying the phoneme identification neural network corresponding to the narrowed-down phoneme group.

Still more preferably, the detection means estimates the segmentation and its phoneme group based on acoustic features such as the magnitude of power in a certain frequency band of the input speech, the amount of change in power in a certain frequency band, the amount of change in the spectrum in a certain frequency band, and the ratio of power between one frequency band and another. A phoneme identification neural network is applied to the estimated segmentation and the phoneme group of the estimated segmentation candidate to perform phoneme identification, and the final phoneme recognition result is determined using a function that expresses the validity of the phoneme identification result with respect to the phoneme group of the segmentation candidate obtained by the detection means.

Still more preferably, the detection means estimates the segmentation and its phoneme group based on acoustic features such as the magnitude of power in a certain frequency band of the input speech, the amount of change in power in a certain frequency band, the amount of change in the spectrum in a certain frequency band, and the ratio of power between one frequency band and another. The phoneme group of the estimated segmentation result is used to narrow down the phoneme groups of the phoneme identification neural networks to be applied, and phoneme identification is performed by applying the phoneme identification neural network corresponding to the narrowed-down phoneme group. The final phoneme recognition result is then determined by a function that expresses the validity of this phoneme identification result with respect to the phoneme group of the segmentation candidate obtained by the detection means.
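The four kinds of acoustic feature named above can be illustrated with a minimal sketch. It assumes each analysis frame is a list of power-spectrum values indexed by frequency bin; the function names, the bin-range convention, and the use of a Euclidean distance for the spectral change are illustrative assumptions, since the text does not fix any of them.

```python
import math

def band_power(frame, lo, hi):
    """Power in one frequency band: sum of the power-spectrum bins lo..hi-1."""
    return sum(frame[lo:hi])

def power_change(prev_frame, frame, lo, hi):
    """Change in band power between two successive frames."""
    return band_power(frame, lo, hi) - band_power(prev_frame, lo, hi)

def spectral_change(prev_frame, frame, lo, hi):
    """Change in spectral shape within a band (Euclidean distance, an assumption)."""
    return math.sqrt(sum((b - a) ** 2
                         for a, b in zip(prev_frame[lo:hi], frame[lo:hi])))

def band_power_ratio(frame, band_a, band_b, eps=1e-12):
    """Ratio of the power in band_a to the power in band_b (eps avoids division by zero)."""
    return band_power(frame, *band_a) / (band_power(frame, *band_b) + eps)
```

A rule-based segmenter of the kind described would threshold and combine values like these per phoneme class; the rules themselves are not given in this text.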

[Operation] The speech recognition device according to the present invention takes the phoneme group result obtained by the segmentation method and combines it with the phoneme identification method to determine the final phoneme recognition result. This makes high-performance phoneme recognition possible and allows a high-performance speech recognition device to be built.

[Embodiments of the Invention] FIG. 1 is a schematic block diagram of a speech recognition device to which the present invention is applied. Referring to FIG. 1, the speech recognition device includes an amplifier 1, a low-pass filter 2, an A/D converter 3, and a processing device 4. The amplifier 1 amplifies the input speech signal, and the low-pass filter 2 removes aliasing noise from the amplified speech signal.

The A/D converter 3 samples the speech signal and converts it into a digital signal. The processing device 4 includes a computer 5, a magnetic disk 6, terminals 7, and a printer 8. Based on the digital signal of the sampled speech input from the A/D converter 3, the computer 5 performs speech recognition using the methods shown in FIGS. 2 to 5, which are described below.

FIGS. 2 to 5 are diagrams showing the various schemes by which the present invention identifies phonemes and recognizes speech.

First, the configuration common to the schemes shown in FIGS. 2 to 5 is described. Each scheme consists of three parts: a phoneme segmentation unit, a phoneme identification unit, and a phoneme determination unit. These are described in detail in a patent application previously filed by the present inventor (Japanese Patent Application No. Hei 1-61928), so only a brief description is given here. The phoneme segmentation unit is rule-based. It first detects phoneme candidates: for each phoneme class, global acoustic features on the spectrogram are used to detect the rough positions where a phoneme may exist. The phoneme classes here are, for example, voiceless fricatives and voiced fricatives.

Next, the phonological environment is hypothesized: for each detected phoneme candidate, the types of phonemes that precede and follow it are hypothesized. Phoneme boundaries are then detected and the hypotheses verified under each hypothesized phonological environment. Under a correct hypothesis, a high confidence is obtained and, as a result, the phonological environment is detected. Under an incorrect hypothesis, the confidence is low and no phonological environment is obtained. Whether a hypothesis is correct is judged from acoustic features on the spectrogram, namely the magnitude of power in a certain frequency band of the input speech, the amount of change in power, the amount of change in the spectrum, and the ratio of power relative to another frequency band. Finally, for each hypothesized phoneme class, the phoneme boundaries that give the maximum confidence are taken as the segmentation result, and the start and end points of the phoneme and its phoneme class are output together with the confidence.
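The final selection step, keeping for each hypothesized phoneme class the boundaries with the maximum confidence, can be sketched as follows. The `Hypothesis` record and its field names are hypothetical, and the confidence values are assumed to have been computed already by the rule-based verification, which is not reproduced here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Hypothesis:
    phoneme_class: str   # e.g. "voiceless_fricative" (label chosen for illustration)
    start_ms: float      # hypothesized start boundary
    end_ms: float        # hypothesized end boundary
    confidence: float    # confidence obtained when verifying the hypothesis

def best_per_class(hypotheses):
    """Keep, for each phoneme class, the hypothesis with the highest confidence."""
    best = {}
    for h in hypotheses:
        current = best.get(h.phoneme_class)
        if current is None or h.confidence > current.confidence:
            best[h.phoneme_class] = h
    return best
```

The returned mapping corresponds to the segmentation output described above: one interval per phoneme class, with its start, end, and confidence.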

FIG. 6 is a diagram showing an example of a time-delay neural network (TDNN) for identifying phonemes. Referring to FIG. 6, the method of identifying the phoneme of a segment detected as described above is explained next.

The time-delay neural network shown in FIG. 6 groups 18 consonants into six classes (voiced plosives, voiceless plosives, nasals, voiced fricatives, voiceless fricatives, and liquids), and each group is input to the input layer 11. The input layer 11 identifies the segmented phonemes through training by the conventionally known backpropagation method. Identification of each class is performed by the intermediate layer 12.

In this embodiment, the time-delay neural network is trained with the end position of every consonant aligned to the point two-thirds of the way from the front of the input layer 11. Likewise, in phoneme identification, the end of the segmentation result is placed at the same position in the input layer 11, and the phoneme for which the output layer 13 of the time-delay neural network gives the maximum confidence is taken as the identification result.
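The alignment rule can be sketched as a small windowing helper that places the end of a segment two-thirds of the way into the network's input window. The window length is a free parameter here (the text does not state the number of input frames), and padding out-of-range frames with a fill value is an assumption made only so the sketch is self-contained.

```python
def window_bounds(segment_end_frame, window_frames):
    """Return (start, stop) frame indices of the input window such that
    segment_end_frame lies two-thirds of the way from the front of the window."""
    offset = (2 * window_frames) // 3   # integer position of the 2/3 point
    start = segment_end_frame - offset
    return start, start + window_frames

def extract_window(frames, segment_end_frame, window_frames, pad=None):
    """Cut the aligned window out of a frame sequence, padding where the
    window extends beyond the available frames (padding is an assumption)."""
    start, stop = window_bounds(segment_end_frame, window_frames)
    return [frames[i] if 0 <= i < len(frames) else pad
            for i in range(start, stop)]
```

Using the same helper for training and for identification keeps the segment end at the same input position in both cases, which is the point of the rule described above.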

The phoneme determination unit shown in FIGS. 2 to 5 uses, for each phoneme class, the segmentation result and the phoneme identification result output by the time-delay neural network applied to that interval to determine the phoneme and interval that give the maximum confidence.

The scheme shown in FIG. 2 identifies phonemes and recognizes speech by the simplest combination of the segmentation method and the phoneme identification method. The input speech is analyzed and its features are extracted, after which the segmentation unit determines, for example, that the confidence of a voiceless fricative is 0.62 and the confidence of a voiced fricative is 0.51.

The voiceless fricative, which has the higher confidence, is then selected and input to the time-delay neural network shown in FIG. 6, where phoneme identification is performed using the scheme disclosed in the aforementioned Japanese Patent Application No. Hei 1-61928, and the phoneme is thereby recognized.
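The FIG. 2 flow can be summarized in a few lines: take the segmentation candidate with the highest confidence, run the consonant network on its interval, and report the phoneme with the largest output. In this sketch the network is represented by a caller-supplied function `classify`, a hypothetical stand-in for the 18-consonant time-delay neural network, and the candidate tuple layout is likewise an assumption.

```python
def recognize_fig2(candidates, classify):
    """candidates: list of (phoneme_class, start_ms, end_ms, confidence).
    classify: callable (start_ms, end_ms) -> {phoneme: network output value}.
    Returns (phoneme_class, phoneme, output value) for the most confident segment."""
    # Select the segmentation candidate with the highest confidence.
    phoneme_class, start_ms, end_ms, _ = max(candidates, key=lambda c: c[3])
    # Apply the network to that interval and take its strongest output.
    outputs = classify(start_ms, end_ms)
    phoneme = max(outputs, key=outputs.get)
    return phoneme_class, phoneme, outputs[phoneme]
```

With the confidences quoted in the text (0.62 for the voiceless fricative, 0.51 for the voiced fricative), the voiceless-fricative interval is the one passed to the network.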

The example shown in FIG. 3 identifies phonemes and recognizes speech by using the segmentation method to narrow down the phoneme groups. In this example, the input speech is analyzed and its features are extracted, and the segmentation unit determines the result that gives the maximum confidence. Depending on whether the phoneme group of that result is a voiced group or a voiceless group, either the time-delay neural network for voiced consonant identification or the time-delay neural network for voiceless consonant identification is selectively applied to identify the phoneme within the interval.

In general, the fewer the phoneme types to be distinguished, the higher the discrimination ability of a time-delay neural network. Therefore, provided the phoneme classes of the segmentation results are not confused with one another, using time-delay neural networks that each identify phonemes within a single class can be expected to improve the identification rate. In other words, the segmentation unit narrows down the phoneme class, and the phoneme is then identified within that class.

FIG. 7 is a diagram showing examples of the time-delay neural network for voiced consonant identification and the time-delay neural network for voiceless consonant identification described with reference to FIG. 3. The time-delay neural network for voiceless consonant identification shown in FIG. 7(a) identifies the eight voiceless consonants (p, t, k, ch, ts, s, sh, h) and includes an input layer 21, an intermediate layer 22, and an output layer 23. The time-delay neural network for voiced consonant identification shown in FIG. 7(b) identifies the seven voiced consonants (b, d, g, m, n, r, z) and includes an input layer 31, an intermediate layer 32, and an output layer 33.
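The FIG. 3 selection between the two networks can be sketched as a lookup from the segmentation class to a voicing group, followed by a call to the matching network. The English class labels, the `networks` dictionary of callables, and the consonant inventories as constants are illustrative assumptions; the fourth voiceless consonant is read here as "ch", which is an interpretation of the garbled source text.

```python
# Consonant inventories as listed for FIG. 7 ("ch" is an assumed reading).
VOICELESS_CONSONANTS = ("p", "t", "k", "ch", "ts", "s", "sh", "h")
VOICED_CONSONANTS = ("b", "d", "g", "m", "n", "r", "z")

# Voicing of the six segmentation classes (labels are illustrative).
VOICED_CLASSES = {"voiced_plosive", "nasal", "voiced_fricative", "liquid"}
VOICELESS_CLASSES = {"voiceless_plosive", "voiceless_fricative"}

def voicing_of(phoneme_class):
    """Map a segmentation class to the voicing group that selects the network."""
    if phoneme_class in VOICED_CLASSES:
        return "voiced"
    if phoneme_class in VOICELESS_CLASSES:
        return "voiceless"
    raise ValueError(f"unknown phoneme class: {phoneme_class}")

def recognize_fig3(phoneme_class, segment, networks):
    """networks: {"voiced": fn, "voiceless": fn}, each fn(segment) -> {phoneme: output}.
    Applies only the network matching the segment's voicing group."""
    outputs = networks[voicing_of(phoneme_class)](segment)
    return max(outputs, key=outputs.get)
```

Because each network only has to separate seven or eight consonants instead of eighteen, this arrangement reflects the narrowing-down argument made for FIG. 3.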

The example shown in FIG. 4 identifies phonemes and recognizes speech using a function that expresses the validity of the phoneme identification result with respect to the phoneme group given by the segmentation method. As in the embodiments described with reference to FIGS. 2 and 3, the segmentation unit determines the confidences of the voiceless fricative and the voiced fricative, and the time-delay neural network shown in FIG. 6 is then used to identify the phoneme within each interval. That is, in the example shown in FIG. 4, candidate phoneme intervals and their phoneme groups are output, so that the validity of the phoneme identified by the time-delay neural network with respect to the phoneme class of the segmentation result can be taken into account. Both phoneme segmentation and phoneme identification performance can therefore be expected to improve.

As an example of the function expressing this validity, the following equations (1) and (2) are used, and the phoneme that gives the maximum certainty factor is taken as the recognition result.

$$CF_{rec} = \mathrm{combine}(CF_{seg},\ CF_{nn}) \tag{1}$$

$$CF_{nn} = k \cdot W_{nn} \cdot f(\arg(seg),\ \arg(nn)) \tag{2}$$

where
CF_rec: certainty factor of the final phoneme recognition
CF_seg: certainty factor of the segmentation result
CF_nn: certainty factor of the phoneme identification result
W_nn: output value of the time-delay neural network for the identified phoneme
arg(seg): phoneme class of the segmentation result
arg(nn): phoneme identified by the time-delay neural network
k: coefficient expressing the reliability of the time-delay neural network (the larger k is, the more the output of the time-delay neural network is trusted)
f(): function expressing the validity of the identified phoneme with respect to the phoneme class. It gives 1.0 if the phoneme identified by the time-delay neural network belongs to the phoneme class of the segmentation result, 1.0 (as printed) if it does not, and 0.5 if the voiced/voiceless distinction matches.

combine(): the certainty factor calculation model of MYCIN.

The example shown in FIG. 5 selects the phoneme identification means by using the segmentation method to narrow down the phoneme groups, and identifies phonemes and recognizes speech using a function that expresses the validity of the phoneme identification result with respect to the phoneme group given by the segmentation method.
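Equations (1) and (2) can be sketched in code under several stated assumptions. The text names only "the certainty factor calculation model of MYCIN" for combine(); the sketch uses the commonly cited MYCIN rule for merging two certainty factors in [-1, 1], which may differ in detail from what the inventors used. The validity function is read as: same class gives 1.0, a different class with matching voicing gives 0.5, and anything else gives a value the caller must supply through `mismatch`, because the value printed in the source for that case is unclear. All function and parameter names are hypothetical.

```python
def validity(identified, seg_class, class_of, voiced_classes, *, mismatch):
    """f(arg(seg), arg(nn)): agreement between the identified phoneme and the
    segmentation class. `mismatch` has no default because the source value is unclear."""
    nn_class = class_of[identified]
    if nn_class == seg_class:
        return 1.0
    if (nn_class in voiced_classes) == (seg_class in voiced_classes):
        return 0.5
    return mismatch

def cf_nn(k, w_nn, f_value):
    """Equation (2): certainty factor of the phoneme identification result."""
    return k * w_nn * f_value

def combine(cf1, cf2):
    """Equation (1), assuming the commonly cited MYCIN combination rule."""
    if cf1 >= 0 and cf2 >= 0:
        return cf1 + cf2 * (1 - cf1)
    if cf1 < 0 and cf2 < 0:
        return cf1 + cf2 * (1 + cf1)
    denom = 1 - min(abs(cf1), abs(cf2))
    # Total contradiction (+1 against -1) is undefined; 0.0 is an assumption.
    return 0.0 if denom == 0 else (cf1 + cf2) / denom
```

For example, a segmentation confidence of 0.62 combined with a network output of 0.9, k = 0.4, and a matching class gives a final certainty factor of about 0.757, higher than either input alone.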

FIG. 8 is a table showing the consonant recognition results obtained with each scheme of the present invention. Experiments were conducted under varied conditions: using either the 18-consonant identification time-delay neural network or the two voiced/voiceless time-delay neural networks; considering or not considering the validity of the phoneme identified by the time-delay neural network with respect to the phoneme class of the segmentation result; and, when validity is considered, varying how much the output of the time-delay neural network is trusted. In FIG. 8, 18-CONS-TDNN denotes the case where the 18-consonant identification time-delay neural network is used, V/UV-TDNN denotes the case where the two voiced/voiceless time-delay neural networks are used, No COMB denotes the case where the validity of the phoneme identified by the time-delay neural network with respect to the phoneme class of the segmentation result is not considered, and with COMB denotes the case where it is considered.

Two values, k = 0.4 and k = 0.8, were used for the degree of dependence on the time-delay neural network in equations (1) and (2) above. The larger k is, the more the output of the time-delay neural network is trusted. Recognition Rate is the rate at which both phoneme segmentation and phoneme identification were performed correctly. Insertion Error Rate is the rate of insertion errors. Segmentation Rate is the proportion of phonemes whose start and end boundary errors were within 50 msec and which were therefore judged to be correctly segmented. Boundary Alignment Error is the deviation of the correctly detected boundaries from the visually inspected labels. Within Correct Segmentation Rate is the phoneme identification rate within the intervals correctly segmented by the present invention. The table in FIG. 8 demonstrates the effectiveness of applying the time-delay neural networks after narrowing down the phoneme class, and the effectiveness of taking into account the validity of the phoneme identified by the time-delay neural network with respect to the phoneme class of the segmentation result.
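The Segmentation Rate and Boundary Alignment Error measures can be illustrated with a short sketch. It assumes each detected segment has already been paired with its reference label as `((start, end), (ref_start, ref_end))` in milliseconds, and it reports the alignment error as a mean absolute deviation; both the pairing and the choice of mean absolute deviation are assumptions, since the text does not specify how FIG. 8 was computed.

```python
def is_correct(detected, reference, tol_ms=50.0):
    """A segment counts as correctly segmented when both its start and end
    boundaries lie within tol_ms of the reference boundaries."""
    return all(abs(d - r) <= tol_ms for d, r in zip(detected, reference))

def segmentation_rate(pairs, tol_ms=50.0):
    """Fraction of (detected, reference) pairs judged correctly segmented."""
    if not pairs:
        return 0.0
    return sum(is_correct(d, r, tol_ms) for d, r in pairs) / len(pairs)

def mean_alignment_error(pairs, tol_ms=50.0):
    """Mean absolute boundary deviation (ms) over correctly segmented pairs,
    or None when no pair is correct."""
    errors = [abs(d - r)
              for det, ref in pairs if is_correct(det, ref, tol_ms)
              for d, r in zip(det, ref)]
    return sum(errors) / len(errors) if errors else None
```

The 50 ms tolerance is the value stated in the text; it is exposed as a parameter so that stricter or looser criteria can be compared.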

The narrowing down of phoneme groups is not limited to a voiced/voiceless division; divisions such as fricatives, nasals, and plosives are also possible, and an identification method suited to the chosen division may be applied.

In the phoneme identification schemes of the embodiments above, time-delay neural networks were used, but other general statistical methods of identifying phonemes within a phoneme group may be used instead. For example, phoneme identification by ordinary neural networks, by HMMs, by Bayes' rule, by linear discrimination, or by standard patterns designed with methods such as LVQ is applicable.

[Effects of the Invention] As described above, according to the present invention, phoneme intervals and their phoneme groups are detected from the input speech by the segmentation method, and combining these phoneme groups with the phoneme identification results enables highly accurate phoneme identification. In addition, phoneme misrecognition caused by segmentation errors, and phoneme misrecognition and insertion errors at phoneme boundaries caused by the phoneme spotting method, can be detected. As a result, high-performance phoneme recognition becomes possible.

[Brief explanation of drawings]

FIG. 1 is a schematic block diagram of the entire speech recognition device to which an embodiment of the present invention is applied.
FIG. 2 is a diagram showing an example, in an embodiment of the present invention, of identifying phonemes and recognizing speech by the simplest combination of the segmentation method and the phoneme identification method.
FIG. 3 is a diagram showing an example of identifying phonemes and recognizing speech by using the segmentation method to narrow down the phoneme groups.
FIG. 4 is a diagram showing an example of identifying phonemes and recognizing speech by using a function that expresses the validity of the phoneme identification result with respect to the phoneme group given by the segmentation method.
FIG. 5 is a diagram showing an example of identifying phonemes and recognizing speech by using the segmentation method to narrow down the phoneme groups and by using a function that expresses the validity of the phoneme identification result with respect to the phoneme group given by the segmentation method.
FIG. 6 is a diagram showing an example of the time-delay neural network for 18-consonant identification used in FIGS. 2 and 4.
FIG. 7 is a diagram showing an example of the time-delay neural networks for voiced and voiceless consonant identification used in the embodiments of FIGS. 3 and 5.
FIG. 8 is a table showing the phoneme recognition results obtained with each scheme of the present invention.
In the figures, 1 denotes an amplifier, 2 a low-pass filter, 3 an A/D converter, 4 a processing device, 5 a computer, 6 a magnetic disk, 7 terminals, 8 a printer, 11, 21, and 31 input layers, 12, 22, and 32 intermediate layers, and 13, 23, and 33 output layers.

Claims

[Claims]

(1) A speech recognition device that recognizes input speech, comprising: detection means for detecting, from the input speech, the position or interval of each preset phoneme group; and identification means for identifying, from the input speech, phonemes within a preset phoneme group using a phoneme identification neural network, wherein speech recognition is performed based on the position or interval of each phoneme group detected by the detection means and the phonemes identified by the identification means.

(2) The speech recognition device according to claim 1, wherein the detection means determines the segmentation result and its phoneme group based on acoustic features such as the magnitude of power in a certain frequency band of the input speech, the amount of change in power in a certain frequency band, the amount of change in the spectrum in a certain frequency band, and the ratio of power between one frequency band and another; the phoneme group of the determined segmentation result is used to narrow down the phoneme groups of the phoneme identification neural networks to be applied; and phoneme recognition is performed by applying the phoneme identification neural network corresponding to the narrowed-down phoneme group.

(3) The speech recognition device according to claim 1, wherein the detection means estimates the segmentation and its phoneme group based on acoustic features such as the magnitude of power in a certain frequency band of the input speech, the amount of change in power in a certain frequency band, the amount of change in the spectrum in a certain frequency band, and the ratio of power between one frequency band and another; a phoneme identification neural network is applied to the estimated segmentation and the phoneme group of the estimated segmentation candidate to perform phoneme identification; and phoneme recognition is performed by determining the final phoneme recognition result using a function that expresses the validity of the phoneme identification result with respect to the phoneme group of the segmentation candidate obtained by the detection means.

(4) The speech recognition device according to any one of claims 1 to 3, wherein the detection means estimates the segmentation and its phoneme group based on acoustic features such as the magnitude of power in a certain frequency band of the input speech, the amount of change in power in a certain frequency band, the amount of change in the spectrum in a certain frequency band, and the ratio of power between one frequency band and another; the phoneme group of the estimated segmentation result is used to narrow down the phoneme groups of the phoneme identification neural networks to be applied; phoneme identification is performed by applying the phoneme identification neural network corresponding to the narrowed-down phoneme group; and phoneme recognition is performed by determining the final phoneme recognition result using a function that expresses the validity of the phoneme identification result with respect to the phoneme group of the segmentation candidate obtained by the detection means.