JPH02239290A

JPH02239290A - Voice recognizing device

Info

Publication number: JPH02239290A
Application number: JP1061928A
Authority: JP
Inventors: Yasuhiro Komori; 康弘小森; Koichiro Hatasaki; 畑崎　香一郎
Original assignee: A T R JIDO HONYAKU DENWA KENKYUSHO KK
Current assignee: A T R JIDO HONYAKU DENWA KENKYUSHO KK
Priority date: 1989-03-13
Filing date: 1989-03-13
Publication date: 1990-09-21
Anticipated expiration: 2009-05-25
Also published as: JPH0640274B2

Abstract

PURPOSE: To improve the phoneme recognition rate by recognizing speech based on the position or section of each phoneme group detected by a detecting means and the phoneme discriminated by a discriminating means. CONSTITUTION: This device consists of an amplifier 1, a low-pass filter 2, an A/D converter 3, and a processor 4. The processor 4 consists of a computer 5, a magnetic disk 6, terminals 7, and a printer 8, and the computer 5 recognizes the speech based on the digital speech signal input from the A/D converter 3. That is, the position or section of each phoneme group is detected by the detecting means, and a phoneme within a preset phoneme group is discriminated from the input speech by the discriminating means. Thus, phonemes are recognized with high performance, and a high-performance speech recognition device is constituted.

Description

[Detailed Description of the Invention] [Field of Industrial Application] The present invention relates to a speech recognition device, and in particular to a speech recognition device that performs segmentation by detecting the position or section of each phoneme group in input speech, and recognizes phonemes for the detected position or section.

[Prior Art and Problems to Be Solved by the Invention] Two conventional approaches to speech recognition have been proposed: a method in which a continuous speech waveform is first divided into time segments (segmentation) and phoneme recognition is then performed, and the so-called phoneme spotting method, in which time segmentation of the continuous speech waveform and phoneme recognition of each segment are performed simultaneously.

However, in the former method, segmentation is performed with a uniform combination of simple parameters, such as power and spectral change, regardless of the phonemic environment in which each phoneme occurs, so highly accurate segmentation cannot be achieved. As a result, a high phoneme recognition rate cannot be obtained. The latter method has the drawback that phoneme misrecognitions and insertion errors are frequent near the boundaries of consecutive phonemes, so a high phoneme recognition rate cannot be obtained either.

Therefore, the main object of the present invention is to provide a speech recognition device capable of high phoneme recognition performance by resolving phoneme misrecognition caused by segmentation errors, as well as the phoneme misrecognition and insertion errors at phoneme boundaries that arise in the phoneme spotting method.

[Means for Solving the Problems] The present invention is a speech recognition device that recognizes input speech, comprising detection means for detecting, from the input speech, the position or section of each preset phoneme group, and identification means for identifying, from the input speech, the phonemes within a preset phoneme group.

More preferably, the detection means includes means for detecting the position or section of each phoneme group based on the magnitude of the power of the input speech in a certain frequency band, the amount of change in power in that frequency band, the amount of spectral change in that frequency band, and the ratio of power between that frequency band and another frequency band.
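As an illustration only, the four acoustic measures named above could be computed per frame roughly as follows. This is a minimal sketch, not the patent's implementation: the spectrogram layout (a list of frames, each a list of per-bin power values), the function names, and the use of a Euclidean distance for the spectral change are all assumptions.

```python
import math

def band_power(frame, freqs, lo, hi):
    """Sum the power of the bins whose centre frequency lies in [lo, hi)."""
    return sum(p for p, f in zip(frame, freqs) if lo <= f < hi)

def detection_features(spectrogram, freqs, band, other_band):
    """Return one (power, delta_power, spectral_change, power_ratio) tuple per frame.

    band and other_band are hypothetical (lo_hz, hi_hz) pairs.
    """
    eps = 1e-12  # avoids division by zero when the other band is silent
    feats = []
    prev_power = None
    prev_frame = None
    for frame in spectrogram:
        power = band_power(frame, freqs, *band)
        other = band_power(frame, freqs, *other_band)
        # Change in band power relative to the previous frame (0 for the first frame).
        delta = 0.0 if prev_power is None else power - prev_power
        # Spectral change: Euclidean distance between consecutive in-band spectra
        # (an assumed measure; the source does not name one).
        if prev_frame is None:
            change = 0.0
        else:
            change = math.sqrt(sum(
                (a - b) ** 2
                for a, b, f in zip(frame, prev_frame, freqs)
                if band[0] <= f < band[1]
            ))
        ratio = power / (other + eps)
        feats.append((power, delta, change, ratio))
        prev_power, prev_frame = power, frame
    return feats
```

A detector built on these values would still need thresholds or rules per phoneme group, which the source describes only qualitatively.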

More preferably, the identification means performs identification using a statistical method designed to identify the phonemes within a preset phoneme group.

Furthermore, the phonemes within a preset phoneme group are identified after the position or section of each preset phoneme group has been detected.

[Operation] In the speech recognition device according to the present invention, the detection means detects the position or section of each phoneme group, and the identification means identifies, from the input speech, the phonemes within a preset phoneme group. As a result, high-performance phoneme recognition becomes possible and a high-performance speech recognition device can be constructed.

[Embodiment of the Invention] FIG. 1 is a schematic block diagram of a speech recognition device to which the present invention is applied. Referring to FIG. 1, the speech recognition device includes an amplifier 1, a low-pass filter 2, an A/D converter 3, and a processing device 4. The amplifier 1 amplifies the input speech signal, and the low-pass filter 2 removes aliasing noise from the amplified speech signal.

The A/D converter 3 converts the speech signal into a 16-bit digital signal using a 12 kHz sampling signal. The processing device 4 includes a computer 5, a magnetic disk 6, terminals 7, and a printer 8. Based on the digital speech signal input from the A/D converter 3, the computer 5 performs speech recognition using the methods shown in FIGS. 2 to 5, which are described later.
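For orientation, the sampling parameters stated above imply a 6 kHz Nyquist limit, which is why a low-pass filter precedes the converter. The following sketch illustrates that arithmetic and a plain 16-bit quantization; the function names and the assumption that samples are normalized to [-1.0, 1.0] are illustrative and not taken from the source.

```python
SAMPLE_RATE_HZ = 12_000  # sampling rate stated in the description
BITS = 16                # word length stated in the description

def nyquist_hz(sample_rate_hz=SAMPLE_RATE_HZ):
    """Highest frequency representable without aliasing at this sampling rate."""
    return sample_rate_hz / 2

def quantize(sample, bits=BITS):
    """Map a sample in [-1.0, 1.0] to a signed integer code, clipping out-of-range input."""
    max_code = 2 ** (bits - 1) - 1
    min_code = -(2 ** (bits - 1))
    code = round(sample * max_code)
    return max(min_code, min(max_code, code))
```

With these values, components above 6 kHz must be attenuated before conversion, and the 4000 Hz to 6000 Hz band discussed later lies just within the representable range.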

FIG. 2 shows the procedure for detecting a section for each phoneme group according to an embodiment of the present invention, FIG. 3 shows an example of a spectrogram, FIG. 4 shows recognition results, and FIG. 5 shows an example of identifying phonemes using a neural network.

Next, the specific operation of an embodiment of the present invention will be described with reference to FIGS. 1 to 5.

The digitized phoneme spectrum is supplied from the A/D converter 3 shown in FIG. 1 to the computer 5. In step SP1 (steps are abbreviated SP in the figure), the computer 5 refers to rough phoneme features on the spectrogram based on the input phoneme spectrum. FIG. 3 is the spectrogram of the utterance [sukunakutomo]; the vertical axis shows frequency and the horizontal axis shows elapsed time. In this spectrogram, dark areas indicate high power, and the power decreases as the shading becomes lighter.

In step SP2 of FIG. 2, phoneme candidates are detected. That is, based on the phoneme features referred to in step SP1, phoneme candidates are detected by using rough features to find the rough position of each phoneme group. The phoneme groups here are, for example, voiceless fricatives, voiced plosives, nasals, and liquids.

In the spectrogram shown in FIG. 3, in the section from 335 msec to 492 msec, which corresponds to /s/, the power in the 4000 Hz to 6000 Hz frequency band is large, the power in the frequency band around 1000 Hz to 2000 Hz is small, and the cutoff point is near 5000 Hz. The section is therefore judged to be close to a voiceless fricative or a voiced fricative, and both the voiceless fricative and the voiced fricative are taken as phoneme candidates.
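The kind of rough rule described for the /s/ section can be sketched as follows. This is an assumption-laden illustration: the thresholds `min_high` and `min_ratio`, the frame length, and both function names are hypothetical, and the source does not give numeric decision criteria.

```python
def is_fricative_candidate(high_band_power, low_band_power,
                           min_high=1.0, min_ratio=4.0):
    """Flag a frame whose 4000-6000 Hz power is large and 1000-2000 Hz power is small.

    min_high and min_ratio are hypothetical thresholds, not values from the source.
    """
    if high_band_power < min_high:
        return False
    return high_band_power >= min_ratio * low_band_power

def candidate_sections(frames, frame_ms=5.0, **thresholds):
    """Merge consecutive flagged frames into (start_ms, end_ms) sections.

    frames is a list of (high_band_power, low_band_power) pairs, one per frame.
    """
    sections = []
    start = None
    for i, (high, low) in enumerate(frames):
        flagged = is_fricative_candidate(high, low, **thresholds)
        if flagged and start is None:
            start = i                      # a candidate section opens here
        elif not flagged and start is not None:
            sections.append((start * frame_ms, i * frame_ms))
            start = None                   # the section closes before this frame
    if start is not None:
        sections.append((start * frame_ms, len(frames) * frame_ms))
    return sections
```

Such a rule only proposes candidates; in the described procedure the boundaries are refined later by hypothesis verification.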

Next, following /s/, a voiceless plosive is taken as the phoneme candidate corresponding to /k/, based on the changes in power, the changes in the spectrum, and so on in the section from 492 msec to 562 msec.

Next, in step SP3, the phonemic environment is hypothesized. That is, for each phoneme candidate detected in step SP2, the preset types of preceding and following phonemes and the phoneme variations are hypothesized.

That is, the types of the phonemes preceding and following each of the voiceless fricative and the voiced fricative detected in step SP2 are hypothesized. For the voiceless fricative /s/, silence, a stop, and a vowel are hypothesized as the preceding phoneme, and a stop, silence, a vowel, and a fricative are hypothesized as the following phoneme. For the voiced fricative detected in step SP2, the types of the preceding and following phonemes are likewise hypothesized: silence and a vowel as the preceding phoneme, and a vowel as the following phoneme.
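The hypothesized environments listed above map naturally onto a small lookup table. The sketch below is one possible representation; the dictionary layout, the key names, and the function `enumerate_hypotheses` are assumptions, while the table contents follow the description.

```python
# Preset phonemic environments per candidate group, as listed in the description.
ENVIRONMENTS = {
    "voiceless_fricative": {
        "preceding": ["silence", "stop", "vowel"],
        "following": ["stop", "silence", "vowel", "fricative"],
    },
    "voiced_fricative": {
        "preceding": ["silence", "vowel"],
        "following": ["vowel"],
    },
}

def enumerate_hypotheses(candidates):
    """Expand candidate groups into (group, side, context) hypotheses to verify."""
    hypotheses = []
    for group in candidates:
        for side in ("preceding", "following"):
            for context in ENVIRONMENTS[group][side]:
                hypotheses.append((group, side, context))
    return hypotheses

hypotheses = enumerate_hypotheses(["voiceless_fricative", "voiced_fricative"])
```

Each resulting tuple corresponds to one boundary hypothesis that the subsequent verification step would score with a confidence value.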

For each phonemic environment hypothesized in step SP3, possible phoneme boundaries are detected and the hypothesis is verified. Under a correct hypothesis, a high confidence is obtained and, as a result, the phonemic environment is detected. Conversely, under an incorrect hypothesis the confidence is low and no phonemic environment is obtained. Whether a hypothesis is correct is judged from the acoustic features on the spectrogram, that is, from the magnitude of the power of the input speech in a certain frequency band, the amount of change in power, the amount of spectral change, and the ratio of power relative to another frequency band.

In step SP5, among the sections determined for each phoneme group, the section with the highest confidence is taken as the final segmentation and phoneme group result. For this final segmentation result, the corresponding phoneme group is identified in step SP6. For the silence hypothesis of step SP3, the result is that the voiceless fricative starts at 335 msec with a confidence (cf) of 0.64; no result is obtained for the vowel hypothesis; and for the stop hypothesis, the result is a start at 325 msec with a confidence of 0.60. For the plosive hypothesis, the boundary at 492 msec where the plosive starts is taken as the end of /s/, with a confidence of 0.66.

In step SP6, the result with the highest confidence is selected, and in step SP7, /s/ is identified as starting at 335 msec and ending at 492 msec. The segmentation is thereby determined and, at the same time, the phoneme group is identified.
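The selection of the highest-confidence result can be sketched with the values quoted in the description. The record layout and the function name `best_hypothesis` are assumptions; the boundary times and confidences are the ones given for the /s/ example.

```python
def best_hypothesis(results):
    """Return the verified hypothesis with the highest confidence, or None.

    Hypotheses that produced no result carry a confidence of None and are skipped.
    """
    verified = [r for r in results if r["confidence"] is not None]
    if not verified:
        return None
    return max(verified, key=lambda r: r["confidence"])

# Start-boundary hypotheses for /s/ (preceding context), values from the description.
start_results = [
    {"context": "silence", "boundary_ms": 335, "confidence": 0.64},
    {"context": "vowel", "boundary_ms": None, "confidence": None},  # no result obtained
    {"context": "stop", "boundary_ms": 325, "confidence": 0.60},
]
# End-boundary hypothesis for /s/ (following plosive), values from the description.
end_results = [
    {"context": "plosive", "boundary_ms": 492, "confidence": 0.66},
]

segment = (
    best_hypothesis(start_results)["boundary_ms"],
    best_hypothesis(end_results)["boundary_ms"],
)
```

Here `segment` evaluates to `(335, 492)`, matching the section reported for /s/.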

Next, the method of identifying the phonemes of the detected segmentation will be described with reference to FIG. 5. In the time-delay neural network shown in FIG. 5, 18 consonants are grouped into six classes (voiced plosives, voiceless plosives, nasals, voiced fricatives, voiceless fricatives, and liquids), and each group is used as the input layer 10. The input layer 10 identifies the segmented phonemes through training with the conventionally known back-propagation method.

Identification of each class is performed by the input layer 11.

The time-delay neural network is trained with the end position of every consonant aligned to the point 100 msec from the front of the 150 msec input layer 10. Likewise, in phoneme identification, the end of the segmentation result is applied to the same position of the input layer 10, and the phoneme for which the output layer 12 of the time-delay neural network gives the maximum confidence is taken as the identification result. FIG. 4 shows an example of this identification result.
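The alignment and the final decision can be illustrated without the network itself. In this sketch the function names and the example scores are hypothetical; only the 150 msec window length, the 100 msec alignment point, and the choice of the maximum-confidence output come from the description.

```python
def input_window(end_ms, window_ms=150, end_offset_ms=100):
    """Place a window so the segment end sits end_offset_ms after the window start."""
    start = end_ms - end_offset_ms
    return start, start + window_ms

def identify(output_scores):
    """Return the phoneme label whose output confidence is largest."""
    return max(output_scores, key=output_scores.get)

# The /s/ segment in the description ends at 492 msec.
window = input_window(492)

# Hypothetical output-layer confidences, for illustration only.
scores = {"s": 0.90, "sh": 0.06, "h": 0.04}
label = identify(scores)
```

For the 492 msec end point, the window spans 392 msec to 542 msec, so the network also sees 50 msec of context after the segment boundary.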

In the position detection of the embodiment described above, phoneme groups and their sections were used. Besides this method, however, it is also possible to use a phoneme group having a certain feature together with the position of that feature, for example a phoneme group characterized by a burst together with the burst position, or a phoneme group characterized by a local power dip together with the position of the dip.

Also, although a time-delay neural network is used in the phoneme identification scheme shown in FIG. 5, other general statistical methods for identifying the phonemes within a phoneme group may be used instead. For example, phoneme identification by a general neural network, by HMMs, by Bayes' rule, by linear discrimination, or by standard patterns designed with a method such as LVQ is applicable.
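As one concrete instance of the alternatives listed, identification by Bayes' rule within a single phoneme group might look like the following. This is a simplified sketch under stated assumptions: a single scalar feature, one Gaussian per phoneme, and made-up model parameters; none of these details are given in the source.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def bayes_identify(x, models):
    """Pick the phoneme with the largest posterior P(phoneme | x).

    models maps a phoneme label to (prior, mean, variance); all values here
    are hypothetical and would normally be estimated from training data.
    """
    joint = {
        label: prior * gaussian_pdf(x, mean, var)
        for label, (prior, mean, var) in models.items()
    }
    total = sum(joint.values())
    posteriors = {label: value / total for label, value in joint.items()}
    return max(posteriors, key=posteriors.get), posteriors

# Hypothetical two-phoneme group with equal priors.
models = {"s": (0.5, 5.0, 1.0), "h": (0.5, 1.0, 1.0)}
best, posteriors = bayes_identify(4.5, models)
```

Because the classifier only has to separate phonemes inside one group, its models can be tuned to the distinctions that matter within that group, which is the design motivation stated in the description.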

[Effects of the Invention] As described above, according to the present invention, the position or section of each preset phoneme group is detected from the input speech and the phonemes within the preset phoneme group are identified. This resolves the phoneme misrecognition caused by segmentation errors in the conventional approach, as well as the phoneme misrecognition and insertion errors at phoneme boundaries caused by phoneme spotting, and thereby improves phoneme recognition performance.

[Brief explanation of drawings]

FIG. 1 is a block diagram showing the overall configuration of a speech recognition device to which an embodiment of the present invention is applied. FIG. 2 is a diagram showing the procedure for detecting a section for each phoneme group according to an embodiment of the present invention. FIG. 3 is a diagram showing an example of a spectrogram. FIG. 4 is a diagram showing speech recognition results. FIG. 5 is a diagram showing an example of speech recognition using a time-delay neural network. In the figures, 1 denotes an amplifier, 2 a low-pass filter, 3 an A/D converter, 4 a processing device, 5 a computer, 6 a magnetic disk, 7 terminals, and 8 a printer. Patent applicant: A.T.R. Jido... K.K. Subject of amendment (August 2, 1989): FIG. 4 of the drawings.

Claims

[Claims]

(1) A speech recognition device that recognizes input speech, comprising: detection means for detecting, from the input speech, a position or section of each preset phoneme group; and identification means for identifying, from the input speech, phonemes within the preset phoneme group, characterized in that speech recognition is performed based on the position or section of each phoneme group detected by the detection means and the phoneme identified by the identification means.

(2) The speech recognition device according to claim 1, wherein the detection means detects the position or section of each phoneme group based on the magnitude of the power of the input speech in a certain frequency band, the amount of change in power in a certain frequency band, the amount of spectral change in a certain frequency band, and the ratio of power between a certain frequency band and a certain other frequency band.

(3) The speech recognition device according to claim 1, wherein the identification means uses a statistical method designed to identify phonemes within a preset phoneme group.

(4) The speech recognition device according to claim 1, wherein the identification means identifies the phonemes within the preset phoneme group after the detection means has detected the position or section of each preset phoneme group.