JPH04281500A

JPH04281500A - Voiced sound phoneme recognition method

Info

Publication number: JPH04281500A
Application number: JP3045194A
Authority: JP
Inventors: Hiroyuki Noto; 広之野戸
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 1991-03-11
Filing date: 1991-03-11
Publication date: 1992-10-07

Abstract

PURPOSE: To improve phoneme recognition performance by providing a method in which the feature extraction process applied to the speech waveform is optimized for each phoneme to be recognized. CONSTITUTION: A neural network is first prepared and trained so that, when a voiced phoneme waveform is input, the output corresponding to the input phoneme produces an impulse train with the same period as the fundamental period of that phoneme. An unknown input speech waveform, whose magnitude has been normalized by its power over a prescribed time interval, is input to the neural network (S20); the peaks of the network's multiple output series, obtained by shifting the applied input waveform slightly in time, are detected (S22); and the phoneme corresponding to the output with the largest peak value is taken as the phoneme recognition result (S23).

Description

[Detailed description of the invention]

[0001]

[Field of Industrial Application] The present invention relates to a voiced phoneme recognition method in speech recognition technology, in which voiced phonemes are recognized using a neural network, or a group of neural networks, that has undergone learning processing.

[0002]

[Prior Art] Various voiced phoneme recognition methods have been proposed. As one example, the voiced phoneme recognition method shown in FIG. 2 is described below.

[0003] FIG. 2 is a flowchart showing the processing procedure of a conventional voiced phoneme recognition method.

[0004] In this voiced phoneme recognition method, step S1 performs speech waveform extraction, in which the speech signal to be recognized is multiplied by a window function to take out the signal in a fixed time region at the desired time. Step S2 then calculates the linear prediction coefficients from the autocorrelation function of the speech waveform. Finally, step S3 performs pattern matching using the linear prediction coefficients obtained in step S2 as the feature quantity, and outputs the phoneme of the standard pattern with the minimum distance as the phoneme recognition result.

[0005] The content of each of the processing steps S1 to S3 is described below.

[0006] First, in step S1, a part of the unknown input speech is extracted. Let $s(m)$ be a speech waveform, discrete in the time domain, whose phoneme is unknown, and let $w(m)$ be a suitable window function, where $m$ is discrete time. Let $n$ be the desired discrete time in the speech waveform at which the phoneme type is to be determined. The speech waveform $s_n(m)$ at the desired time $n$ is then obtained by equation (1):

$$s_n(m) = s(m+n)\,w(m), \qquad 0 \le m \le N-1 \tag{1}$$

where $N$ is the length of the desired window function.

In step S2, the autocorrelation function $R_n(k)$ of the speech waveform at the desired time $n$ is obtained by equation (2):

$$R_n(k) = \sum_{m=0}^{N-1-k} s_n(m)\,s_n(m+k) \tag{2}$$

where $k$ is the lag. The linear prediction coefficients are then obtained. For example, with Durbin's recursive method, the linear prediction coefficients $\alpha_j$ (where $j$ denotes the $j$-th prediction coefficient) can be obtained from equations (3) to (8), in which $R(k)$ stands for $R_n(k)$:

$$E^{(0)} = R(0) \tag{3}$$

$$k_i = \frac{R(i) - \sum_{j=1}^{i-1} \alpha_j^{(i-1)} R(i-j)}{E^{(i-1)}}, \qquad 1 \le i \le p \tag{4}$$

$$\alpha_i^{(i)} = k_i \tag{5}$$

$$\alpha_j^{(i)} = \alpha_j^{(i-1)} - k_i\,\alpha_{i-j}^{(i-1)}, \qquad 1 \le j \le i-1 \tag{6}$$

$$E^{(i)} = \left(1 - k_i^{2}\right) E^{(i-1)} \tag{7}$$

where $p$ is the order of the linear prediction and may be chosen arbitrarily.

The linear prediction coefficients $\alpha_j$ are calculated as follows. First, $E^{(0)}$ is obtained from equation (3). Next, equations (4) to (7) are evaluated, and $\alpha_j^{(i)}$ is obtained recursively, in order, over the range $1 \le i \le p$. The linear prediction coefficients $\alpha_j$ are then given by equation (8):

$$\alpha_j = \alpha_j^{(p)}, \qquad 1 \le j \le p \tag{8}$$

Finally, in step S3, pattern matching against the phoneme standard patterns is performed. For this purpose, the linear prediction coefficients $\alpha_j$ are obtained in advance, phoneme by phoneme, for many speech waveforms whose phonemes are known, and representative values are stored as standard patterns. Here, the mean of the linear prediction coefficients $\alpha_j$ for phoneme $\phi$ is calculated in advance as the standard pattern $\alpha_j^{\phi}$. At recognition time, the distance $D_x^{\phi}$ between the linear prediction coefficients $\alpha_j$ of the unknown input speech and the standard pattern $\alpha_j^{\phi}$ of phoneme $\phi$ is measured by equation (9).

[0007]

[Math 1]

[0008] The phoneme $\phi$ for which this distance $D_x^{\phi}$ is smallest is output as the voiced phoneme recognition result at the desired time $n$.
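The conventional procedure of steps S1 to S3 can be sketched as follows. This is a minimal Python illustration rather than the implementation of the cited method: the Hamming window, all function names, and the squared Euclidean distance standing in for equation (9) (which is not reproduced in this text) are assumptions.

```python
import math

def extract_frame(s, n, N):
    """Step S1, equation (1): s_n(m) = s(m + n) * w(m) for 0 <= m <= N - 1.
    A Hamming window is assumed for w(m); the source leaves the window open."""
    w = [0.54 - 0.46 * math.cos(2.0 * math.pi * m / (N - 1)) for m in range(N)]
    return [s[m + n] * w[m] for m in range(N)]

def autocorrelation(frame, p):
    """Step S2: short-time autocorrelation R(k) for lags k = 0..p."""
    N = len(frame)
    return [sum(frame[m] * frame[m + k] for m in range(N - k))
            for k in range(p + 1)]

def durbin(R, p):
    """Durbin's recursion, equations (3) to (8); returns [alpha_1, ..., alpha_p]."""
    E = R[0]                                   # (3)
    alpha = [0.0] * (p + 1)                    # alpha[j] holds alpha_j^(i-1)
    for i in range(1, p + 1):
        k = (R[i] - sum(alpha[j] * R[i - j] for j in range(1, i))) / E  # (4)
        new = alpha[:]
        new[i] = k                             # (5)
        for j in range(1, i):
            new[j] = alpha[j] - k * alpha[i - j]   # (6)
        alpha = new
        E = (1.0 - k * k) * E                  # (7)
    return alpha[1:]                           # (8)

def nearest_phoneme(alpha, standard_patterns):
    """Step S3: phoneme whose standard pattern is closest to alpha.
    Squared Euclidean distance is an assumed stand-in for equation (9)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(standard_patterns,
               key=lambda phi: dist(alpha, standard_patterns[phi]))
```

With an autocorrelation sequence of the form R(k) = 0.5^k, for example, `durbin([1.0, 0.5, 0.25], 2)` yields a first coefficient of 0.5 and a second coefficient of zero, as expected for a first-order process.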

[0009]

[Problems to Be Solved by the Invention] However, the conventional voiced phoneme recognition method has the following problems.

[0010] In the conventional method, a specific feature extraction process such as linear predictive analysis or a band-pass filter bank is applied to the speech waveform, and phoneme recognition is performed using the feature quantities obtained by that process. However, when the feature extraction process is fixed to a single method regardless of the phoneme, and the optimal feature extraction differs from phoneme to phoneme, the recognition performance for each phoneme is limited by the fact that the feature extraction method is not optimal for it.

[0011] For example, the conventional voiced phoneme recognition method obtains feature quantities for each fixed time unit called a frame, and the frame length is known to have a large effect on phoneme recognition performance. This means that, however good the recognition method applied after feature extraction may be, recognition performance is degraded either by loss of phoneme information in the feature extraction method or by an overall loss of performance caused by an unsuitable combination of feature extraction method and recognition method. As a result, it has been difficult to obtain highly accurate phoneme recognition results.

[0012] The present invention provides a voiced phoneme recognition method that solves the problem of the prior art described above, namely that when feature extraction is fixed to a specific processing method, the feature extraction best suited to voiced phoneme recognition cannot necessarily be applied, and phoneme recognition performance is degraded.

[0013]

[Means for Solving the Problems] To solve the above problem, the first invention, in a voiced phoneme recognition method for recognizing voiced phonemes, uses a neural network trained by the error backpropagation method as follows. As the presentation signal for learning, a signal value is input in which the magnitude of a voiced phoneme signal has been normalized by its power over a fixed time interval. The network is trained so that the output corresponding to the input phoneme takes a large value when a predetermined position within the input time interval coincides with an epoch point of the input signal, and takes a small value otherwise.

[0014] A signal value in which the magnitude of an input speech waveform with an unknown phoneme has been normalized by its power over a fixed time interval is then input to the neural network. The peaks of the multiple output series of the neural network, obtained by shifting the time of the applied input speech waveform little by little, are detected. A phoneme recognition decision is made on the basis of the peak detection results, and the phoneme corresponding to the output that produces the largest peak value is taken as the phoneme recognition result.

[0015] The second invention uses a neural network group consisting of a plurality of neural networks in place of the neural network of the first invention.

[0016] After learning processing has been applied to this neural network group, a signal value in which the magnitude of an input speech waveform with an unknown phoneme has been normalized by its power over a fixed time interval is input to the neural network group. The peaks of each output series of the neural network group, obtained by shifting the time of the applied input speech waveform little by little, are detected. A phoneme recognition decision is made on the basis of the peak detection results, and the phoneme corresponding to the neural network that produces the largest peak value is taken as the phoneme recognition result.

[0017]

[Operation] According to the first invention, a neural network is prepared in advance and trained so that, when a voiced phoneme waveform is input, the output corresponding to the input phoneme produces an impulse train with the same period as the fundamental period of the input phoneme. When an input speech waveform signal with an unknown phoneme is input to this neural network, the phoneme corresponding to the output that produces the largest impulse train is taken as the phoneme recognition result.

[0018] In this way, the processing from waveform input through feature extraction to phoneme recognition is carried out as a single integrated method, and the feature extraction parameters that determine the characteristics of that method are optimized by the neural network and its learning processing. The feature extraction therefore becomes the best method for each phoneme to be recognized, and phoneme recognition performance can be improved.

[0019] According to the second invention, a plurality of neural networks, one for each recognition target, are prepared in advance and trained so that, when a voiced phoneme waveform is input, the output corresponding to the input phoneme produces an impulse train with the same period as the fundamental period of the input phoneme. When an input speech waveform signal with an unknown phoneme is input to this group of neural networks, the phoneme corresponding to the neural network that produces the largest impulse train is taken as the phoneme recognition result. Compared with the first invention, the amount of processing increases, but phoneme recognition performance can be improved further. The above problem can therefore be solved.

[0020]

[Embodiments] First embodiment. FIG. 3 is a functional block diagram of a voiced phoneme recognition device showing the first embodiment of the present invention.

[0021] This voiced phoneme recognition device has a waveform input unit 10 that stores the input signal waveform. A neural network 20 that performs learning and recognition processing is connected to its output side. The neural network 20 is composed of a large number of processing units (cells) 21, and a recognition evaluation unit 30, which evaluates the outputs and outputs the recognition result, is connected to the output side of the neural network 20.

[0022] A learning control unit 40 is connected to the neural network 20, and a teacher signal input unit 50, which inputs the teacher signal during learning, is connected to the learning control unit 40. The learning control unit 40 has the function of evaluating the output error of the neural network 20 during learning and changing the connection weight coefficients of the neural network 20.

[0023] This voiced phoneme recognition device may be realized in any way: computer software, dedicated digital circuits, analog circuits, photoelectric elements and the like may be used individually or in combination. In this embodiment, computer software is used.

[0024] FIG. 1 is a flowchart showing the processing procedure of the voiced phoneme recognition method using the device of FIG. 3.

[0025] The processing procedure of the voiced phoneme recognition method of this embodiment consists of the learning procedure for the neural network shown in FIG. 1(a) and the phoneme recognition procedure using the trained neural network shown in FIG. 1(b). The (I) learning processing and the (II) phoneme recognition processing are described below.

[0026] (I) Learning processing of the neural network

As shown in FIG. 1(a), the learning processing of the neural network consists of initialization for learning (step S10), presentation signal input processing, which inputs the signal applied to the network for learning (hereinafter called the presentation signal) (step S11), forward propagation through the neural network (step S12), calculation of the output error of the neural network (step S13), error backpropagation learning of the neural network (step S14), and a learning termination decision (step S15). The content of each of steps S10 to S15 is described below with reference to FIGS. 3 and 4.

[0027] FIG. 4 shows an example of a speech waveform used as the presentation signal in the first embodiment. In FIG. 4, 60 is the presentation signal, 61 is the position of the epoch point that gives the peak of the teacher signal, 62 is an example of the time interval of the presentation signal when 0.1 is given as the teacher signal, and 63 is an example of the time interval of the presentation signal when 0.9 is given as the teacher signal.

[0028] In the learning processing, a signal waveform that is discrete in the time domain is denoted $s(m)$, and the presentation signal in particular is denoted $as(m)$. In this embodiment, a vowel waveform uttered by a male speaker and sampled at 12 kHz with 12 bits is used as the presentation signal 60. The position 61 of the epoch point that gives the peak of the teacher signal (hereinafter called the teacher epoch point) is set in advance for the presentation signal 60 by human visual inspection.

[0029] In step S10 of FIG. 1(a), as initialization, the variable $p$, which represents the number of learning iterations performed, is set to 1.

[0030] In step S11, the waveform input unit 10 takes the presentation signal 60 from an interval centered on the teacher epoch point. The time interval of the presentation signal in this case is indicated by reference numeral 63 in FIG. 4. The length of this time interval 63 is a number of samples equal to the number of cells in the input layer of the neural network 20. In this embodiment, 512 samples are used as the presentation signal. The phoneme $\phi$ of the presentation signal is treated in this embodiment as a number from 0 to 4 corresponding to the five vowel phonemes, and the presentation signal for phoneme number $\phi$ in this case is denoted $as_{\phi}^{0.9}(m)$, where $0 \le m \le 511$.

[0031] The waveform input unit 10 power-normalizes the presentation signal $as_{\phi}^{0.9}(m)$ by equation (10) and adds an offset, and sets the result as the output $o_{pj}^{(0)}$ of each cell 21 of the input layer of the neural network 20.

[0032]

[Math 2]

[0033] where $o_{pj}^{(q)}$ is the output of the $j$-th cell 21 in the $q$-th layer for the $p$-th pattern, and $C$ is a positive constant for normalization. Here, the presentation signal $as_{\phi}^{0.9}(m)$ is taken as the first pattern, and the input layer is taken as the 0th layer.
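The input preparation of step S11 can be illustrated with a short Python sketch. Equation (10) is not reproduced in this text, so the scaling used here (division by the root-mean-square value of the segment, multiplication by a constant `C`, then addition of an offset) is only an assumed form; the function names and the values of `C` and `offset` are hypothetical.

```python
import math

def extract_segment(signal, center, length=512):
    """Take `length` samples centred on index `center`.
    Samples that fall outside the signal are treated as zero (an assumption)."""
    start = center - length // 2
    return [signal[i] if 0 <= i < len(signal) else 0.0
            for i in range(start, start + length)]

def power_normalize(segment, C=0.1, offset=0.5):
    """Scale a segment by its RMS value over the interval, then add an offset.
    C and offset are hypothetical values, not taken from the source."""
    power = sum(x * x for x in segment) / len(segment)
    if power == 0.0:
        return [offset] * len(segment)   # silent segment: nothing to scale
    rms = math.sqrt(power)
    return [C * x / rms + offset for x in segment]
```

Whatever the exact form of equation (10), the purpose of this step is that two segments differing only in loudness produce the same input to the network, which the sketch above preserves.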

[0034] In step S12, forward propagation through the neural network 20 is performed. The neural network 20 in this embodiment has a three-layer structure in which the input layer is the 0th layer, the first layer is the intermediate layer, and the second layer is the output layer. The output of the $q$-th layer is calculated by equation (11).

[0035]

[Math 3]

[0036] Here, $o_{pj}^{(q)}$ is the output of the $j$-th cell 21 in the $q$-th layer when the $p$-th pattern is presented. $N_q$ is the number of cells 21 in the $q$-th layer, $w_{ji}^{(q)}$ is the weight coefficient from the $i$-th cell 21 of the $(q-1)$-th layer, which is the input to the $q$-th layer, to the $j$-th cell 21 of the $q$-th layer, and $\theta_j^{(q)}$ is the bias of the $j$-th cell 21 of the $q$-th layer. In this embodiment, the numbers of cells in the layers are $N_0 = 512$, $N_1 = 64$ and $N_2 = 5$. Before learning, $w_{ji}^{(q)}$ and $\theta_j^{(q)}$ are set to small random values.

[0037] Equation (11) is evaluated in order for all $q$ and $j$, which yields the output $o_{pj}^{(2)}$ of the cells 21 of the output layer, that is, the second layer.
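The forward propagation of step S12 through the 512-64-5 network can be sketched in Python as below. Equation (11) is not reproduced in this text; the sketch assumes the usual form, a weighted sum of the previous layer's outputs plus the bias passed through the sigmoid function, and the function names and the weight range `scale` are hypothetical.

```python
import math
import random

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def init_layer(n_out, n_in, rng, scale=0.1):
    """Small random weights w[j][i] and biases theta[j], as set before learning."""
    w = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    theta = [rng.uniform(-scale, scale) for _ in range(n_out)]
    return w, theta

def forward(inputs, layers):
    """Return the outputs of every layer; outputs[0] is the input layer."""
    outputs = [list(inputs)]
    for w, theta in layers:
        prev = outputs[-1]
        outputs.append([sigmoid(sum(wji * oi for wji, oi in zip(row, prev)) + th)
                        for row, th in zip(w, theta)])
    return outputs

rng = random.Random(0)
layers = [init_layer(64, 512, rng), init_layer(5, 64, rng)]   # N0=512, N1=64, N2=5
outs = forward([0.5] * 512, layers)   # outs[2] holds the five phoneme outputs
```

Keeping the outputs of every layer, not only the last, is a deliberate choice: the intermediate outputs are needed again when the errors are propagated backwards in step S14.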

[0038] In step S13, the learning control unit 40 calculates the output error. Before this, the teacher signal input unit 50 sets the teacher signals of the output cells 21 for the presentation signal: 0.9 for the output cell 21 corresponding to the phoneme $\phi$ of the presentation signal, and 0.1 for the output cells 21 corresponding to the other phonemes. That is, with $t_{pj}$ denoting the teacher signal for the $p$-th presentation signal,

$$t_{p\phi} = 0.9, \qquad t_{pj} = 0.1 \quad (j \ne \phi) \tag{12}$$

[0039] The learning control unit 40 compares the teacher signal from the teacher signal input unit 50 with the output of the neural network 20 as follows. Let $\delta_{pj}^{(q)}$ denote the internal error of the $j$-th cell 21 of the $q$-th layer for the $p$-th presentation signal (that is, the error before the transformation by what is generally called the sigmoid function). The internal error $\delta_{pj}^{(2)}$ of the cells in the output layer is calculated by equation (13):

$$\delta_{pj}^{(2)} = \left(t_{pj} - o_{pj}^{(2)}\right) o_{pj}^{(2)} \left(1 - o_{pj}^{(2)}\right) \tag{13}$$

In step S14, the learning control unit 40 performs error backpropagation learning for the neural network 20. Once the internal error $\delta_{pj}^{(q)}$ of each cell 21 in the $q$-th layer has been calculated, the internal error $\delta_{pj}^{(q-1)}$ of each cell 21 in the $(q-1)$-th layer is calculated by equation (14):

$$\delta_{pj}^{(q-1)} = o_{pj}^{(q-1)} \left(1 - o_{pj}^{(q-1)}\right) \sum_{k} \delta_{pk}^{(q)}\, w_{kj}^{(q)} \tag{14}$$

Furthermore, using the internal error $\delta_{pj}^{(q)}$, the correction $\Delta_p w_{ji}^{(q)}$ to the weight coefficient $w_{ji}^{(q)}$ from the $(q-1)$-th layer to the $q$-th layer is calculated by equation (15):

$$\Delta_p w_{ji}^{(q)} = \eta\, \delta_{pj}^{(q)}\, o_{pi}^{(q-1)} \tag{15}$$

To speed up learning, equation (15-1) may be used instead:

$$\Delta_p w_{ji}^{(q)} = \alpha\, \Delta_{p-1} w_{ji}^{(q)} + \eta\, \delta_{pj}^{(q)}\, o_{pi}^{(q-1)} \tag{15-1}$$

In equation (15-1), $\alpha$ and $\eta$ are constants that determine the learning speed. When equation (15-1) is used, $\Delta_0 w_{ji}^{(q)} = 0$ is set in the initialization step S10.

The correction $\Delta_p \theta_j^{(q)}$ to the bias $\theta_j^{(q)}$ of the $j$-th cell 21 of the $q$-th layer is likewise calculated by equation (16):

$$\Delta_p \theta_j^{(q)} = \eta\, \delta_{pj}^{(q)} \tag{16}$$

To speed up learning, equation (16-1) may be used instead:

$$\Delta_p \theta_j^{(q)} = \alpha\, \Delta_{p-1} \theta_j^{(q)} + \eta\, \delta_{pj}^{(q)} \tag{16-1}$$

In equation (16-1), $\alpha$ is a constant that determines the learning speed. When equation (16-1) is used, $\Delta_0 \theta_j^{(q)} = 0$ is set in the initialization step S10.

[0040] The calculations of equations (15) and (16) are carried out for the output layer and all intermediate layers while decrementing the layer number $q$, giving the corrections $\Delta_p w_{ji}^{(q)}$ and $\Delta_p \theta_j^{(q)}$ for all weight coefficients $w_{ji}^{(q)}$ and biases $\theta_j^{(q)}$. After all corrections $\Delta_p w_{ji}^{(q)}$ and $\Delta_p \theta_j^{(q)}$ have been calculated, they are used to update all weight coefficients $w_{ji}^{(q)}$ and biases $\theta_j^{(q)}$ by equation (17):

$$w_{ji}^{(q)} = w_{ji}^{(q)} + \Delta_p w_{ji}^{(q)}, \qquad \theta_j^{(q)} = \theta_j^{(q)} + \Delta_p \theta_j^{(q)} \tag{17}$$

The above processing is carried out with $p = 1$ for the presentation signal $as_{\phi}^{0.9}(m)$. In this case, the teacher signal $t_{pj}$ given by equation (12) is used.
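Steps S12 to S14 for a single presentation can be sketched in Python as follows. This is an illustration under stated assumptions, not the embodiment's code: the sigmoid forward pass stands in for equation (11), the layer-to-layer error propagation uses the standard backpropagation form for equation (14), and the function names and the default values of `eta` and `alpha` are hypothetical. With `alpha=0.0` the update reduces to equations (15) and (16); with a non-zero `alpha` it corresponds to (15-1) and (16-1).

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def forward(x, layers):
    """outs[0] is the input layer; outs[q] is the output of layer q."""
    outs = [list(x)]
    for w, theta in layers:
        prev = outs[-1]
        outs.append([sigmoid(sum(wji * oi for wji, oi in zip(row, prev)) + th)
                     for row, th in zip(w, theta)])
    return outs

def train_step(x, target, layers, prev=None, eta=0.5, alpha=0.0):
    """One presentation. layers[q-1] = (w, theta) of layer q, with w[j][i]
    linking cell i of layer q-1 to cell j of layer q. Weights are updated in
    place. Returns (network output before the update, corrections applied)."""
    outs = forward(x, layers)
    if prev is None:                      # Delta_0 = 0, as set in step S10
        prev = [([[0.0] * len(w[0]) for _ in w], [0.0] * len(th))
                for w, th in layers]
    # (13): internal error of the output layer
    delta = [(t - o) * o * (1.0 - o) for t, o in zip(target, outs[-1])]
    corrections = [None] * len(layers)
    for idx in range(len(layers) - 1, -1, -1):   # output layer first
        w, _ = layers[idx]
        below = outs[idx]                 # outputs of the layer feeding this one
        pw, pth = prev[idx]
        dw = [[alpha * pw[j][i] + eta * delta[j] * below[i]      # (15)/(15-1)
               for i in range(len(below))] for j in range(len(w))]
        dth = [alpha * pth[j] + eta * delta[j]                   # (16)/(16-1)
               for j in range(len(w))]
        corrections[idx] = (dw, dth)
        if idx > 0:   # assumed standard form of (14), using pre-update weights
            delta = [below[i] * (1.0 - below[i])
                     * sum(delta[j] * w[j][i] for j in range(len(w)))
                     for i in range(len(below))]
    for (w, th), (dw, dth) in zip(layers, corrections):          # (17)
        for j in range(len(w)):
            th[j] += dth[j]
            for i in range(len(w[j])):
                w[j][i] += dw[j][i]
    return outs[-1], corrections
```

To use the momentum form, the corrections returned by one call are passed as `prev` to the next call together with a non-zero `alpha`. All corrections are computed before any weight is changed, matching the order described above.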

[0041] In step S15, the learning control unit 40 makes the learning termination decision. As the weight coefficients $w_{ji}^{(q)}$ approach their optimal values through repetition of the learning processing, the internal error $\delta_{pj}^{(2)}$ of the output cells approaches 0. It is determined whether the internal error $\delta_{pj}^{(2)}$ of the output cells has become smaller than a sufficiently small value $\varepsilon$. If the error is still large, learning is judged not to have finished, 1 is added to $p$, and the process returns to step S11. If the error is small, all learning processing ends.

[0042] When the process returns to step S11, an interval 62 that is not centered on the teacher epoch point is taken as the presentation signal. The offset of the center of interval 62 from the teacher epoch point is random. The presentation signal in this case, $as_{\phi}^{0.1}(m)$, is power-normalized by equation (18) and an offset is added, giving the output $o_{pj}^{(0)}$ of each cell 21 of the input layer.

[0043]

[Math 4]

[0044] The teacher signal $t_{pj}$ in this case is set to the same value, 0.1, for all output cells 21, and from step S12 onward the same processing as described above is performed.

[0045] By repeating this learning processing, in which $as_{\phi}^{0.9}(m)$ and $as_{\phi}^{0.1}(m)$ are presented alternately, for many phonemes $\phi$ and their presentation signals, optimal weight coefficients are finally obtained.
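The alternating presentation described above can be sketched as follows. This Python illustration assumes a hypothetical bound `max_shift` on the random offset (the text only states that the offset is random), zero padding at the ends of the signal, and hypothetical function names.

```python
import random

def make_presentation(signal, epoch, phoneme, centered, rng,
                      length=512, max_shift=40, n_phonemes=5):
    """Return (segment, teacher) for one presentation.
    centered=True : window centred on the teacher epoch point (interval 63),
                    teacher 0.9 for `phoneme` and 0.1 for the others.
    centered=False: window shifted by a random non-zero offset (interval 62),
                    teacher 0.1 for every output cell."""
    shift = 0
    if not centered:
        shift = rng.choice([-1, 1]) * rng.randint(1, max_shift)
    start = epoch + shift - length // 2
    segment = [signal[i] if 0 <= i < len(signal) else 0.0
               for i in range(start, start + length)]
    teacher = [0.1] * n_phonemes
    if centered:
        teacher[phoneme] = 0.9
    return segment, teacher

def presentation_schedule(examples, n_rounds):
    """Yield (signal, epoch, phoneme, centered), alternating centred and
    off-centre presentations over all labelled examples."""
    for _ in range(n_rounds):
        for signal, epoch, phoneme in examples:
            yield signal, epoch, phoneme, True
            yield signal, epoch, phoneme, False
```

Each yielded item would then be power-normalized and passed through steps S12 to S15. Training on off-centre windows with a uniformly low teacher signal is what makes the trained output peak only when the window is aligned with an epoch point.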

[0046] (II) Phoneme recognition processing using the trained neural network

As shown in FIG. 1(b), the phoneme recognition processing consists of input processing for the neural network (step S20), forward propagation through the neural network (step S21), peak detection on the outputs of the neural network (step S22), and the phoneme recognition decision (step S23). The content of each of steps S20 to S23 is described below with reference to FIGS. 3 and 5.

[0047] FIG. 5 is an explanatory diagram showing the relationship between the input speech signal and the output signal series during phoneme recognition in the first embodiment. In FIG. 5, 70 is the input speech waveform subject to phoneme recognition, 71 is one interval of the input speech waveform that is input to the neural network, and 72 is the output of the neural network for interval 71. 73 is the next interval of the input speech waveform that is input to the neural network, 74 is the output of the neural network for interval 73, 75 is the output signal series of the neural network obtained by the phoneme recognition processing, and 76 is the time of a peak extracted from the output signal series.

[0048] In the phoneme recognition processing, the input speech waveform, which is discrete in the time domain and whose phoneme is unknown, is denoted $x(m)$. In this embodiment, speech different from the presentation signals used in the learning processing is used as the input speech waveform. The input speech waveform in the time interval centered on the time $u$ currently of interest is denoted $x_u(m)$.

[0049] In step S20, as the input to the neural network 20, the input speech waveform $x_u(m)$ is power-normalized by equation (19) and an offset is added, giving the output $o_{uj}^{(0)}$ of each cell 21 of the input layer.

[0050]

[Math 5]

【００５１】where
o_uj(q): output of the j-th cell 21 in the q-th layer for the input speech waveform centered on time u
C: positive constant for normalization

In step S21, forward propagation processing of the neural network 20 is performed. This processing is carried out in the same way as the forward propagation processing in the learning processing, by replacing p in equation (11) with u. As a result, the output o_uj(2) is obtained from the cells 21 of the output layer. Next, the input speech waveform is taken from the time interval centered on time u+1 and the same processing is performed. By repeating this processing, a sequence over time u of the outputs o_uj(2) corresponding to each phoneme is obtained. An example of this output sequence is indicated by reference numeral 76 in FIG. 5.
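
As a rough illustration of steps S20 and S21, the sketch below slides a fixed-width window over the input waveform one sample at a time and normalizes each window before it would be fed to the network. Equation (19) is not reproduced in this text, so the normalization used here (dividing by the root-mean-square value, scaling by a positive constant, and adding a fixed offset) is an assumption, as are the function names and the default values of `c` and `offset`.

```python
import math


def normalize_window(samples, c=1.0, offset=0.5):
    """Power-normalize one window and add an offset.

    Assumed stand-in for equation (19): divide by the RMS value of the
    window, scale by the positive constant c, then add the offset.
    """
    power = sum(s * s for s in samples) / len(samples)
    scale = c / math.sqrt(power) if power > 0 else 0.0  # silent window -> offset only
    return [scale * s + offset for s in samples]


def sliding_windows(x, width):
    """Yield (u, window) for each centre time u at which a full window fits."""
    half = width // 2
    for u in range(half, len(x) - width + half + 1):
        yield u, x[u - half:u - half + width]


# Hypothetical usage: a short test signal and an 8-sample window.
signal = [math.sin(0.3 * m) for m in range(32)]
inputs = [(u, normalize_window(w)) for u, w in sliding_windows(signal, 8)]
```

Each element of `inputs` pairs a centre time u with the values that would be placed on the input-layer cells; in the embodiment the window width would equal the number of input cells.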

【００５２】When the center of the time interval from which the input speech waveform is extracted coincides with an epoch point of a phoneme in the input speech waveform, a large peak occurs in the output sequence corresponding to that phoneme. By detecting these peaks and comparing the peak magnitudes across the output sequences, the phoneme of the input speech waveform can be determined.

【００５３】In step S22, peak detection processing is performed on the output sequences. A discrete time v_dj at which the output satisfies both of conditions (20) and (21) is detected as the time of an epoch point of that phoneme. Here, p is the threshold for detecting peaks, and in this embodiment the constant 0.5 is used. d is the number assigned to a detected peak, and j is the phoneme type corresponding to the output.
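
The peak detection of step S22 might be sketched as follows. Conditions (20) and (21) are not reproduced in this text, so the code assumes one plausible reading: the output must exceed the threshold p, and it must be a local maximum of the sequence. The function name and the sample sequence are illustrative only.

```python
def detect_peaks(outputs, threshold=0.5):
    """Return the times v at which an output sequence has a peak.

    Assumed reading of conditions (20) and (21): the output exceeds the
    threshold p, and it is a local maximum of the sequence.
    """
    peaks = []
    for v in range(1, len(outputs) - 1):
        above = outputs[v] > threshold
        local_max = outputs[v - 1] < outputs[v] >= outputs[v + 1]
        if above and local_max:
            peaks.append(v)
    return peaks


# Hypothetical output sequence for one phoneme.
sequence = [0.1, 0.2, 0.9, 0.3, 0.1, 0.6, 0.4, 0.45, 0.2]
peak_times = detect_peaks(sequence)
```

In this sample the local maximum of 0.45 is ignored because it does not exceed the threshold of 0.5.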

【００５４】Finally, in step S23, the magnitudes of the peaks of the output sequences at the detected epoch points of each phoneme are compared to give the phoneme recognition result. In this embodiment, the absolute magnitudes of the peaks are simply averaged over the desired time interval (here, from time T_s to T_e) and compared, and the phoneme corresponding to the output with the largest average peak value is output as the phoneme recognition result. Expressed as a formula, the evaluation value R_j of each phoneme is first calculated by equation (22), where N_j is the number of peak times v_dj contained in the interval from T_s to T_e. The evaluation values R_j are then compared over all phonemes, and the phoneme corresponding to the j that gives the maximum evaluation value is output as the phoneme recognition result.
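
The decision rule of step S23 can be illustrated with a short sketch. Equation (22) is not reproduced in this text; the code reads it, as the prose describes, as the arithmetic mean of the output values at the peak times that fall inside the interval from T_s to T_e. Treating the interval bounds as inclusive and giving a phoneme with no peaks a score of 0 are assumptions, as are all names and sample values.

```python
def evaluation_value(outputs, peak_times, t_start, t_end):
    """Mean output value at the peaks that fall inside [t_start, t_end]."""
    inside = [v for v in peak_times if t_start <= v <= t_end]
    if not inside:  # N_j = 0: no peak for this phoneme in the interval
        return 0.0
    return sum(outputs[v] for v in inside) / len(inside)


def recognize(sequences, peaks, t_start, t_end):
    """Return the phoneme with the largest evaluation value, plus all scores."""
    scores = {j: evaluation_value(sequences[j], peaks[j], t_start, t_end)
              for j in sequences}
    return max(scores, key=scores.get), scores


# Hypothetical output sequences and detected peak times for two phonemes.
sequences = {"a": [0.1, 0.9, 0.1, 0.8, 0.1], "i": [0.1, 0.6, 0.1, 0.1, 0.1]}
peaks = {"a": [1, 3], "i": [1]}
best, scores = recognize(sequences, peaks, 0, 4)
```

Here `best` is the key of the sequence whose peaks have the larger mean value.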

【００５５】The method used for this recognition result determination processing is arbitrary, and various statistical distance measures and the like may be used.

【００５６】As described above, the first embodiment has the following advantages.

【００５７】(i) Because feature extraction and phoneme recognition of voiced sounds are performed by a single integrated neural network 20, the learning processing automatically forms processing in which optimal feature extraction and optimal phoneme recognition are combined, and high-performance phoneme recognition output is obtained.

【００５８】(ii) The fundamental period of the speech and the phoneme recognition result are obtained simultaneously, giving recognition results with high temporal resolution. Therefore, when the method of this embodiment is used for continuous speech recognition, for example, phoneme recognition and fundamental frequency extraction need not be performed separately, and the phoneme information and the fundamental frequency can be obtained in a single process.

【００５９】Second Embodiment
FIG. 7 is a functional block diagram of a voiced phoneme recognition apparatus according to a second embodiment of the present invention.

【００６０】In this voiced phoneme recognition apparatus, a neural network group 20A is provided in place of the neural network 20 of FIG. 3, and correspondingly a waveform input unit 10A, a recognition evaluation unit 30A, a learning control unit 40A, and a teacher signal input unit 50A, each substantially the same as in FIG. 3, are provided. The neural network group 20A has a plurality of neural networks 20, one for each phoneme to be recognized, and each neural network 20 is composed of a large number of processing units (i.e., cells) 21.

【００６１】FIG. 6 is a flowchart showing the processing procedure of the voiced phoneme recognition method using the apparatus of FIG. 7. As in FIG. 1, the processing procedure of the voiced phoneme recognition method of this embodiment consists of the learning procedure for the neural network group shown in FIG. 6(a) and the phoneme recognition procedure, shown in FIG. 6(b), performed by the neural network group once learning is complete. The (IA) learning processing and the (IIA) phoneme recognition processing are described below.

【００６２】(IA) Learning processing of the neural network group
In this learning processing, initialization processing (step S10A), presentation signal input processing (step S11A), forward propagation processing (step S12A), output error calculation processing (step S13A), error backpropagation learning processing (step S14A), and learning completion determination processing (step S15A) are executed for the neural network group, instead of the neural network of FIG. 1, in substantially the same manner as in FIG. 1(a).

【００６３】That is, after initialization processing is performed in step S10A in the same way as step S10 of FIG. 1(a), the waveform input unit 10A of FIG. 7 performs presentation signal input processing in step S11A.

【００６４】In step S11A, in substantially the same way as step S11 of FIG. 1(a), the waveform input unit 10A extracts the presentation signal 60 of FIG. 4 from the time interval 63 centered on a teacher epoch point. The length of this time interval 63 is a number of samples equal to the number of cells in the input layer of each neural network 20 in the neural network group 20A.

【００６５】The structure of each neural network 20, such as its number of layers, may differ for each phoneme as in the first embodiment, but in this embodiment all neural networks 20 have the same structure, and 512 samples, equal to the number of cells in the input layer, are used as the presentation signal. In this embodiment the phoneme φ of the presentation signal is regarded as a number from 0 to 4 corresponding to the five vowel phonemes, and the presentation signal for phoneme number φ in this case is denoted asφ0.9(m).

【００６６】The waveform input unit 10A power-normalizes the presentation signal asφ0.9(m) by equation (10), adds an offset, and sets the result as the output o_pj(β, 0) of each cell 21 of the input layer of every neural network 20 in the neural network group 20A. Here, o_pj(β, q) denotes the output of the j-th cell 21 for the p-th pattern in the q-th layer of the neural network 20 corresponding to the β-th phoneme. In this embodiment, as in the first embodiment, the presentation signal asφ0.9(m) is the first pattern and the input layer is the 0th layer.

【００６７】In step S12A, in substantially the same way as step S12, forward propagation processing is performed for each neural network 20 in the neural network group 20A. The neural networks 20 in this embodiment have the same three-layer structure as in the first embodiment, and the output of the q-th layer of the neural network 20 corresponding to the β-th phoneme is calculated by the following equation (11A).

【００６８】[0068]

【数６】[Math 6]

【００６９】In equation (11A), o_pj(β, q) is the output of the j-th cell 21 in the q-th layer of the neural network 20 corresponding to the β-th phoneme when the p-th pattern is presented. w_ji(β, q) is the weight coefficient from the i-th cell 21 of the (q-1)-th layer, which supplies the input to the q-th layer, to the j-th cell 21 of the q-th layer of that neural network 20, and θ_j(β, q) is the bias of the j-th cell 21 in the q-th layer of that neural network 20. In this embodiment all neural networks 20 have the same structure, and the numbers of cells in the layers are N_0 = 512, N_1 = 64, and N_2 = 1.

【００７０】The structure of each neural network 20, such as the number of layers and the number of cells, may differ for each phoneme. Before learning, w_ji(β, q) and θ_j(β, q) are set to small random values. Equation (11A) is evaluated in order for all β, q, and j to obtain the output o_pj(β, 2) of the cell 21 in the output layer, which is the second layer of each neural network 20.
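
The forward propagation of step S12A over a group of 512-64-1 networks might look like the following sketch. Equation (11A) is not reproduced in this text, so the cell function used here (a logistic sigmoid applied to the weighted sum plus the bias) is an assumption, as are the initialization range and all function names.

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def init_network(sizes, rng, scale=0.1):
    """Create one network as a list of (weights, biases), one pair per layer.

    weights[j][i] links cell i of layer q-1 to cell j of layer q; all
    parameters start as small random values.
    """
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
                   for _ in range(n_out)]
        biases = [rng.uniform(-scale, scale) for _ in range(n_out)]
        layers.append((weights, biases))
    return layers


def forward(inputs, layers):
    """Return the outputs of every layer, starting with the input layer."""
    outputs = [list(inputs)]
    for weights, biases in layers:
        prev = outputs[-1]
        outputs.append([sigmoid(sum(w * o for w, o in zip(row, prev)) + b)
                        for row, b in zip(weights, biases)])
    return outputs


# One 512-64-1 network per vowel (phoneme numbers 0 to 4).
rng = random.Random(0)
group = [init_network([512, 64, 1], rng) for _ in range(5)]
window = [0.5] * 512  # placeholder for one normalized input window
group_outputs = [forward(window, net)[-1][0] for net in group]
```

Every network receives the same input window, and `group_outputs` holds the single output-cell value of each network. Keeping the outputs of all layers is convenient because the error backpropagation step needs them.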

【００７１】In step S13A, in substantially the same way as step S13, the learning control unit 40A performs output error calculation processing. Before this, the teacher signal input unit 50A sets the teacher signals for the presentation signal: 0.9 for the output cell 21 of the neural network 20 corresponding to the phoneme φ of the presentation signal, and 0.1 for the output cells 21 of the neural networks 20 corresponding to the other phonemes. That is, the teacher signal t_p(β) of the output cell 21 of the neural network 20 corresponding to phoneme β, for the p-th presentation signal (phoneme φ), is

$$t_p(\beta) = \begin{cases} 0.9 & (\beta = \phi) \\ 0.1 & (\beta \neq \phi) \end{cases} \tag{12A}$$

【００７２】In substantially the same way as in the first embodiment, the learning control unit 40A compares the teacher signal from the teacher signal input unit 50A with the output of the neural network group 20A as follows. Let δ_pj(β, q) denote the internal error, for the p-th presentation signal, of the j-th cell 21 in the q-th layer of the neural network 20 corresponding to the β-th phoneme. The internal error δ_pj(β, 2) of the cells in the output layer is calculated by equation (13A).

In step S14A, in substantially the same way as in the first embodiment, the learning control unit 40A performs error backpropagation learning processing on the neural network group 20A. Once the internal errors δ_pj(β, q) of the cells 21 in the q-th layer of each neural network 20 have been calculated, the internal errors δ_pj(β, q-1) of the cells 21 in the (q-1)-th layer are calculated by equation (14A). Then, using the internal error δ_pj(β, q), the correction Δ_p w_ji(β, q) to the weight coefficient w_ji(β, q) from the (q-1)-th layer to the q-th layer is calculated by the following equation (15A).

$$\Delta_p w_{ji}(\beta, q) = \eta\,\delta_{pj}(\beta, q)\,o_{pi}(\beta, q-1) \tag{15A}$$

As in the first embodiment, the following equation may be used instead to speed up learning.

$$\Delta_p w_{ji}(\beta, q) = \alpha\,\Delta_{p-1} w_{ji}(\beta, q) + \eta\,\delta_{pj}(\beta, q)\,o_{pi}(\beta, q-1) \tag{15A-1}$$

In equation (15A-1), α and η are constants that determine the learning speed. When equation (15A-1) is used, Δ_0 w_ji(β, q) = 0 is set in the initialization step S10A. The correction Δ_p θ_j(β, q) to the bias θ_j(β, q) of the j-th cell 21 in the q-th layer of the neural network 20 corresponding to the β-th phoneme is likewise calculated by the following equation (16A).

$$\Delta_p \theta_j(\beta, q) = \eta\,\delta_{pj}(\beta, q) \tag{16A}$$

As in the first embodiment, the following equation (16A-1) may be used instead to speed up learning.

$$\Delta_p \theta_j(\beta, q) = \alpha\,\Delta_{p-1} \theta_j(\beta, q) + \eta\,\delta_{pj}(\beta, q) \tag{16A-1}$$

Here, α is a constant that determines the learning speed. When equation (16A-1) is used, Δ_0 θ_j(β, q) = 0 is set in the initialization step S10A.

【００７３】The calculations of equations (15A) and (16A) are carried out for every neural network 20, for the output layer and the intermediate layer, while decrementing the layer number q, to obtain the corrections Δ_p w_ji(β, q) and Δ_p θ_j(β, q) for all weight coefficients w_ji(β, q) and biases θ_j(β, q). After all corrections Δ_p w_ji(β, q) and Δ_p θ_j(β, q) have been calculated, they are used to update all weight coefficients w_ji(β, q) and biases θ_j(β, q) according to the following equation (17A).

$$w_{ji}(\beta, q) = w_{ji}(\beta, q) + \Delta_p w_{ji}(\beta, q), \qquad \theta_j(\beta, q) = \theta_j(\beta, q) + \Delta_p \theta_j(\beta, q) \tag{17A}$$

The above processing is carried out with p = 1 for the presentation signal asφ0.9(m). The teacher signal t_p(β) used at this time is the one given by equation (12A) above.
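
The update rules (15A-1), (16A-1), and (17A) can be sketched for a single layer of one network as follows. The internal errors `delta` are taken as given, since equations (13A) and (14A) are not reproduced in this text. The values of `eta` and `alpha` and all names are illustrative assumptions; setting `alpha` to 0 reduces the rule to equations (15A) and (16A).

```python
def momentum_update(weights, biases, prev_dw, prev_db, delta, prev_out,
                    eta=0.1, alpha=0.9):
    """Update one layer in place and return the corrections (dw, db).

    dw[j][i] = alpha * prev_dw[j][i] + eta * delta[j] * prev_out[i]  (15A-1)
    db[j]    = alpha * prev_db[j]    + eta * delta[j]                (16A-1)
    followed by w += dw and theta += db                              (17A)
    """
    dw, db = [], []
    for j in range(len(weights)):
        row = [alpha * prev_dw[j][i] + eta * delta[j] * prev_out[i]
               for i in range(len(prev_out))]
        dw.append(row)
        db.append(alpha * prev_db[j] + eta * delta[j])
    # All corrections are computed first and only then applied.
    for j in range(len(weights)):
        for i in range(len(prev_out)):
            weights[j][i] += dw[j][i]
        biases[j] += db[j]
    return dw, db


# Hypothetical layer with two inputs and one cell; corrections start at zero.
w, theta = [[0.0, 0.0]], [0.0]
dw, db = [[0.0, 0.0]], [0.0]
for _ in range(2):  # present the same error twice to see the momentum term act
    dw, db = momentum_update(w, theta, dw, db, delta=[0.5], prev_out=[1.0, 2.0])
```

On the second presentation the correction is larger than on the first because the previous correction is carried over through the α term.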

【００７４】In step S15A, in substantially the same way as in the first embodiment, the learning control unit 40A performs learning completion determination processing. As the weight coefficients w_ji(β, q) approach their optimal values through repetition of the learning processing, the internal error δ_pj(β, 2) of the output cells approaches 0. It is determined whether the internal errors δ_pj(β, 2) of the output cells of all neural networks 20 have become smaller than a sufficiently small value ε. If an internal error δ_pj(β, 2) of an output cell is still large, learning is judged to be incomplete, p is incremented by 1, and the processing returns to step S11A. If the internal errors δ_pj(β, 2) of the output cells of all neural networks 20 are smaller than ε, all learning processing ends.
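
The completion test of step S15A amounts to a check over the output-cell errors of all networks in the group. A minimal sketch follows; comparing the magnitude of each error with ε is an assumption (the text only says the error should be smaller than ε), as are the value of ε and the function name.

```python
def learning_finished(output_errors, epsilon=0.01):
    """True when the output-cell error of every network is below epsilon in magnitude."""
    return all(abs(delta) < epsilon for delta in output_errors)


# Hypothetical output-cell errors of five networks after two different patterns.
still_learning = learning_finished([0.002, -0.004, 0.3, 0.001, 0.0])
done = learning_finished([0.002, -0.004, 0.003, 0.001, 0.0])
```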

【００７５】When the processing returns to step S11A, section 62 of FIG. 4, which is not centered on the teacher epoch point described above, is taken as the presentation signal. The offset of the center of section 62 from the teacher epoch point is random. The presentation signal asφ0.1(m) in this case is power-normalized by equation (18), an offset is added, and the result is used as the output o_pj(β, 0) of each cell 21 of the input layer. The teacher signal t_p(β) at this time is the same value, 0.1, for all output cells 21 of all neural networks 20, and from step S12A onward the same processing as described above is performed.
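
The two kinds of presentation signal and their teacher signals can be sketched as below: one window centered on a teacher epoch point with targets 0.9 and 0.1 per network, and one window shifted by a random non-zero amount with target 0.1 for every network. The power normalization of equations (10) and (18) is left out, and the function name, the `max_shift` bound on the random offset, and the sample signal are assumptions made for illustration.

```python
import random


def training_examples(signal, epoch, width, phoneme, n_phonemes, rng, max_shift):
    """Return ((window, targets), (window, targets)) for one teacher epoch point.

    The first pair is centred on the epoch point, with target 0.9 for the
    network of the presented phoneme and 0.1 for the others. The second is
    shifted by a random non-zero amount, with target 0.1 for every network.
    """
    half = width // 2
    lo = epoch - max_shift - half
    hi = epoch + max_shift - half + width
    if max_shift < 1 or lo < 0 or hi > len(signal):
        raise ValueError("windows do not fit inside the signal")

    centred = signal[epoch - half:epoch - half + width]
    on_targets = [0.9 if beta == phoneme else 0.1 for beta in range(n_phonemes)]

    shift = rng.choice([s for s in range(-max_shift, max_shift + 1) if s != 0])
    start = epoch + shift - half
    shifted = signal[start:start + width]
    off_targets = [0.1] * n_phonemes
    return (centred, on_targets), (shifted, off_targets)


# Hypothetical usage: a ramp signal makes the extracted windows easy to inspect.
rng = random.Random(1)
on_example, off_example = training_examples(
    list(range(100)), epoch=50, width=8, phoneme=2, n_phonemes=5,
    rng=rng, max_shift=10)
```

Presenting the two examples alternately corresponds to the alternation of the centered and off-center presentation signals in the learning loop.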

【００７６】As described above, in substantially the same way as in the first embodiment, the learning processing in which asφ0.9(m) and asφ0.1(m) are presented alternately is repeated for many phonemes φ and their presentation signals, so that optimal weight coefficients are finally obtained.

【００７７】(IIA) Phoneme recognition processing using the trained neural network group
In this phoneme recognition processing, input processing (step S20A), forward propagation processing (step S21A), output peak detection processing (step S22A), and phoneme recognition determination processing (step S23A) are executed for the neural network group, instead of the neural network of FIG. 1, in substantially the same manner as in FIG. 1(b).

【００７８】That is, in step S20A, as the input to all neural networks 20, the input speech waveform x_u(m) is power-normalized by equation (19), an offset is added, and the result is used as the output o_uj(β, 0) of each cell 21 of the input layer of each neural network 20. Here, o_uj(β, q) denotes the output of the j-th cell 21 in the q-th layer of the neural network 20 corresponding to the β-th phoneme, for the input speech waveform centered on time u.

【００７９】In step S21A, in substantially the same way as in the first embodiment, forward propagation processing of the neural network group 20A is performed. This processing is carried out in the same way as the forward propagation processing in the learning processing, by replacing p in equation (11A) with u. As a result, the output o_uj(β, 2) is obtained from the cell 21 of the output layer of each neural network 20. Next, the input speech waveform is taken from the time interval centered on time u+1 and the same processing is performed. By repeating this processing, a sequence over time u of the outputs o_uj(β, 2) of the neural networks 20 corresponding to each phoneme is obtained.
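
The data flow of steps S20A and S21A, in which every window is passed to every network of the group and one output sequence per phoneme is accumulated, might be sketched as follows. The networks are represented by plain callables that map a window to a single output value; the stand-ins used in the example (`max` and `min`) are not trained networks and only make the data flow visible. The `prepare` hook, where power normalization would be applied, and all names are assumptions.

```python
def scan_group(x, width, networks, prepare=lambda w: w):
    """Return {beta: [output at each centre time u]} for a group of networks.

    networks maps a phoneme number beta to a callable that takes one
    prepared window and returns the single output value of that network.
    """
    half = width // 2
    sequences = {beta: [] for beta in networks}
    for u in range(half, len(x) - width + half + 1):
        window = prepare(x[u - half:u - half + width])
        for beta, net in networks.items():
            sequences[beta].append(net(window))
    return sequences


# Stand-in "networks" (not trained models) that just expose the data flow.
networks = {0: max, 1: min}
sequences = scan_group([0, 1, 2, 3, 4, 5], 3, networks)
```

Each value of `sequences` is the output sequence of one network, which is what the peak detection of step S22A operates on.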

【００８０】An example of this output sequence is indicated by reference numeral 76 in FIG. 5. When the center of the time interval from which the input speech waveform is extracted coincides with an epoch point of a phoneme in the input speech waveform, a large peak occurs in the output sequence of the neural network 20 corresponding to that phoneme. By detecting these peaks and comparing the peak magnitudes of the output sequences of the neural networks 20, the phoneme of the input speech waveform can be determined.

【００８１】In step S22A, in substantially the same way as in the first embodiment, peak detection processing is performed on the output sequence of each neural network 20. A discrete time v_dβ at which the output satisfies conditions (20A) and (21A) is detected as the time of an epoch point of that phoneme.

【００８２】Here, p is the threshold for detecting peaks, and the constant 0.5 is used as in the first embodiment. d is the number assigned to a detected peak, and β is the phoneme type corresponding to the neural network 20.

【００８３】In step S23A, the magnitudes of the peaks of the output sequences of the neural networks at the detected epoch points of each phoneme are compared to give the phoneme recognition result. In this embodiment, as in the first embodiment, the absolute magnitudes of the peaks are simply averaged over the desired time interval (times T_s to T_e) and compared, and the phoneme corresponding to the output with the largest average peak value is output as the phoneme recognition result. Expressed as a formula, the evaluation value R_β of each phoneme is first calculated by the following equation (22A).

【００８４】[0084]

【数７】[Math 7]

【００８５】Here, N_β is the number of peak times v_dβ of the neural network 20 corresponding to phoneme β that fall within the interval from T_s to T_e. The evaluation values R_β are then compared over all phonemes, and the phoneme corresponding to the β that gives the maximum evaluation value is output as the phoneme recognition result.

【００８６】As in the first embodiment, the method used for this phoneme recognition determination processing is arbitrary, and various statistical distance measures and the like may be used.

【００８７】As described above, in this second embodiment the feature extraction and phoneme recognition of voiced sounds are performed by integrated neural networks 20, each dedicated to one phoneme, so recognition results with higher performance and higher temporal resolution are obtained than with advantages (i) and (ii) of the first embodiment.

【００８８】[0088]

[Effects of the Invention] As described in detail above, according to the first invention, feature extraction and phoneme recognition of voiced sounds are performed by a single integrated neural network, so the learning processing automatically forms processing in which optimal feature extraction and optimal phoneme recognition are combined, and high-performance phoneme recognition output is obtained. Moreover, the fundamental period of the speech and the phoneme recognition result are obtained simultaneously, giving recognition results with high temporal resolution.

【００８９】Therefore, when the present invention is used for continuous speech recognition, for example, phoneme recognition and fundamental frequency extraction need not be performed separately, and the phoneme information and the fundamental frequency can be obtained in a single process.

【００９０】According to the second invention, feature extraction and phoneme recognition of voiced sounds are performed by integrated neural networks, each dedicated to one phoneme, so phoneme recognition results with higher performance and higher temporal resolution are obtained than with the first invention.

[Brief description of the drawings]

FIG. 1 is a flowchart of a voiced phoneme recognition method showing a first embodiment of the present invention.

FIG. 2 is a flowchart of a conventional voiced phoneme recognition method.

FIG. 3 is a functional block diagram of a voiced phoneme recognition apparatus in the first embodiment of the present invention.

FIG. 4 is a speech waveform diagram of the signal used as the presentation signal in the first embodiment of the present invention.

FIG. 5 is a diagram showing the relationship between the input speech signal and the output signal sequence in the first embodiment of the present invention.

FIG. 6 is a flowchart of a voiced phoneme recognition method showing a second embodiment of the present invention.

FIG. 7 is a functional block diagram of a voiced phoneme recognition apparatus in the second embodiment of the present invention.

[Explanation of symbols]

10, 10A: Waveform input unit
20: Neural network
20A: Neural network group
21: Cell
30, 30A: Recognition evaluation unit
40, 40A: Learning control unit
50, 50A: Teacher signal input unit
S10, S10A: Initialization processing

Claims

[Claims]

Claim 1: A voiced phoneme recognition method, characterized by: using a neural network that has undergone learning processing in which signal values obtained by normalizing by power the magnitude of a voiced phoneme signal over a certain time interval are input as a presentation signal for learning, such that the output corresponding to the input phoneme takes a large value when a predetermined position in the input time interval coincides with an epoch point of the input signal and takes a small value otherwise; inputting to the neural network signal values obtained by normalizing by power, over a certain time interval, the magnitude of an input speech waveform whose phoneme is unknown; detecting the peaks of the plurality of output sequences of the neural network obtained by shifting the time of the applied input speech waveform; and performing phoneme recognition determination based on the peak detection result, the phoneme corresponding to the output that produces the largest peak value being taken as the phoneme recognition result.

Claim 2: A voiced phoneme recognition method, characterized by: using a neural network group consisting of a plurality of neural networks that have undergone learning processing by the error backpropagation method, in which signal values obtained by normalizing by power the magnitude of a voiced phoneme signal over a certain time interval are input as a presentation signal for learning, such that the output takes a large value when a predetermined position in the input time interval coincides with an epoch point of the input signal and the phoneme of the input signal is of the set phoneme type, and takes a small value otherwise; inputting to the neural network group signal values obtained by normalizing by power, over a certain time interval, the magnitude of an input speech waveform whose phoneme is unknown; detecting the peaks of each output sequence of the neural network group obtained by shifting the time of the applied input speech waveform; and performing phoneme recognition determination based on the peak detection result, the phoneme corresponding to the neural network that produces the largest peak value being taken as the phoneme recognition result.