JP2518007B2

JP2518007B2 - Dynamic neural network with learning mechanism

Info

Publication number: JP2518007B2
Application number: JP63070617A
Authority: JP
Inventors: Ken-ichi Iso (磯健一); Hiroaki Sakoe (迫江博昭)
Original assignee: Nippon Electric Co Ltd
Current assignee: NEC Corp
Priority date: 1988-03-23
Filing date: 1988-03-23
Publication date: 1996-07-24
Anticipated expiration: 2011-07-24
Also published as: JPH01241667A

Description

[Detailed Description of the Invention] (Field of Industrial Application) The present invention relates to a dynamic neural network with a pattern learning mechanism, used for recognizing time-series patterns such as speech.

(Prior Art) A neural network is an information processing model inspired by the nervous system of a living brain. That nervous system is an information processing system made up of nerve cells with relatively simple operating characteristics and a large number of connections between them. The model consists of processing units corresponding to nerve cells (hereinafter simply "units") and inter-unit connections that link them. Changing the coefficients of these inter-unit connections lets the system carry out a variety of information processing operations.

This neural network model is expected to be particularly effective as an information processing system for pattern recognition of images, speech, and the like. It is described in detail in "Using Neural Nets for Pattern Recognition, Signal Processing, and Knowledge Processing," Nikkei Electronics, No. 427, p. 115 (August 10, 1987), hereinafter Document 1.

According to Document 1, a neural network has a hierarchical structure of layers called the input layer, the intermediate layers, and the output layer, as shown in FIG. 2. Each layer consists of a plurality of units. Inter-unit connections are allowed only between adjacent layers, and connections between units within the same layer are prohibited.

During recognition, the input data are given as the activities of the input-layer units. The information is then propagated through the inter-unit connections to successive adjacent intermediate layers until it reaches the output layer. The network's response to the input data is obtained as the pattern of activities of the output-layer units.
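The layered propagation just described can be sketched in a few lines. This is a minimal pure-Python sketch, not Document 1's formulation. It assumes sigmoid units with thresholds, of the same form the patent later uses in Equations (5) and (6). The function name `forward` and the 2-3-2 weights are hypothetical.

```python
import math

def sigmoid(x: float) -> float:
    """Unit response f(x) = 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))

def forward(layers, x):
    """Propagate input-layer activities x through successive layers.

    Each layer is (weights, thresholds); weights[j][i] connects unit i of the
    previous layer to unit j of this layer. No connections exist within a layer.
    """
    activity = list(x)
    for weights, thresholds in layers:
        activity = [
            sigmoid(sum(w * a for w, a in zip(row, activity)) - th)
            for row, th in zip(weights, thresholds)
        ]
    return activity

# Hypothetical 2-3-2 network (input, one intermediate layer, output).
net = [
    ([[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]], [0.0, 0.0, 0.0]),
    ([[1.0, 0.0, -1.0], [-1.0, 0.0, 1.0]], [0.0, 0.0]),
]
out = forward(net, [1.0, 0.0])  # output-layer activity pattern
```

Connections are stored only between consecutive layers, which is how the "adjacent layers only" restriction shows up in the data structure.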

To determine the inter-unit connections so that the network performs a specified operation, a technique called supervised learning is used. A pattern to be learned is presented to the input layer, and the corresponding teacher signal that should be output is presented to the output layer. The connection coefficients are then determined so as to reduce the difference between the teacher signal and the actual output values at the output layer. For a neural network with the structure described above, this output-error minimization learning is called back-propagation learning. Its detailed algorithm is given in Document 1.
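Output-error minimization can be illustrated at its smallest scale: a single sigmoid unit trained by gradient descent to follow a teacher signal. This is a minimal sketch under stated assumptions. The squared error, the step size `eps`, and the two training pairs are illustrative choices, not taken from Document 1.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def squared_error(w, data):
    # 0.5 * sum over training pairs of (actual output - teacher signal)^2
    return 0.5 * sum((sigmoid(w[0] * x + w[1]) - t) ** 2 for x, t in data)

def train_step(w, data, eps=0.5):
    """One gradient-descent step on the squared output error."""
    g0 = g1 = 0.0
    for x, t in data:
        y = sigmoid(w[0] * x + w[1])
        d = (y - t) * y * (1.0 - y)   # dE / d(net input) for this pair
        g0 += d * x
        g1 += d
    return [w[0] - eps * g0, w[1] - eps * g1]

data = [(0.0, 0.0), (1.0, 1.0)]   # (input, teacher signal)
w = [0.0, 0.0]                    # [weight, bias]
e_start = squared_error(w, data)
for _ in range(200):
    w = train_step(w, data)
e_end = squared_error(w, data)
```

The error falls as the weights move toward reproducing the teacher signal. Back-propagation extends the same idea to the intermediate layers.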

(Problems to Be Solved by the Invention) If such a neural network could be used for speech recognition, learning could absorb the variability of speech patterns, and good recognition performance might be achieved. To actually use the network described above for speech recognition, however, several problems must first be solved.

First, patterns of the same category (for example, the same word) differ in duration from utterance to utterance and from speaker to speaker. Some means is therefore needed to present speech patterns of different lengths to the input layer of the same neural network.

Second, once speech patterns of different lengths can be presented to the network input, a learning method must be established. That method must determine the inter-unit connections so that the network performs the expected recognition operation.

The present invention aims to provide a dynamic neural network that can accept speech patterns of different lengths, even though its input layer accepts a feature-parameter time series of fixed length.

- During recognition, the time axis of the input layer is aligned with the input speech time series so that the output of the output layer is maximized.
- During learning, when the inter-unit connection coefficients are determined, each pattern is normalized to a fixed duration before it is presented to the network. A supervised learning mechanism then minimizes the error at the output layer.

(Means for Solving the Problems) The present invention is a neural network for recognizing time-series patterns such as speech. It has a hierarchical structure consisting of an input layer, an output layer, and a plurality of intermediate layers. The input layer and the intermediate layers also have a time-series structure corresponding to the time axis.

During recognition, dynamic programming aligns the time axis of the input time-series pattern with the time axis of the input layer so that the network output is maximized. The output of the output layer at that alignment is taken as the recognition result.

The invention is characterized by a mechanism for supervised learning of the inter-unit connection coefficients between the layers of this dynamic neural network. In that mechanism:

- a learning time-series pattern, normalized to a fixed duration equal to the length of the input-layer time axis, is presented to the input layer;
- the corresponding teacher signal to be output is presented to the output layer;
- the connection coefficients are determined so as to reduce the difference between the teacher signal and the actual output values at the output layer.

(Operation) For simplicity, the principle of the present invention is explained using a three-layer model with a single intermediate layer. The same explanation applies when there are two or more intermediate layers.

The input layer of the model consists of J × P units, so that it can receive a time series of length J made of P-dimensional feature vectors. Let the output value of these input units be $y^{(1)}_j(p)$ ($j = 1, \dots, J$; $p = 1, \dots, P$).

In general, the length J of the input-layer time axis differs from the length I of the input time-series pattern $a_i(p)$ ($i = 1, \dots, I$; $p = 1, \dots, P$) presented during recognition. The time axis of the input series must therefore be stretched or compressed to match length J. The correspondence on the plane (i, j), spanned by the input-layer time axis j and the input-pattern time axis i, is expressed as follows.

$$ c(k) = (i(k), j(k)), \qquad k = 1, \dots, K \qquad (1) $$

Using this correspondence, the output values $y^{(1)}_{j(k)}(p)$ of the input units ($p = 1, \dots, P$) are expressed as

$$ y^{(1)}_{j(k)}(p) = a_{i(k)}(p) \qquad (3) $$

That is, the input units pass the input data unchanged to the next layer, with the time axes aligned.
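Equations (1) and (3) amount to a simple lookup along the alignment path. In this sketch, each path step c(k) = (i(k), j(k)) routes frame a_{i(k)} into input column j(k). The path endpoints c(1) = (1, 1) and c(K) = (I, J) and the non-decreasing indices are assumptions, because the path conditions are not stated above. The helper names are hypothetical.

```python
def input_layer_from_path(a, path):
    """Eq. (3): at step k, input column j(k) receives frame a_{i(k)} unchanged.

    a    : list of I feature vectors (frame i is a[i - 1]; the text counts from 1)
    path : list of (i, j) pairs c(1), ..., c(K) as in Eq. (1)
    Returns one (j, frame) pair per step k.
    """
    return [(j, a[i - 1]) for i, j in path]

def is_valid_path(path, I, J):
    """Hypothetical check: c(1) = (1, 1), c(K) = (I, J), and i, j never decrease."""
    if path[0] != (1, 1) or path[-1] != (I, J):
        return False
    return all(i1 >= i0 and j1 >= j0 for (i0, j0), (i1, j1) in zip(path, path[1:]))

a = [[0.1], [0.2], [0.3], [0.4]]          # I = 4 one-dimensional frames (P = 1)
c = [(1, 1), (2, 2), (3, 2), (4, 3)]      # a path onto an input layer with J = 3
pairs = input_layer_from_path(a, c)
```

In this example, input column 2 is visited at two steps (k = 2 and k = 3), so the input sequence of length 4 is compressed onto J = 3 columns.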

The intermediate layer consists of J × M units, called hidden units. The input value $x^{(2)}_j(m)$ ($j = 1, \dots, J$; $m = 1, \dots, M$) to each hidden unit is given by Equation (4). That equation is written in terms of:

- the input-unit outputs $y^{(1)}_j(p)$;
- the connection coefficients $\beta^0_j(m,p)$ and $\beta^1_j(m,p)$ between the input units and the hidden units.

A neural network with connections restricted in this way is said to have a time-series structure. In such a network, the j(k)-th hidden unit receives information only from the i(k)-th and i(k−1)-th input units.

When the data themselves have a time-series structure, as speech patterns do, such a network needs far fewer inter-unit connections than full connection, in which every input unit is linked to every hidden unit. Each hidden column needs 2 × P × M coefficients instead of J × P × M. The amount of computation during recognition and learning is therefore greatly reduced.

The response of a hidden unit to the input given by Equation (4) is:

$$ y^{(2)}_j(m) = f\left(x^{(2)}_j(m) - \theta^{(2)}_j(m)\right) \qquad (5) $$

$$ f(x) = \frac{1}{1 + e^{-x}} \qquad (6) $$

Here $\theta^{(2)}_j(m)$ is the threshold of hidden unit (j, m). As Equation (6) shows, each hidden unit acts as a kind of threshold logic element.
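A hidden column can be sketched by combining the time-series connectivity with Equations (5) and (6). Equation (4) is not reproduced above. This sketch therefore assumes the form suggested by the prose: $x^{(2)}_{j(k)}(m) = \sum_p \beta^0_{j(k)}(m,p)\, a_{i(k)}(p) + \beta^1_{j(k)}(m,p)\, a_{i(k-1)}(p)$. The function name and weights are hypothetical.

```python
import math

def sigmoid(x):
    # Eq. (6): f(x) = 1 / (1 + e^-x)
    return 1.0 / (1.0 + math.exp(-x))

def hidden_outputs(frame_cur, frame_prev, beta0, beta1, theta):
    """Outputs y2_j(m), m = 1..M, of hidden column j = j(k).

    Assumed form of Eq. (4):
        x2_j(m) = sum_p beta0[m][p] * a_{i(k)}(p) + beta1[m][p] * a_{i(k-1)}(p)
    beta0, beta1 : M x P coefficient lists for this column
    theta        : M thresholds for this column
    """
    out = []
    for b0, b1, th in zip(beta0, beta1, theta):
        x = sum(w * v for w, v in zip(b0, frame_cur)) + sum(w * v for w, v in zip(b1, frame_prev))
        out.append(sigmoid(x - th))   # Eq. (5)
    return out

# One column with M = 2 hidden units and P = 2 feature dimensions (hypothetical values).
y2 = hidden_outputs([1.0, 0.0], [0.0, 1.0],
                    beta0=[[1.0, 0.0], [0.0, 1.0]],
                    beta1=[[0.0, 1.0], [1.0, 0.0]],
                    theta=[2.0, 1.0])
```

Only two frames feed each column. This is where the 2 × P × M coefficient count per column comes from.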

The output layer consists of N units corresponding to the N categories to be recognized. The input value $x^{(3)}(n)$ ($n = 1, \dots, N$) to the n-th output unit is given by Equation (7). That equation uses the hidden-unit outputs $y^{(2)}_j(m)$ and the connection coefficients $\alpha^n(j,m)$ between the hidden units and the output units.

The input/output response of the output units has the same form as that of the hidden units (Equation (5)):

$$ y^{(3)}(n) = f\left(x^{(3)}(n) - \theta^{(3)}(n)\right) \qquad (8) $$

Here $\theta^{(3)}(n)$ is the threshold of output unit n.
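The output unit can be sketched the same way. Equation (7) is not shown, so this sketch assumes a full sum over hidden units: $x^{(3)}(n) = \sum_j \sum_m \alpha^n(j,m)\, y^{(2)}_j(m)$. The response then follows Equation (8). The names and numbers are illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def output_unit(y2, alpha_n, theta_n):
    """Output y3(n) of output unit n per Eq. (8).

    Assumed form of Eq. (7): x3(n) = sum_j sum_m alpha_n[j][m] * y2[j][m]
    y2      : J x M hidden-unit outputs
    alpha_n : J x M connection coefficients into unit n
    theta_n : threshold of unit n
    """
    x3 = sum(a * y for row_a, row_y in zip(alpha_n, y2) for a, y in zip(row_a, row_y))
    return sigmoid(x3 - theta_n)

# J = 2 columns, M = 2 hidden units each (hypothetical values): x3 = 2.0
y3 = output_unit([[0.5, 1.0], [0.0, 0.5]], [[2.0, 0.0], [1.0, 2.0]], theta_n=2.0)
```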

The network output $y^{(3)}(n)$ obtained in this way depends on the correspondence $\{c(k)\}$, given by Equation (1), between the time axis of the input series and the time axis of the input-unit layer. The final recognition result of the network for category n is the output value $o_n$ optimized (maximized) with respect to $\{c(k)\}$. This is Equation (9).

The function f in Equation (8) is monotonically increasing. Maximizing $y^{(3)}(n)$ in Equation (9) is therefore equivalent to maximizing the argument of f, so Equation (9) may be replaced by Equation (10). The sum over the feature-vector components p inside f() is omitted here.

Define the expression inside the braces {} of Equation (10) as $\gamma(c(k), c(k-1))$. Equation (10) then becomes Equation (11), and this optimization can be solved by the well-known dynamic programming method. Letting g(k) be the cumulative sum of $\gamma(c(k), c(k-1))$, it suffices to compute the recurrence of Equations (12) and (13) and obtain $o_n = g(K)$.
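Equations (9) to (13) are not reproduced in this text. The block below is one plausible reconstruction written only from the surrounding prose, so it is a hedged reading rather than the published form. The predecessor set of c(k−1) and the initial value of g are left unspecified. On this reading, $g(K)$ is the maximized argument of f, which is what the text identifies with $o_n$.

```latex
\begin{aligned}
o_n &= \max_{\{c(k)\}} f\big(x^{(3)}(n) - \theta^{(3)}(n)\big)
  && \text{(9)}\\
\arg\max_{\{c(k)\}} f\big(x^{(3)}(n) - \theta^{(3)}(n)\big)
  &= \arg\max_{\{c(k)\}} x^{(3)}(n)
  && \text{(10): } f \text{ increasing, } \theta^{(3)}(n) \text{ path-independent}\\
\max_{\{c(k)\}} x^{(3)}(n) &= \max_{\{c(k)\}} \sum_{k=1}^{K} \gamma\big(c(k), c(k-1)\big)
  && \text{(11)}\\
g(k) &= \gamma\big(c(k), c(k-1)\big) + \max_{c(k-1)} g(k-1),
  \quad g(K) = \max_{\{c(k)\}} x^{(3)}(n) && \text{(12), (13)}
\end{aligned}
```

Because f and the constant threshold do not change which path is optimal, the dynamic programming step can work on the additive inner terms alone.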

Next, we describe the learning method for determining the parameters of the neural network model:

- the inter-unit connection coefficients $\{\beta^0_j(m,p), \beta^1_j(m,p), \alpha^n(j,m)\}$;
- the thresholds $\{\theta^{(2)}_j(m), \theta^{(3)}(n)\}$.

Let the set of feature-vector time series used for learning category n be $A^{(n)}_q = \{a^n_{q,i}(p)\}$. Here q distinguishes the patterns within the same category, i indexes the time axis of the series, and p indexes the components of the feature vector at each time. The subscripts range over

$$ n = 1, \dots, N; \quad q = 1, \dots, Q^n; \quad i = 1, \dots, I^q; \quad p = 1, \dots, P \qquad (14) $$

To present these data $A^{(n)}_q$ to the network, the length $I^q$ of each series must be normalized to the length J of the input-layer time axis. During learning, the model parameters have not yet been optimized. It is therefore difficult to use dynamic programming here as is done during recognition.

Instead, for learning, a representative time-series pattern $A^{(n)}_{q_0}$ is selected from the set $A^{(n)}_q$ ($q = 1, \dots, Q^n$) of category-n data. The time axes of the other data $A^{(n)}_q$ ($q \ne q_0$) are then mapped onto the time axis of the representative pattern by DP matching, as follows.

Let j ($j = 1, \dots, J$) be the time axis of the representative pattern $A^{(n)}_{q_0}$, and let i ($i = 1, \dots, I$) be the time axis of the data $A^{(n)}_q$ ($q \ne q_0$) to be normalized. DP matching of the two patterns yields the correspondence between their time axes, the warping function $i = i(j)$. DP matching and warping functions are explained in detail in Nikkei Electronics, No. 329, p. 171 (November 7, 1983), hereinafter Document 2.

The warping function i(j) means that each point j on the time axis of the representative pattern is assigned the frame vector $a^n_{q,i(j)}$ of the learning data at time i = i(j). Depending on the local path constraints used in DP matching, the warping function may instead take the form j = j(i). In that case several frame vectors can correspond to the same j, and the same kind of time-axis mapping is obtained by averaging those frame vectors.
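Normalizing a learning pattern to the representative's time axis can be sketched as DP matching followed by frame averaging. The sketch assumes a plain DTW with the symmetric steps (i−1, j), (i, j−1), (i−1, j−1) and Euclidean frame distance. The patent defers the actual local path constraints to Document 2. Because these steps can map several frames i to one j, the averaging case from the text applies. The function names are hypothetical.

```python
import math

def dtw_path(ref, sample):
    """DTW between `sample` (axis i) and representative `ref` (axis j).

    Steps (i-1,j), (i,j-1), (i-1,j-1); Euclidean frame distance (assumptions).
    Returns (cumulative cost, path of 0-based (i, j) pairs).
    """
    I, J = len(sample), len(ref)
    INF = float("inf")
    D = [[INF] * J for _ in range(I)]
    for i in range(I):
        for j in range(J):
            d = math.dist(sample[i], ref[j])
            if i == 0 and j == 0:
                D[i][j] = d
                continue
            D[i][j] = d + min(
                D[i - 1][j] if i > 0 else INF,
                D[i][j - 1] if j > 0 else INF,
                D[i - 1][j - 1] if i > 0 and j > 0 else INF,
            )
    # Backtrack from the end point (I-1, J-1) to (0, 0).
    i, j = I - 1, J - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        cands = []
        if i > 0 and j > 0:
            cands.append((D[i - 1][j - 1], i - 1, j - 1))
        if i > 0:
            cands.append((D[i - 1][j], i - 1, j))
        if j > 0:
            cands.append((D[i][j - 1], i, j - 1))
        _, i, j = min(cands)
        path.append((i, j))
    path.reverse()
    return D[I - 1][J - 1], path

def normalize_to_reference(ref, sample):
    """Map `sample` onto the time axis of `ref`, averaging frames that share a j."""
    _, path = dtw_path(ref, sample)
    buckets = [[] for _ in ref]
    for i, j in path:
        buckets[j].append(sample[i])
    return [[sum(col) / len(frames) for col in zip(*frames)] for frames in buckets]

ref = [[0.0], [1.0], [2.0]]                       # representative pattern, J = 3
sample = [[0.0], [0.0], [1.0], [2.0], [2.0]]      # learning pattern, I = 5
normalized = normalize_to_reference(ref, sample)
```

Every column j appears on a DTW path with these steps, so no bucket is empty and the output always has the representative's length.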

As a result, the durations $I^q$, which varied from one learning pattern to another, are all normalized to the fixed length $I^{q_0}$. The length J of the input-layer time axis is set equal to $I^{q_0}$.

Various methods can be used to choose the representative pattern of category n. For example, the inter-pattern distance can be taken as the cumulative distance $d(A_{q_0}, A_q)$ obtained by DP matching within the pattern set of category n. One then chooses the $q_0$ that minimizes the quantity Δ given by Equation (15).

This $q_0$ is easily found by brute force: Δ is computed with each of $q = 1, \dots, Q^n$ in turn taken as $q_0$. Alternatively, any single pattern may simply be used as the representative.
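The brute-force choice of $q_0$ can be sketched as follows. Equation (15) is not shown. This sketch assumes $\Delta(q_0) = \sum_q d(A_{q_0}, A_q)$, the total DP distance from the candidate to every pattern in the category, and uses the same assumed DTW step set as a plain DP-matching distance. The function names are hypothetical.

```python
import math

def dtw_distance(x, y):
    """Cumulative DP-matching distance; steps (i-1,j), (i,j-1), (i-1,j-1) (an assumption)."""
    INF = float("inf")
    D = [[INF] * (len(y) + 1) for _ in range(len(x) + 1)]
    D[0][0] = 0.0
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            d = math.dist(x[i - 1], y[j - 1])
            D[i][j] = d + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[len(x)][len(y)]

def choose_representative(patterns):
    """Brute force over candidates q0, minimizing the assumed Delta(q0) = sum_q d(A_q0, A_q)."""
    def delta(q0):
        return sum(dtw_distance(patterns[q0], a) for a in patterns)
    return min(range(len(patterns)), key=delta)

patterns = [
    [[0.0], [1.0]],           # q = 0
    [[0.0], [1.0], [1.0]],    # q = 1, a time-stretched copy of q = 0
    [[5.0]],                  # q = 2, an outlier
]
q0 = choose_representative(patterns)
```

The cost is Q² DP matchings per category. For typical per-category training sets this fits the text's remark that the search is easy.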

Let the learning data whose time axis has been normalized to length J in this way be $A^{(n)}_q = \{a^n_{q,i}(p)\}$ ($i = 1, \dots, J$). Let the learning data of the other categories, normalized to the same length J, be $B^{(m)}_r = \{b^m_{r,i}(p)\}$ ($r = 1, \dots, R$). Hereafter B is called anti-learning data. Define:

- $y^{(3)}_q(n)$: the network output for the q-th learning data, with desired output $z_q(n) = 1.0$;
- $y^{(3)}_r(n)$: the output of the n-th unit for the r-th anti-learning data, with desired output $z_r(n) = 0.0$.

The output error E at the output-unit layer is then given by Equation (16).

E can be regarded as a function of the quantities that learning must determine: the inter-unit connection coefficients $\{\beta^0_j(m,p), \beta^1_j(m,p), \alpha^n(j,m)\}$ and the thresholds $\{\theta^{(2)}_j(m), \theta^{(3)}(n)\}$. These parameters can therefore be determined by minimizing E as the evaluation function.

A unit's threshold can be learned in the same way as an inter-unit connection. One imagines a virtual unit that always outputs 1 and treats the threshold as the connection coefficient to that unit.

Let $\omega^n_{ij}$ be the connection coefficient linking unit i of the n-th layer and unit j of the (n+1)-th layer, for two adjacent layers. Using the derivative of E with respect to $\omega^n_{ij}$, update

$$ \omega^n_{ij}(t+1) = \omega^n_{ij}(t) - \varepsilon \left(\frac{\partial E}{\partial \omega^n_{ij}}\right)_t \qquad (17) $$

Then, provided ε is sufficiently small,

$$ E(t+1) \le E(t) \qquad (18) $$

Here t is an integer indexing the iterative learning steps, and ε is a constant that sets the size of each correction. In short, learning the parameters amounts to repeatedly correcting $\omega^n_{ij}$ so as to reduce E. The coefficients $\omega^n_{ij}$ can be matched to the model's coefficients $\{\beta^0_j(m,p), \beta^1_j(m,p), \alpha^n(j,m), \theta^{(2)}_j(m), \theta^{(3)}(n)\}$, for example, as in Equation (19).
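The error function, the threshold-as-weight device, and the update of Equation (17) can be checked numerically on a single output unit. Equation (16) is not shown. This sketch assumes the usual half sum of squared errors, with target 1.0 for learning data and 0.0 for anti-learning data. The gradient is estimated by central differences here, because the analytic form belongs to Equations (20) and (21). The data and step size are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def unit_output(w, x):
    # Threshold as a weight on a virtual input that always outputs 1:
    # f(sum_i w_i x_i - theta) == f(sum_i w_i x_i + w_bias * 1) with w_bias = -theta.
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x + [1.0])))

def error(w, learn, anti):
    """Assumed form of Eq. (16) for one output unit."""
    e = sum((unit_output(w, x) - 1.0) ** 2 for x in learn)
    e += sum((unit_output(w, x) - 0.0) ** 2 for x in anti)
    return 0.5 * e

def gradient(w, learn, anti, h=1e-6):
    """Central-difference estimate of dE/dw."""
    g = []
    for k in range(len(w)):
        wp, wm = list(w), list(w)
        wp[k] += h
        wm[k] -= h
        g.append((error(wp, learn, anti) - error(wm, learn, anti)) / (2 * h))
    return g

def update(w, learn, anti, eps=0.5):
    """Eq. (17): w(t+1) = w(t) - eps * dE/dw."""
    return [wk - eps * gk for wk, gk in zip(w, gradient(w, learn, anti))]

learn = [[1.0, 0.0], [0.9, 0.1]]   # hypothetical category-n vectors
anti = [[0.0, 1.0], [0.1, 0.8]]    # hypothetical other-category vectors
w = [0.0, 0.0, 0.0]                # two weights plus bias (= -threshold)
history = [error(w, learn, anti)]
for t in range(50):
    w = update(w, learn, anti)
    history.append(error(w, learn, anti))
```

With this small ε, the recorded errors do not increase from step to step, matching Equation (18). A much larger ε could overshoot.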

Analytic calculation shows that the derivative of E is given by Equation (20).

Here $\delta^{(n+1)}_{i,q}$ is the error referred to the input of unit i of layer n+1 when the q-th learning (or anti-learning) data are presented to the input layer. $y^{(n)}_{j,q}$ is the output of unit j of layer n for the q-th learning data. $\delta^{(n)}_{i,q}$ can be computed with the recurrence of Equation (21).

In Equation (21):

- f(x) is the unit input/output response function of Equation (6);
- $x^{(n)}_i$ is the input to unit i of layer n;
- $z_i$ is the value that unit i of the N-th (output) layer should take, 1.0 for learning data and 0.0 for anti-learning data.

According to Equation (21), the error δ referred to each unit is computed starting at the output layer and moving toward the input layer. This learning method is therefore called the back-propagation learning method (see Document 1 for details).
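The back-propagation recurrence can be sketched for a small fully connected two-layer network and checked against finite differences. Equations (20) and (21) are not shown. The sketch uses the standard sigmoid and squared-error forms:

- output deltas: $(y - z)\,y(1 - y)$;
- hidden deltas: $y_j(1 - y_j) \sum_i \delta_i w_{ij}$;
- gradients: δ of the upper unit times y of the lower unit.

Thresholds are handled as weights to a constant-1 unit, as described above. The weights are hypothetical. The patent's time-series connectivity would only zero out some of these weights.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(W, x):
    """Activities per layer; every non-output layer gets a constant-1 unit (threshold weight)."""
    ys = [x + [1.0]]
    for k, layer in enumerate(W):
        out = [sigmoid(sum(w * y for w, y in zip(row, ys[-1]))) for row in layer]
        ys.append(out + [1.0] if k < len(W) - 1 else out)
    return ys

def error(W, x, z):
    return 0.5 * sum((y - t) ** 2 for y, t in zip(forward(W, x)[-1], z))

def backprop(W, x, z):
    """dE/dW[n][i][j] = delta_i (layer n+1) * y_j (layer n); deltas flow output -> input."""
    ys = forward(W, x)
    delta = [(y - t) * y * (1.0 - y) for y, t in zip(ys[-1], z)]   # output layer
    grads = [None] * len(W)
    for n in range(len(W) - 1, -1, -1):
        grads[n] = [[d * y for y in ys[n]] for d in delta]
        if n > 0:
            prev = ys[n]   # last entry is the constant-1 unit, which gets no delta
            delta = [prev[j] * (1.0 - prev[j]) * sum(delta[i] * W[n][i][j] for i in range(len(delta)))
                     for j in range(len(prev) - 1)]
    return grads

W = [
    [[0.1, -0.2, 0.05], [0.3, 0.4, -0.1]],   # 2 inputs + bias -> 2 hidden
    [[0.5, -0.6, 0.2]],                      # 2 hidden + bias -> 1 output
]
x, z = [1.0, 0.5], [1.0]
g = backprop(W, x, z)

# Finite-difference check of every coefficient.
h = 1e-6
max_diff = 0.0
for n, layer in enumerate(W):
    for i, row in enumerate(layer):
        for j in range(len(row)):
            orig = row[j]
            row[j] = orig + h
            ep = error(W, x, z)
            row[j] = orig - h
            em = error(W, x, z)
            row[j] = orig
            max_diff = max(max_diff, abs((ep - em) / (2 * h) - g[n][i][j]))
```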

In summary, one starts from a model whose inter-unit connection coefficients have arbitrary initial values. One then presents multiple learning and anti-learning data and applies the iterative correction above to each inter-unit connection. This yields a set of inter-unit connections that minimizes the error at the output layer, at least locally.

(Embodiment) An embodiment of the present invention is described below using the rule of FIG. 3 as the time-axis correspondence rule on the (i, j) plane. This rule fixes the relative position of c(k) and c(k−1) for the recurrence of Equation (13). In FIG. 3, if c(k) = (i, j), then c(k−1) can only be one of three points: (i−1, j), (i−1, j−1), or (i−1, j−2).

Because every allowed predecessor has i(k−1) = i − 1, the local term γ depends only on the current lattice point (i, j), and is written $\gamma^n(i,j)$. With this correspondence rule, Equations (12) and (13), which determine the output of the neural network, can be written as Equations (22) to (24). The recurrence is:

$$ g^n(i,j) = \gamma^n(i,j) + \max\left[g^n(i-1,j),\ g^n(i-1,j-1),\ g^n(i-1,j-2)\right] \qquad (24) $$

FIG. 1 is a block diagram of an embodiment that implements the present invention on the basis of Equations (22) to (24). The analysis unit 10 analyzes the input speech waveform data, converts it into a time series of feature vectors, and stores it in the pattern buffer unit 20. During learning, the pattern buffer unit 20 stores the time-series data for learning. During recognition, it stores the analysis data of the unknown utterance. A changeover switch that follows selects between the learning operation and the recognition operation.
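The recurrence of Equation (24) and the final decision over categories can be sketched directly on precomputed lattices $\gamma^n(i,j)$. Equation (22) is not shown. This sketch assumes the initial condition $g^n(1,1) = \gamma^n(1,1)$, with every other point in the first row unreachable, and uses 0-based indices. The function names are hypothetical.

```python
NEG_INF = float("-inf")

def accumulate(gamma):
    """Eq. (24) on a 0-based I x J lattice gamma[i][j]; returns g(I, J).

    Assumed initial condition: g(1,1) = gamma(1,1); all other g(1, j) are unreachable.
    Each step advances i by 1 and j by 0, 1 or 2 (the FIG. 3 rule).
    """
    I, J = len(gamma), len(gamma[0])
    g = [[NEG_INF] * J for _ in range(I)]
    g[0][0] = gamma[0][0]
    for i in range(1, I):
        for j in range(J):
            best = max(g[i - 1][j - s] for s in (0, 1, 2) if j - s >= 0)
            if best > NEG_INF:
                g[i][j] = gamma[i][j] + best
    return g[I - 1][J - 1]

def recognize(lattices):
    """Return (index n of the largest g^n(I, J), all scores), as in the recognition decision."""
    scores = [accumulate(gam) for gam in lattices]
    return max(range(len(scores)), key=scores.__getitem__), scores

A = [[1, 0, 0], [0, 5, 0], [0, 0, 1]]   # hypothetical lattice for category 0
B = [[1, 0, 0], [2, 0, 0], [0, 0, 1]]   # hypothetical lattice for category 1
result = recognize([A, B])
```

With these lattices, category 0 scores 1 + 5 + 1 = 7 through (1, 1), and category 1 scores 1 + 2 + 1 = 4 through (1, 0). If J > 2I − 1, the end point is unreachable under this rule and the score stays at negative infinity.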

The time-axis alignment unit 30 determines the representative pattern of each category in the learning data using Equation (15). It aligns the time axes by DP matching the other learning data to the representative pattern, and normalizes the time-axis length of all learning data to J.

The correction amount calculation unit 40 uses the learning data sent from the time-axis alignment unit 30 and the connection coefficients held in the inter-unit connection coefficient storage unit 50. From these it calculates the correction $\Delta\omega^n_{ij}$ for each connection coefficient $\omega^n_{ij}$ using Equations (17), (20), and (21), and sends the corrections to the connection coefficient correction unit 60. Unit 60 adds each correction $\Delta\omega^n_{ij}$ to the corresponding coefficient in the storage unit 50 and writes the result back.

The correction amount calculation unit 40 repeats this correction until either of the following holds:

- the corrections $\Delta\omega^n_{ij}$ for all connection coefficients fall below a predetermined threshold;
- the number of corrections exceeds a predetermined limit.
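The repeat-until loop of units 40 and 60 can be sketched independently of the network. The loop stops when every correction is below a tolerance or after a fixed number of passes. `grad_fn`, `eps`, `tol`, and `max_iter` are hypothetical names. A toy quadratic stands in for the network error E.

```python
def train(params, grad_fn, eps=0.1, tol=1e-6, max_iter=10_000):
    """Repeat Eq. (17) until every |correction| < tol or max_iter passes are done.

    grad_fn(params) -> list of dE/dw. Returns (params, number of passes used).
    """
    for it in range(1, max_iter + 1):
        corrections = [-eps * g for g in grad_fn(params)]        # role of unit 40
        params = [w + d for w, d in zip(params, corrections)]    # role of unit 60 (add and write back)
        if max(abs(d) for d in corrections) < tol:
            break
    return params, it

# Toy error E(w) = (w0 - 3)^2 + (w1 + 1)^2 in place of the network error.
w, iters = train([0.0, 0.0], lambda p: [2 * (p[0] - 3), 2 * (p[1] + 1)])
```

Both stopping conditions from the text appear. The iteration cap guards against slow convergence when ε is small.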

The lattice point calculation unit 70 uses the unknown utterance data sent from the pattern buffer unit 20 and the connection coefficients in the inter-unit connection coefficient storage unit 50. From these it calculates the lattice point data $\gamma^n(i,j)$ ($i = 1, \dots, I$; $j = 1, \dots, J$; $n = 1, \dots, N$) using Equation (23). The calculated lattice point data are stored in the lattice point storage unit 80.

The recurrence calculation unit 90 uses the lattice point data in unit 80 to compute the recurrence of Equation (24). It stores the cumulative values $g^n(I,J)$ in the working storage unit 100, which also holds $g^n(i,j)$ while the recurrence is being computed. The recognition decision unit 110 outputs, as the recognition result, the value of n that gives the largest of the cumulative values $g^n(I,J)$ stored in the working storage unit 100.

(Effects of the Invention) As described above, the present invention provides a neural network with a time-axis normalization capability. During recognition, variations in the utterance duration of unknown speech data are normalized by dynamic programming before being input to the neural network.

Because the network normalizes the time axis during recognition, learning only has to capture the utterance-to-utterance variation of the speech feature parameters. This can be learned from a small amount of learning data, which need not cover the variation due to differing utterance durations. A good recognition device can thus be provided.

[Brief description of drawings]

FIG. 1 is a block diagram showing an embodiment of the present invention. FIG. 2 is a diagram showing the hierarchical structure of a neural network. FIG. 3 shows an example of the time-axis correspondence rule on the (i, j) plane for the recurrence calculation.

Reference numerals in the figures:

- 10: analysis unit
- 20: pattern buffer unit
- 30: time-axis alignment unit
- 40: correction amount calculation unit
- 50: inter-unit connection coefficient storage unit
- 60: connection coefficient correction unit
- 70: lattice point calculation unit
- 80: lattice point storage unit
- 90: recurrence calculation unit
- 100: working storage unit
- 110: recognition decision unit

Claims

(57) [Claims]

1. A dynamic neural network for recognizing time-series patterns such as speech, wherein:

- the network has a hierarchical structure composed of an input layer, an output layer, and a plurality of intermediate layers;
- the input layer and the intermediate layers have a time-series structure corresponding to a time axis;
- during recognition, the time axis of an input time-series pattern is associated with the time axis of the input layer by dynamic programming so that the output of the neural network is maximized, and the output of the output layer at that association is taken as the recognition result;
- the network has a mechanism for supervised learning of the inter-unit connection coefficients between the layers, in which a learning time-series pattern normalized to a fixed duration equal to the length of the input-layer time axis is presented to the input layer, a corresponding teacher signal to be output is presented to the output layer, and the connection coefficients are determined so as to reduce the difference between the teacher signal and the actual output value at the output layer.

2. The dynamic neural network according to claim 1, wherein the variation of the duration of the learning time series pattern is normalized by DP matching with the representative pattern.

3. The dynamic neural network according to claim 1, wherein the supervised learning of the inter-unit coupling coefficient is realized by a backpropagation learning method.