JPH036769A

JPH036769A - Method and device for parallel simulation of neural network

Info

Publication number: JPH036769A
Application number: JP14255889A
Authority: JP
Inventors: Takumi Watanabe; 渡辺　琢美
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1989-06-05
Filing date: 1989-06-05
Publication date: 1991-01-14
Anticipated expiration: 2013-06-18
Also published as: JP2766858B2

Abstract

PURPOSE: To increase processing speed by executing the backpropagation algorithm with a high degree of parallelism. CONSTITUTION: Let n be the largest number of units in any layer. On a group of arithmetic elements that are arranged in an n × n two-dimensional lattice and can exchange data, the at most k × m weights between the i-th layer, consisting of k units, and the (i+1)-th layer, consisting of m units, are assigned so that the weights from the j-th unit of the i-th layer to all units of the (i+1)-th layer correspond in order to the j-th column of the arithmetic element group. The weights from the j-th unit of the (i+1)-th layer to all units of the (i+2)-th layer are likewise made to correspond in order to the j-th row of the arithmetic element group. Learning is then carried out in parallel by repeating data transfer and arithmetic operations in both the row and column directions. Learning thus achieves high parallelism in data transfer as well as in arithmetic, and the simulation runs at high speed.

Description

[Detailed description of the invention] [Industrial application field]

The present invention relates to a method, and an apparatus used therefor, for correcting the connections between units in a neural network learning algorithm, as used in pattern identification, speech recognition, and the like, in an extremely short time by parallel processing.

[Conventional technology]

First, backpropagation, a previously proposed learning algorithm for hierarchical networks, is briefly explained with reference to FIG. 1, taking for simplicity the case of one hidden layer and three units per layer. The same explanation applies when there are two or more hidden layers or four or more units per layer.

As shown in FIG. 1, the network has a hierarchical structure. The input, hidden, and output layers are connected unidirectionally, from the input layer to the hidden layer and then from the hidden layer to the output layer. There are no connections between units within a layer, and none in the direction from the output layer back toward the input layer. For details, see D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning Internal Representations by Error Propagation," in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1, pp. 318-362, MIT Press, Cambridge, Massachusetts, 1986.

The backpropagation algorithm is a learning algorithm that seeks a local minimum of the error function in a multilayer network. Data propagates from the input layer through the hidden layer to the output layer. In forward propagation, the output value of a unit in the l-th layer is obtained by applying a differentiable function (for example, the sigmoid function) to the weighted sum of the outputs of all units in the (l-1)-th layer connected to that unit. This processing is repeated for each layer. The input/output relationship of a unit in the l-th layer of a network consisting of L layers is:

$$u_i^{(l)} = \sum_j W_{ij}^{(l)} a_j^{(l-1)} \qquad (1)$$

$$a_i^{(l)} = f\left(u_i^{(l)}\right), \qquad 1 \le i \le N_l, \quad 1 \le l \le L \qquad (2)$$

(In the source text, equation (1) continues with an additive term after the sum that is not legible.)

In backward propagation, proceeding from the output layer toward the input layer, the error gradient is obtained layer by layer while the weighted sum of the errors in the preceding layer (the one closer to the output) is computed, and the weights are corrected so as to reduce the error. That is, the change in each weight when a given pattern is presented to the network is:

$$\Delta W_{ji} = \delta_j \, o_i \qquad (3)$$

Here, $o_i$ is the input value from unit i to unit j. The value of $\delta_j$ differs depending on whether unit j is an output unit or a hidden unit. When unit j is an output unit, $\delta_j$ is

$$\delta_j = (t_j - o_j)\, f'(net_j) \qquad (4)$$

Here, $t_j$ is the teacher signal (desired value), and $net_j$ is

$$net_j = \sum_i W_{ji} \, o_i \qquad (4)'$$

When unit j is a hidden unit, $\delta_j$ is

$$\delta_j = f'(net_j) \sum_k \delta_k W_{kj} \qquad (5)$$

The specific processing in the backpropagation algorithm is as follows.

(i) Forward propagation
(a) Pass the input value, or the output value of a unit in the preceding layer, to the corresponding weights.
(b) Compute the product of this value and each weight.
(c) Compute the weighted sum over the weights connected to the same unit in the next layer.
(d) Apply the function f to this sum.

(ii) Backward propagation
(a) Pass the error to the corresponding weights.
(b) Compute the product of the error and each weight.
(c) Compute the sum of these values from the units in the preceding layer (the one closer to the output).
(d) Compute the derivative of the function f.
(e) Correct the weights according to the error gradient.

The above processing is repeated until convergence. Conventionally, this processing has been performed on a sequential general-purpose computer. When adjacent layers have m and n units respectively, there are m × n connections between units, and learning requires many iterations. For a neural network with a large n, the processing therefore takes an enormous amount of time.

[Object of the present invention]

The present invention aims to speed up this processing by executing the backpropagation algorithm described above with a high degree of parallelism.
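As a point of reference for the sequential processing described above, the following is a minimal sketch of equations (1) to (5) in plain Python. It is an illustration only, not code from the patent: the function names, the learning rate `eta`, the use of the sigmoid as f, and the omission of any bias term are all assumptions. The nested loops over every weight show why the time per pattern grows with m × n on a sequential machine.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def forward(weights, x):
    """Equations (1)-(2): return the outputs of every layer, input layer first."""
    acts = [list(x)]
    for W in weights:                      # W[i][j]: weight from unit j to unit i
        prev = acts[-1]
        acts.append([sigmoid(sum(w * a for w, a in zip(row, prev))) for row in W])
    return acts

def train_step(weights, x, t, eta=0.5):
    """Present one pattern, update `weights` in place, return the squared error."""
    acts = forward(weights, x)
    out = acts[-1]
    # Equation (4); for the sigmoid, f'(net_j) = o_j * (1 - o_j).
    delta = [(tj - oj) * oj * (1.0 - oj) for tj, oj in zip(t, out)]
    for l in range(len(weights) - 1, -1, -1):
        W, below = weights[l], acts[l]
        if l > 0:
            # Equation (5) for the layer below, using the weights before the update.
            back = [sum(delta[k] * W[k][j] for k in range(len(W)))
                    for j in range(len(below))]
            next_delta = [b * a * (1.0 - a) for b, a in zip(back, below)]
        # Equation (3), scaled by the assumed learning rate eta.
        for j in range(len(W)):
            for i in range(len(below)):
                W[j][i] += eta * delta[j] * below[i]
        if l > 0:
            delta = next_delta
    return sum((tj - oj) ** 2 for tj, oj in zip(t, out))
```

Calling `train_step` repeatedly with the same pattern corresponds to the "repeat until convergence" loop; every call visits each of the m × n weights of every layer once in the forward pass and again in the backward pass.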

[Means of the present invention]

In the present invention, where n is the largest number of units in any layer, the weighted inputs from all units of the (i-1)-th layer to the units of the i-th layer are calculated simultaneously on a group of computing elements that are arranged in an n × n two-dimensional lattice and can exchange data. By repeating data transfer and calculation in the row or column direction, the sums of the input values to the units are calculated in parallel. By repeating the same processing, the corrections to the connections between units are calculated simultaneously on the computing elements so that the correct output is obtained for each input. To this end, the at most k × m weights between the i-th layer, consisting of k units, and the (i+1)-th layer, consisting of m units, are assigned as follows: the weights from the j-th unit of the i-th layer to all units of the (i+1)-th layer are mapped in order onto the j-th column of the computing element group, and the weights from the j-th unit of the (i+1)-th layer to all units of the (i+2)-th layer are mapped in order onto the j-th row of the computing element group. Learning is then performed in parallel.

This processing is explained below with a concrete example based, for simplicity, on the network model shown in FIG. 1. Even if the number of hidden layers or the number of units per layer is larger than in FIG. 1, processing according to the following explanation can be performed.

The initial values of all weights, the input values, and the teacher signals (desired values) are prepared in advance. These data are sent to the processors PE as shown in FIG. 2. Processing then proceeds according to the following steps.

(1) Forward propagation (FIG. 3A)
(i) Each processor PE multiplies the input value by the value of the weight between the input layer and the hidden layer.
(ii) The products obtained in (i) are, for example in each row, added in order starting from the right (or left) while being transferred to the left (or right), and the computed value of equation (1) is stored in the leftmost (or rightmost) column of processors PE.
(iii) In the leftmost (or rightmost) column of processors PE, the function f is applied to this value, and the result of equation (2) is broadcast to the right (or left) along each row.
(iv) As in (i), each processor PE multiplies the value obtained in (iii) by the value of the weight of the next layer.
(v) The products obtained in (iv) are, for example in each column, added in order starting from the top (or bottom) while being transferred downward (or upward), and the computed value of equation (1) is stored in the bottom (or top) row of processors PE.
(vi) In the bottom (or top) row of processors PE, the function f is applied to this value, and the result of equation (2) is broadcast upward (or downward) along each column.
(vii) By repeating the above processing, the output is obtained at the output layer, and the output value is broadcast downward (or upward) along each column.

(2) Backward propagation (FIG. 3B)
(i) Each processor PE calculates the value of equation (4). The processors PE within each column perform the same calculation.
(ii) Each processor PE calculates the value of equation (3) and updates the weight assigned to it.
(iii) For each row of processors PE, the value of equation (5) is obtained, for example by repeating addition in the row direction.
(iv) Each processor PE calculates the value of equation (3) and updates the weight assigned to it.
(v) Steps (iii) and (iv) are continued, with the transfer direction alternating between the row direction and the column direction, until the input layer is reached.

As described above, the present invention is characterized in that weights are assigned to the individual processors and data transfer and calculation are repeated in the row and column directions, so that learning is performed with a high degree of parallelism in data transfer as well as in calculation.
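The row and column steps above can be sketched for the three-layer case as a small simulation in plain Python. This is an illustrative model under stated assumptions, not the patented hardware: the grid is represented by two n × n lists, with `W1[r][c]` (input unit c to hidden unit r) and `W2[r][c]` (hidden unit r to output unit c) standing for the two weights held by PE(r, c). The names, the learning rate `eta`, the sigmoid as f, and the choice to take the sums for equation (5) before the weights are updated are all assumptions.

```python
import math

def f(u):
    return 1.0 / (1.0 + math.exp(-u))

def grid_forward(W1, W2, x):
    """Forward steps (i)-(vi) on an n x n grid; returns hidden and output values."""
    n = len(x)
    # (i) every PE multiplies; x[c] is the same for all PEs in column c.
    prod = [[W1[r][c] * x[c] for c in range(n)] for r in range(n)]
    # (ii)-(iii) row-direction accumulation, f at the row end, row broadcast.
    hid = [f(sum(prod[r])) for r in range(n)]
    # (iv) every PE multiplies; hid[r] is the same for all PEs in row r.
    prod = [[W2[r][c] * hid[r] for c in range(n)] for r in range(n)]
    # (v)-(vi) column-direction accumulation, f at the column end, column broadcast.
    out = [f(sum(prod[r][c] for r in range(n))) for c in range(n)]
    return hid, out

def grid_backward(W1, W2, x, hid, out, t, eta=0.5):
    """Backward steps (i)-(iv); updates W1 and W2 in place."""
    n = len(x)
    # (i) equation (4): identical in every PE of column c.
    d_out = [(t[c] - out[c]) * out[c] * (1.0 - out[c]) for c in range(n)]
    # (iii) equation (5): row-direction accumulation of delta * weight,
    # taken here before the update in (ii) (an assumption about ordering).
    d_hid = [hid[r] * (1.0 - hid[r]) * sum(d_out[c] * W2[r][c] for c in range(n))
             for r in range(n)]
    # (ii) and (iv) equation (3): each PE updates only the weights it holds.
    for r in range(n):
        for c in range(n):
            W2[r][c] += eta * d_out[c] * hid[r]
            W1[r][c] += eta * d_hid[r] * x[c]
```

In this model every list comprehension over `(r, c)` stands for work that all n × n processors would do at once, and each `sum` along a row or column stands for one accumulation pass, which is where the alternation between row and column directions comes from.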

[Example]

Next, an embodiment of the present invention is described with reference to FIG. 4. FIG. 4 shows an example configuration of the present invention, which has a preprocessing unit 1, an interface unit 2, an array unit 4, and a control unit 5. The preprocessing unit 1 controls the control unit 5, which in turn controls the array unit 4 and the interface unit 2. It also prepares the initial values of the weights, the patterns to be learned (input patterns), and the corresponding desired output signals (teacher signals). It is implemented as a sequential computer.

FIG. 5 shows an example configuration of the array unit 4 shown in FIG. 4. In this figure, PE denotes a processor and 6 denotes a control signal line.

FIG. 6 shows each processor PE of FIG. 5. In this figure, 301 to 304 are selection circuits, 305 is a register, 306 is an accumulator, 307 is an arithmetic unit, 308 is a register file, and 309 is a control register. The selection circuit 301 selects from which adjacent processor PE (above, below, left, or right) data is received when communicating with adjacent processors PE. The selection circuit 302 selects which data is stored in the register 305. The selection circuit 303 selects which data is output when communicating with an adjacent processor PE. If the output of the selection circuit 301 is selected here, the data from the adjacent processor PE is output as is, without being stored in a storage element such as the register 305. The selection circuit 303 can not only operate identically in all processors PE under the control signal 6 sent from the control unit 5 to all processors PE, but can also select the output signal individually in each processor PE according to the data stored in the control register 309 of that processor PE. The selection circuit 304 selects the data fed to one input port of the arithmetic unit 307.

In the array unit 4 shown in FIG. 5, communication between processors PE is performed by operating the register 305 of each processor PE like a shift register, so that all processors PE simultaneously shift their data to the adjacent processor PE above (or below, to the left, or to the right). In addition, by setting the control register 309 in each processor PE appropriately and controlling the selection circuit 303 accordingly, one processor PE can output the output of its arithmetic unit 307 or of its register 305 to an adjacent processor PE (this processor PE is called the sending processor PE), while another processor PE writes the data from another processor PE into its register 305 and at the same time outputs it via the selection circuit 303 (this processor PE is called the receiving processor PE). This function is called ripple transfer.

To operate the apparatus according to the present invention shown in FIG. 4, the preprocessing unit 1 creates the initial values of the weights between units, the patterns to be learned (input patterns), and the corresponding desired output signals (teacher signals), and sends them via the interface unit 2 to the array unit 4 so that the data are allocated to the processors PE as shown in FIG. 3. Because the teacher signals and the input patterns are identical within each column (or row) of processors PE, they are sent using the ripple transfer described above. Because the initial values of the weights differ from one processor PE to another, they are sent by ordinary shift transfer. In each processor PE, the data is stored from the register 305, via the arithmetic unit 307, at an appropriate address in the register file 308.

In accordance with the control signals from the preprocessing unit 1, the control unit 5 shown in FIG. 4 sequentially generates the instructions that control the interface unit 2 and the array unit 4 so as to carry out the subsequent processing.

First, the input pattern or the weight is read from the register file 308 and stored in the accumulator 306. The arithmetic unit 307 then computes the product of the input pattern and the weight, and the result is stored in the accumulator 306 and then in the register file 308. For each row (or column) of processors PE, these values are added in order using addition with the ripple transfer described above (ripple addition), and the result for each row (or column) is stored in the processor PE at the end of that row (or column). Ripple addition is carried out by selecting the output of the selection circuit 301, adding it in the arithmetic unit 307 to the data in the register file 308, and having the selection circuit 302 select the output of the selection circuit 301 so that the data from the adjacent processor PE is stored. In this way, the input data for all units of the next layer are obtained in parallel.

The weighted sum stored in the processor PE at the end of each row (or column) is broadcast by ripple transfer to the other processors PE in that row (or column), and each processor PE computes the value of the sigmoid function with this value as input. Alternatively, the processor PE at the end of each row (or column) may compute the sigmoid function first and then broadcast the result along the row (or column) by ripple transfer.

Next, the same processing is performed with the sigmoid function values as the input data for the units of the next layer. In this case, as described earlier, the data transfer direction is the column (or row) direction. The same processing is repeated as many times as the number of hidden layers requires. A detailed description of backward propagation is omitted, but it can be carried out in the same manner as described above.

When the numbers of units in the layers differ, the weights that correspond to no connection are controlled so that they are always 0, which makes the scheme applicable to any hierarchical network structure without feedback. When the number of units in a layer exceeds the number of processors PE along one side of the two-dimensional processor PE array, the array required by the problem is simply folded to a size that fits the physical array. That is, the folded data is stored along the depth direction of the register file in each processor PE, or of a local memory directly accessible from each processor PE, and is processed serially, one physical processor PE array at a time.

[Effects of the present invention]

As is clear from the above, according to the present invention, increasing the number of processors PE increases the degree of parallelism accordingly, so that the simulation of large-scale networks can be accelerated. In addition, the processor PE array unit, which performs the learning processing that accounts for most of the total processing time, consists of simple, identical processors PE regularly connected in two dimensions. It can therefore be implemented easily as an LSI, and for the same amount of hardware it can carry more processors than ordinary 32-bit processors would allow, which makes it well suited to the simulation of large-scale networks. Furthermore, the weight updates for each layer, the calculation of the weighted sums, and the calculation of the sigmoid function are all performed in parallel, so learning can be performed at very high speed.

The table compares the simulation speed obtained when the present invention was actually implemented with the simulation speed of the conventional algorithm on a general-purpose computer, for the case of 256 output neurons and 100 learning iterations. As is clear from the table, the present invention achieves a learning speed about 45 times that of simulation on a large general-purpose computer.

Furthermore, when the present invention is applied to character recognition, the weight values of a network that outputs its answer by selecting from among the already learned patterns, not only for learned character patterns but also for unknown patterns, can be obtained in an extremely short time.
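The ripple addition, ripple broadcast, and zero-weight padding described in the embodiment can be sketched in a few lines of Python. This is a software analogy under assumptions, not a model of the circuit in FIG. 6: each list index stands for one processor PE in a row, and the function names are hypothetical.

```python
def ripple_add_row(values):
    """Accumulate a row toward its left end, one neighbor hop at a time.

    values[c] is the product held by the PE in column c. The returned list
    holds the running sum as it arrives at each PE; element 0 is the row sum.
    """
    n = len(values)
    partial = [0.0] * n
    incoming = 0.0                          # nothing enters the rightmost PE
    for c in range(n - 1, -1, -1):          # right to left
        partial[c] = values[c] + incoming   # local value + value from neighbor
        incoming = partial[c]               # passed on to the left neighbor
    return partial

def ripple_broadcast_row(value, n):
    """Send `value` from the left-end PE to every PE in a row of n PEs."""
    reg = [None] * n
    reg[0] = value                          # sending PE
    for c in range(1, n):                   # receiving PEs: store and pass on
        reg[c] = reg[c - 1]
    return reg

def pad_weights(W, n):
    """Embed a smaller weight matrix in an n x n grid; unused PEs hold 0."""
    return [[W[r][c] if r < len(W) and c < len(W[r]) else 0.0
             for c in range(n)]
            for r in range(n)]
```

For example, `ripple_add_row([1.0, 2.0, 3.0])` yields `[6.0, 5.0, 3.0]`, so the left-end PE ends up with the full sum. Padding with zeros leaves the row and column sums unchanged, which is why layers of unequal size can share the same n × n grid.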

[Brief explanation of drawings]

FIG. 1 is a diagram showing a hierarchical network with a three-layer structure. FIG. 2 is a diagram showing the mapping of the various data onto the processors PE. FIG. 3A is a diagram showing the processing during forward propagation. FIG. 3B is a diagram showing the processing during backward propagation. FIG. 4 is a diagram showing an example configuration of the present invention. FIG. 5 is a diagram showing an example configuration of its array unit. FIG. 6 is a diagram showing an example configuration of its processor PE.

Claims

[Claims]

1. A neural network parallel simulation method for processing in parallel the learning (self-organization) of a neural network, the neural network comprising a plurality of units, each of whose output value is generated by applying a nonlinear differentiable function to the sum of a plurality of input values, and a hierarchical network that connects the units and propagates each output, multiplied by an appropriate weight, as an input value, the learning being performed by modifying the weights between units so that the output approaches the desired output value for each of a plurality of input patterns, wherein, when the maximum number of units in a layer is n, on a group of computing elements that are arranged in an n × n two-dimensional lattice and are capable of exchanging data at least between adjacent computing elements, the weighted inputs from all units of the (i-1)-th layer to the units of the i-th layer are calculated simultaneously, data transfer and calculation are repeated in the row or column direction so that the sums of the input values to the units are calculated in parallel, and the corrections of the connections between units are calculated simultaneously on the computing elements, the same processing being repeated so that the correct output is obtained for the input; and wherein, of the at most k × m weights between the i-th layer consisting of k units and the (i+1)-th layer consisting of m units, the weights from the j-th unit of the i-th layer to all units of the (i+1)-th layer are made to correspond in order to the j-th column of the computing element group, and the weights from the j-th unit of the (i+1)-th layer to all units of the (i+2)-th layer are made to correspond in order to the j-th row of the computing element group, whereby learning is performed in parallel.

2. A neural network parallel simulation device for processing in parallel the learning (self-organization) of a neural network, the neural network comprising a plurality of units, each of whose output value is generated by applying a nonlinear differentiable function to the sum of a plurality of input values, and a hierarchical network that connects the units and propagates each output, multiplied by an appropriate weight, as an input value, the learning being performed by modifying the weights between units so that the output approaches the desired output value for each of a plurality of input patterns, the device comprising: (a) a processor array in which computing elements, each having an arithmetic circuit capable of logical operations, addition/subtraction, and multiplication, are interconnected in a two-dimensional (n × n) arrangement, the processor array having a function of performing addition or data transfer while performing ripple transfer along each row or column, having bypass data transfer paths that directly connect computing elements separated by an appropriate distance, and being capable of broadcasting data to the computing elements in the row or column direction using ripple transfer; and (b) a control unit that controls the processor array; wherein (c) the at most k × m weights between the i-th layer consisting of k units and the (i+1)-th layer consisting of m units are allocated to the computing elements so that the sums of the input values to the units can be calculated in parallel by repeating data transfer and calculation in the row or column direction, the weights from the j-th unit of the i-th layer to all units of the (i+1)-th layer being allocated in order to the j-th column of the computing element group, and the weights from the j-th unit of the (i+1)-th layer to all units of the (i+2)-th layer being allocated in order to the j-th row of the computing element group; (d) the calculation of the weighted inputs between the units of the i-th layer and the units of the adjacent layer, and the correction of the connections, are performed simultaneously on all computing elements; and (e) learning is performed in parallel by repeating the above processing for the other input patterns.