JP2522406B2

JP2522406B2 - Fully connected network parallel processing method and device

Info

Publication number: JP2522406B2
Application number: JP1238616A
Authority: JP
Inventors: 正雄岩下
Original assignee: Nippon Electric Co Ltd
Current assignee: NEC Corp
Priority date: 1989-09-13
Filing date: 1989-09-13
Publication date: 1996-08-07
Anticipated expiration: 2011-08-07
Also published as: JPH03100755A

Description

[Detailed Description of the Invention] (Field of Industrial Application) The present invention relates to a neural network processing method and apparatus, and more particularly to a method and apparatus for dividing a three-layer back-propagation learning neural network among a plurality of processor modules and processing it in parallel.

(Prior Art) One conventional method of dividing neural network recognition and learning processing among n processor modules is to divide the networks connected to the intermediate layer and to the output layer into n parts each, assign the parts to the processor modules, and have each module compute its total sums in a single pass.

FIG. 5 is a diagram showing a conventional example of the allocation of processing to each processor.

In FIG. 5, the processing for the network drawn in bold lines is the processing performed by one processor module. In the same way, each processor module handles the processing of the network feeding into one intermediate output and into one final output. The final output obtained here is a total sum. The computed intermediate outputs, the weights of the network feeding the final outputs, and the final outputs are collected in a shared memory, transposed, and then distributed to each processor.

(Problems to be Solved by the Invention) However, the above method has the following drawback in terms of processing speed. When the networks connected to the intermediate layer and the output layer are each divided into n parts, as in the conventional method, the recognition process runs in parallel and each part is processed independently. During back-propagation learning, however, every processor module needs data for the entire network. The intermediate output values, final output values, and network weight values gathered from the processor modules into a single shared memory must therefore be transposed and redistributed to all processor modules. This increases the number of data transfers between each processor module and the shared memory and adds extra processing for the transposition, so a great deal of processing time was required.
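The dependency behind this drawback can be sketched as follows. This is a minimal illustration, assuming two processor modules and a hypothetical 2 × 4 output-layer weight matrix `B` in which each module owns the rows for its own output units. The names `owner_of_output`, `hidden_error_sum`, and `transpose` are illustrative, and the back-propagated sum uses the standard form rather than one quoted from the text.

```python
# Conventional split (hypothetical example): each module owns the weight rows
# B[i][:] feeding the output units i assigned to it.
B = [[1.0, 2.0, 3.0, 4.0],   # weights into output 0, held by module 0
     [5.0, 6.0, 7.0, 8.0]]   # weights into output 1, held by module 1
owner_of_output = {0: 0, 1: 1}   # output unit -> owning module

def hidden_error_sum(j, D):
    """Back-propagated sum for hidden unit j, plus the modules whose rows it read."""
    total, touched = 0.0, set()
    for i, row in enumerate(B):
        total += D[i] * row[j]           # needs column j, which spans every row
        touched.add(owner_of_output[i])
    return total, touched

def transpose(matrix):
    """The regrouping by hidden unit that the shared memory would have to perform."""
    return [list(column) for column in zip(*matrix)]
```

In this sketch the sum for every hidden unit reads weights held by both modules, which is why the rows are gathered, regrouped by column, and redistributed.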

An object of the present invention is to provide a method and apparatus that, when processing is divided among a plurality of processor modules, achieve high-speed processing by computing intermediate results in already-transposed form within each processor module, thereby reducing both the special processing needed for transposition and the number of data transfers.

(Means for Solving the Problems) Consider a three-layer neural network consisting of a input-layer elements, b intermediate-layer elements, and c output-layer elements, which is divided among n processor modules, each consisting of a processor and a local memory, that compute in parallel the intermediate-layer output values, the final output values, the errors from the teacher signal, and the update values of the network weights. Each of the n processor modules is assigned one n-th of the b intermediate-layer elements, together with the processing of the network connecting those elements to the a input-layer elements and the c output-layer elements. After each processor module computes its partial sums for the output layer, the partial sums from all processor modules are transferred in bulk to one shared memory, where the total sums are computed. The error obtained from the total sums and the teacher signal is then transferred in bulk to each processor module, and each processor module computes the weight values of the network connected to the n-way-divided intermediate layer. In this way, the recognition and learning processing of the neural network is performed without transposing the network weight values in the shared memory.
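The procedure above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the patented implementation: `f` is taken to be the logistic function, the output and hidden error terms use the standard back-propagation forms, the momentum term is omitted, and the names `Module` and `training_step` are hypothetical.

```python
import math

def f(x):
    return 1.0 / (1.0 + math.exp(-x))    # assumed logistic sigmoid

def g(x):
    return f(x) * (1.0 - f(x))           # derivative of the assumed sigmoid

class Module:
    """One processor module: owns some hidden units and every weight touching them."""
    def __init__(self, hidden_ids, A, B):
        self.a = {j: list(A[j]) for j in hidden_ids}                        # inputs -> hidden j
        self.b = {j: [B[i][j] for i in range(len(B))] for j in hidden_ids}  # hidden j -> outputs

    def forward(self, X):
        """Return this module's partial sums for every output unit."""
        self.s = {j: sum(w * x for w, x in zip(row, X)) for j, row in self.a.items()}
        self.y = {j: f(s) for j, s in self.s.items()}
        n_out = len(next(iter(self.b.values())))
        return [sum(self.b[j][i] * self.y[j] for j in self.b) for i in range(n_out)]

    def learn(self, X, D, p):
        """Update local weights from the output errors D; only local data is read."""
        for j in self.b:
            c = g(self.s[j]) * sum(d * w for d, w in zip(D, self.b[j]))  # uses old weights
            self.b[j] = [w + p * d * self.y[j] for w, d in zip(self.b[j], D)]
            self.a[j] = [w + p * c * x for w, x in zip(self.a[j], X)]

def training_step(modules, X, T, p=0.5):
    partials = [m.forward(X) for m in modules]               # runs in parallel in hardware
    totals = [sum(column) for column in zip(*partials)]      # summed in the shared memory
    Z = [f(t) for t in totals]
    D = [(t - z) * g(s) for t, z, s in zip(T, Z, totals)]    # assumed output error term
    for m in modules:
        m.learn(X, D, p)                                     # after D is sent to each module
    return Z
```

Because each module stores the outgoing weights of its own hidden units, the back-propagated sum in `learn` reads only local data, so nothing has to be regrouped in the shared memory. Only the partial sums and the output errors cross the bus.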

(Operation) The present invention comprises n processor modules, each consisting of a processor and a local memory; one shared memory; and a block transfer bus that performs high-speed bulk data transfer between the local memory of each processor module and the shared memory. The local memories and the shared memory are dual-port memories having a serial port and a parallel port. High-speed block transfers between them are carried out through the serial ports, while each processor accesses its local memory through the parallel port. Depending on the value of the data output by a processor, control selects either a write to or a read from the local memory through the parallel port, or a high-speed block transfer between the local memory and the shared memory through the serial port, so that data input, processing, and output can proceed efficiently in parallel.

Each processor module takes an equal share of the overall network processing, and the shares are executed in parallel. Because the division is made for each network connected to the n-way-divided intermediate layer, once the recognition process has finished, that is, once the intermediate-layer outputs and the partial sums for the final output layer have been computed, those results are collected in the shared memory, the total sums are computed, and the results are redistributed to the processor modules, which then update the network weights.

(Embodiment) Next, an embodiment of the present invention will be described with reference to the drawings. FIG. 2 is a diagram showing an embodiment of the present invention. In FIG. 2, the fully connected network parallel processing method of this embodiment is implemented with processor modules 1 and 2, a bus arbiter 3, and a shared image memory module 4 having a shared image memory 13 that serves as the single shared memory.

In this embodiment the number of processor modules is two, but the same applies in general to n modules.

Each of the processor modules 1 and 2 contains a local memory and a processor as a pair; reference numerals 11 and 12 denote the local memories, and 21 and 22 denote the processors. The local memories 11 and 12 are connected to the processors 21 and 22, respectively, through parallel ports, and to the single shared image memory 13 through serial ports.

The bus arbiter 3 has a bus arbitration circuit 23. The bus arbiter 3 arbitrates, according to a predetermined priority order, the block transfer requests issued by the processors 21 and 22 for data transfers between the shared image memory 13 and the local memories 11 and 12, and controls the bulk data transfer between the shared image memory 13 and the local memories 11 and 12. The bus arbitration circuit 23, the processors 21 and 22, and the control circuit 24 of the shared image memory 13 in the shared image memory module 4 are connected to one another by control lines 33. The control circuit 24 sends a read/write switching signal 43 to the shared image memory 13. When the read/write switching signal 43 is switched to the read side, data is read from the shared image memory 13. The shared image memory 13 and the local memories 11 and 12 are connected by a data bus 32.
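Arbitration by a predetermined priority order can be sketched as below. The text does not say how the priority is encoded, so this sketch assumes a fixed list of processor numbers with the highest priority first; the class `BusArbiter` and its method names are hypothetical.

```python
class BusArbiter:
    """Fixed-priority arbiter: grants the bus to one requester at a time."""

    def __init__(self, priority):
        self.priority = list(priority)   # processor numbers, highest priority first
        self.owner = None                # processor currently holding the bus

    def request(self, requesters):
        """Grant the bus to the highest-priority requester, or None if the bus is busy."""
        if self.owner is not None:
            return None
        for processor in self.priority:
            if processor in requesters:
                self.owner = processor
                return processor
        return None

    def release(self):
        """Free the bus after a block transfer completes."""
        self.owner = None
```

With `BusArbiter([21, 22])`, simultaneous requests from both processors are resolved in favor of processor 21, and processor 22 is served once the bus has been released.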

The processors 21 and 22, the local memories 11 and 12, and the shared image memory 13 are also connected to one another by an address bus 31.

As described above, this embodiment comprises the plurality of processors 21 and 22; the single shared image memory 13; the plurality of local memories 11 and 12, each connected to its processor 21 or 22 through a parallel port and to the single shared image memory 13 through a serial port; and the bus arbiter 3, which arbitrates, according to a predetermined priority order, the block transfer requests from the processors for data transfers between the shared image memory 13 and the local memories 11 and 12, and controls the bulk data transfer between them. In response to data transfer requests from the processors 21 and 22, bulk data transfers are performed between the shared image memory 13 and the local memories 11 and 12.

A more specific description follows, with reference also to FIG. 3. First, the operation is described with reference to FIG. 2.

When the processor 21 in the processor module 1 issues a data transfer request, it first generates a bus request signal to the bus arbitration circuit 23 via the control line 33. The bus arbitration circuit 23 checks the state of the bus and, if the bus is free, returns a bus grant signal to the requesting processor 21. On receiving the bus grant signal, the processor 21 sets a flag at the bit position corresponding to the number of each local memory to which data from the shared image memory 13 is to be transferred, and sends this local memory number designation information to the bus arbitration circuit 23 via the control line 33; based on this information, the data buses of the designated local memories are switched to the receive-enabled state. The processor that has been granted the bus then generates start addresses for the shared image memory 13 and for the destination local memories 11 and 12, and sends them to the shared image memory 13 and the local memories 11 and 12 via the address bus 31. The bus arbitration circuit 23 issues a read request to the control circuit 24 via the control line 33. The control circuit 24 switches the read/write switching signal 43 to read, accesses the shared image memory 13 at the address supplied over the address bus 31, and sets the data read out in the serial-port-side register inside the shared image memory 13. Once the register has been set, continuous block transfer between registers starts, synchronized to the serial clock, from the serial port register inside the shared image memory 13 to the serial port registers inside the local memories 11 and 12. As a result, exactly the same data is copied into the serial port registers of the local memories 11 and 12. After the transfer, the local memories 11 and 12 are each accessed using the address bus 31, and the data in their serial port registers is written into the memory sections of the local memories 11 and 12, which completes one cycle of operation. The bus arbitration circuit 23 then releases the bus and waits for the next request.
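The data movement in one such cycle can be sketched as follows. This is a simplified software model, assuming memories are Python lists and the local memory designation is an integer bit mask with bit k selecting local memory k; the function name `block_transfer` and its parameters are hypothetical, and bus timing is not modeled.

```python
def block_transfer(shared, start, length, local_memories, select_flags, local_start):
    """Copy shared[start:start+length] into every local memory whose flag bit is set."""
    if start + length > len(shared):
        raise ValueError("block exceeds shared memory")
    # Read the block into the shared-memory-side serial port register.
    serial_register = shared[start:start + length]
    for number, memory in enumerate(local_memories):
        if select_flags & (1 << number):          # bit position = local memory number
            if local_start + length > len(memory):
                raise ValueError("block exceeds local memory")
            # The same register contents are written into each selected local memory.
            memory[local_start:local_start + length] = serial_register
    return serial_register
```

Setting both flag bits copies the same block into both local memories in one transfer, which corresponds to the identical copies described above.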

FIG. 3 is a detailed block diagram of the processor 21 in FIG. 2.

The processor 21 consists of a memory interface circuit 51, data flow processors 52 to 57, and pipeline buses 61 to 67. As the data flow processors 52 to 57, for example, those described in Japanese Patent Application Laid-Open No. 58-70360 can be used.

Data on the pipeline buses 61 to 67 consists of an identification field, which carries the destination module number of the data and information indicating the type of processing, and a data value field, which carries an address or data. In a normal local memory access, each processor generates an address value and a data value for its local memory and performs a read or write operation. In FIGS. 2 and 3, reference numerals 71 and 72 denote address values and data values, and 41 and 42 denote read/write switching signals.

To process data stored in the shared image memory 13, the data is first transferred from the shared image memory 13 to the local memory 11, and the processing is then performed on the data in the local memory 11.

Data transfer from the shared image memory 13 to the local memory 11 proceeds as follows. One of the data flow processors 52 to 57 outputs an address value for the shared image memory 13 to the address bus 31 via the memory interface circuit 51. It then outputs the destination local memory address information to the bus arbitration circuit 23 via the control line 33 and issues a transfer request. When the bus arbitration circuit 23 has accepted the transfer request and the transfer has completed, the memory interface circuit 51 sends a transfer-complete notification back to the requesting processor. The processor that receives the transfer-complete notification then proceeds to normal local memory access. The memory interface circuit 51 decodes the identification field contained in the data sent from the processors 52 to 57 and selects whether the data field is to be interpreted as a memory address value, as a data value, as local memory address information, and so on.
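The decoding step can be sketched as below. The text does not give the bit layout of the identification field, so this sketch assumes a hypothetical 8-bit field with the destination module number in the upper four bits and a processing-type code in the lower four; the `Kind` values and the function `decode` are illustrative only.

```python
from enum import Enum

class Kind(Enum):
    """Hypothetical processing-type codes carried in the identification field."""
    ADDRESS = 0               # data field is a memory address value
    DATA = 1                  # data field is a data value
    LOCAL_MEMORY_SELECT = 2   # data field is local memory address information

def decode(token):
    """Split a (identification_field, data_value) pair into its parts."""
    identification, value = token
    module_number = identification >> 4      # assumed: upper four bits
    kind = Kind(identification & 0xF)        # assumed: lower four bits
    return module_number, kind, value
```

For example, `decode((0x12, 7))` yields module number 1, `Kind.LOCAL_MEMORY_SELECT`, and the value 7 under the assumed layout.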

FIG. 4 shows an outline of the processing of the present invention. As an example, FIG. 4 shows a three-layer model with four inputs and four outputs. The processing consists of a recognition process, which computes the sums of products of the given inputs and the weights of the networks connected to them, and a learning process, which computes the error between the resulting output values and the teacher signal and updates the network weights. The learning process uses the back-propagation method. The input X(j), the intermediate output Y(i), the output Z(i), and the teacher signal (training data) T(i) are each one-dimensional vectors.

Learning is performed by determining the weights A(i, j) and B(i, j) from the given input X(j) and training data T(i). Once all A(i, j) and B(i, j) have been obtained, they deterministically yield, for a given input X(j), an output Z(i) that nearly matches the training data T(i).

First, the learning process is described. As initial values, A(i, j) and B(i, j) are assigned random numbers between −1 and +1. At first the output does not match the training data, so there is an error. Learning consists of reducing this error little by little. To do so, the same training data is presented repeatedly and the weights are corrected each time. When the procedure converges, one of the possible solutions has been found.

The calculations in the recognition process and the learning process are shown below. In the following equations, the parameters are normally set to p = 0.5 to 1, q = 0 to 0.2, and r = 1 to 2.

Equations (1) and (2) give the intermediate output and the output, respectively.
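Equations (1) and (2) themselves are not reproduced in this text. One form consistent with the index conventions of equations (5) and (6), in which B(i,j) is paired with Y(j) and A(i,j) with X(j), is the following. It is a reconstruction under that assumption, and it leaves out the parameter r, whose place in the equations the text does not state.

$$Y(i) = f\Bigl(\sum_{j} A(i,j)\,X(j)\Bigr), \qquad Z(i) = f\Bigl(\sum_{j} B(i,j)\,Y(j)\Bigr)$$

Here A(i,j) is the weight from input j to intermediate element i, and B(i,j) is the weight from intermediate element j to output i.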

In the learning process, with the training data denoted T(i), the output error is T(i) − Z(i). From this error, the error correction amounts D(i) and C(i) are obtained, and from these the weight correction amounts ΔB(i,j) and ΔA(i,j) are obtained by

$$\Delta B(i,j) = p \, D(i) \, Y(j) + q \, \Delta B(i,j) \tag{5}$$

$$\Delta A(i,j) = p \, C(i) \, X(j) + q \, \Delta A(i,j) \tag{6}$$

These are added to B(i,j) and A(i,j), so that the new weights are calculated by

$$B(i,j) = B(i,j) + \Delta B(i,j) \tag{7}$$

$$A(i,j) = A(i,j) + \Delta A(i,j) \tag{8}$$

This weight correction is repeated until the output error T(i) − Z(i) becomes sufficiently small. In the recognition process, the weights obtained in this way are used to compute Y(i) and Z(i) from the input X(j) by equations (1) and (2). These calculations use the sigmoid function f(x) and its derivative g(x). f(x) is a positive-valued function that equals 0.5 at x = 0 and becomes 0 at x = ±∞.
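The update rule of equations (5) to (8) is a correction step plus a fraction q of the previous correction. A minimal sketch, assuming weights are stored as nested lists and using the hypothetical function name `update_weights`, with default values of p and q chosen from inside the ranges stated above:

```python
def update_weights(W, dW, err, act, p=0.5, q=0.1):
    """Apply equations (5)-(8) in place.

    W   : weight matrix, W[i][j]
    dW  : previous weight corrections, same shape as W
    err : error correction amounts, D(i) for B or C(i) for A
    act : activations feeding the weights, Y(j) for B or X(j) for A
    """
    for i in range(len(W)):
        for j in range(len(W[i])):
            dW[i][j] = p * err[i] * act[j] + q * dW[i][j]   # equation (5) or (6)
            W[i][j] += dW[i][j]                             # equation (7) or (8)
    return W, dW
```

The same function serves both layers: call it with `(B, dB, D, Y)` for the output-side weights and with `(A, dA, C, X)` for the input-side weights.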

$$g(x) = f(x)\,\bigl(1 - f(x)\bigr) \tag{10}$$

FIG. 1 is a diagram showing the allocation of processing to each processor.

In FIG. 1, the processing for the network drawn in bold lines is the processing performed by one processor module. In the same way, each processor module handles, for each intermediate output assigned to it, the processing of the network connected to that output. The output results obtained here are partial sums; the partial sums are collected in the shared memory by block transfer, and the total sums are taken there. After the total sums have been distributed to each processor by block transfer, each processor computes the weight correction values.

(Effects of the Invention) As described above, according to the present invention, not only is data transposition unnecessary, but processing is also faster because the data transfers that transposition would require are not performed.

[Brief description of drawings]

FIG. 1 is a diagram showing the allocation of processing to each processor; FIG. 2 is a block diagram showing an embodiment of the present invention; FIG. 3 is a detailed block diagram of the processor section in FIG. 2; FIG. 4 is a diagram showing an outline of the processing of the present invention; and FIG. 5 is a diagram showing a conventional example of the allocation of processing to each processor. In the figures: 1, 2: processor modules; 3: bus arbiter; 4: shared image memory module; 11, 12: local memories; 13: shared image memory; 21, 22: processors; 23: bus arbitration circuit; 24: control circuit; 51: memory interface circuit; 52 to 57: data flow processors; 61 to 67: pipeline buses.

Claims

(57) [Claims]

1. A fully connected network parallel processing method, characterized in that, when recognition and learning processing by a three-layer neural network consisting of an input layer, an intermediate layer, and an output layer, each comprising a plurality of elements, is divided among a plurality of processor modules each consisting of a processor and a local memory and is processed in parallel: the plurality of elements of the intermediate layer are assigned evenly to the plurality of processor modules, each of which takes charge of the processing of the networks of the input layer and the output layer connected to its intermediate-layer elements; after each processor module obtains partial sums for the output layer, the partial sums from the processor modules are collected in one shared memory and the total sums are obtained; the total sums are then sent back to each processor module, and each processor module obtains the weight values of the network connected to the n-way-divided intermediate layer; whereby the recognition and learning processing of the neural network is performed without transposing the network weight values in the shared memory.

2. A fully connected network parallel processing apparatus comprising: n processor modules, each consisting of a processor and a local memory, that perform in parallel the operations for obtaining the intermediate-layer output values, the final output values, the errors from a teacher signal, and the network weight update values of a three-layer neural network; one shared memory; and a block transfer bus that transfers data in bulk between each of the processor modules and the shared memory; characterized in that, when the processing of the networks of the input layer and the output layer coupled to the respective elements of the intermediate layer is performed in parallel, partial sums for the output layer are obtained, the partial sums obtained by each processor module are block-transferred to the shared memory and the total sums are obtained, the total sums are then block-transferred to each processor module, and the error obtained from the total sums and the teacher signal is back-propagated, whereby recognition and learning processing in the neural network is performed.