JPH03105581A

JPH03105581A - Parallel data processing system

Info

Publication number: JPH03105581A
Application number: JP1243969A
Authority: JP
Inventors: Hideki Kato; 英樹加藤; Hideki Yoshizawa; 英樹吉沢; Hiromoto Ichiki; 宏基市來; Kazuo Asakawa; 浅川　和雄
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1989-09-20
Filing date: 1989-09-20
Publication date: 1991-05-02

Abstract

PURPOSE: To perform matrix operations or neurocomputer operations on analog signals by carrying out data transfer on a shift means, data transfer between trays and data processing units, and data processing in the data processing units synchronously with one another. CONSTITUTION: Data processing units 1 process data, and trays 2 transfer data and form a shift register 3, which performs a cyclic shift of the data. Because the units 1 are separated from the trays 2, which have the data holding function, the product of an (m×n) matrix and a vector of n elements is obtained with data transfer and data processing running simultaneously in parallel, regardless of whether the number (m) of units 1 equals the number (n) of trays 2. The inherent degree of parallelism is thus used to the full, and a good number-of-units effect is obtained even when computing the product of a rectangular matrix and a vector. As a result, matrix operations or neurocomputer operations are performed on analog signals.

Description

[Detailed Description of the Invention]

[Summary]

The invention concerns a parallel data processing system that processes data by using a plurality of data processing units synchronously. Its object is to reduce the overhead caused by data transfer with a hardware configuration comparable to that of the ring systolic array system or the common-bus-coupled SIMD (Single Instruction Multi Data) system and, in particular, to make full use of the inherent degree of parallelism and obtain a good number-of-units effect even for processing such as computing the product of a rectangular matrix and a vector, thereby performing matrix operations or neurocomputer operations on analog signals. The system comprises a plurality of data processing units each having at least one input; a plurality of trays each having a first input and an output and each performing data holding and data transfer, all or some of the trays each having a second output connected to the first input of a data processing unit; and shift means formed by connecting the first inputs and outputs of adjacent trays. Data transfer on the shift means, data transfer between the trays and the data processing units, and data processing by the data processing units are performed synchronously, so that matrix operations or neurocomputer operations are performed on analog signals.

[Industrial application field]

The present invention relates to a parallel data processing system, and more particularly to a parallel data processing system that processes data by using a plurality of data processing units synchronously.

In recent years, as the fields of application of data processing have expanded in systems such as electronic computers and digital signal processing devices, the amount of data to be processed has become enormous. In fields such as image processing and speech processing in particular, high-speed data processing is required. It is therefore important to exploit the parallelism of data processing, in which a plurality of data processing units are used synchronously to process data.

In general, an important concept in processing with multiple processing units is the number-of-units effect. This means that the data processing speed improves in proportion to the number of data processing units provided. In a parallel processing system it is very important to obtain a good number-of-units effect.

Apart from the limit imposed by the degree of parallelism of the problem itself, the main reason the number-of-units effect deteriorates is that the time required for the data transfer accompanying data processing is added to the time required for the data processing proper, which lengthens the total processing time. Making full use of the capacity of the data transmission paths is therefore effective for improving the number-of-units effect, but this is quite difficult. When the processing is regular, however, this regularity can be exploited to improve the number-of-units effect.

In a systolic array, data is circulated, and an operation is performed at the point where two pieces of data meet in the flow. The systolic array system is a parallel processing scheme that exploits the regularity of the processing. Among such schemes, the one-dimensional systolic array called the ring systolic array system is a parallel data processing system that processes systolic data by using a plurality of data processing units synchronously, and it is relatively easy to implement.

Examples of regular processing include matrix operations based on the inner product of vectors, and parallel processing in which the result of a neural network product-sum operation is output through a nonlinear function.

[Conventional technology]

FIG. 11(A) shows the principle of a conventional common-bus-coupled parallel system. In the figure, 91 is a processor element, 4 is a memory, 93 is a common bus, 92 is a bus connected to the common bus, and 94 is an internal bus that connects each processor element to its corresponding memory 4. In this common-bus-coupled parallel system, communication between processor elements (hereinafter PEs) takes place over the common bus 93. Since only one piece of data can be placed on the common bus in any given time slot, communication over the common bus must be synchronized across the entire bus.

FIG. 11(B) is a flowchart of the matrix-vector product computed by this common-bus-coupled parallel system. Each PE multiplies data X from another PE by Y in its internal register and adds the product to Y. Accordingly, as shown in the flowchart, the i-th PE first sets the content of its internal register, that is, the value of Yi, to 0. The following is then repeated n times: when Xj is placed on the common bus 93, the i-th PE multiplies the input from the bus 92 connected to the common bus by the input supplied from the memory 4 over the internal bus 94, and adds the product to Yi.
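The flow of FIG. 11(B) can be sketched roughly as follows. This is only an illustrative software model, not the circuit described in the patent: the function name `bus_matvec` and the list-of-lists layout of the coefficients are assumptions made for the example, and the inner loop over PEs stands in for work that the hardware performs in parallel.

```python
def bus_matvec(W, x):
    """Model of the common-bus flow: one vector element is broadcast per time slot.

    W is a hypothetical m-by-n coefficient table (row i lives in the memory of PE i),
    x is the vector whose elements are placed on the bus one at a time.
    """
    m, n = len(W), len(x)
    y = [0] * m                      # every PE first clears its register Yi
    for j in range(n):               # n bus slots; the bus carries one datum per slot
        bus = x[j]                   # the single value currently on the common bus
        for i in range(m):           # in hardware all PEs do this step at once
            y[i] += W[i][j] * bus    # coefficient from local memory times bus value
    return y


if __name__ == "__main__":
    print(bus_matvec([[1, 2, 3], [4, 5, 6]], [1, 0, 2]))
```

The outer loop makes the limitation visible: however many PEs there are, the bus delivers only one element per slot, so every transfer is serialized.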

FIG. 12(A) illustrates the principle of the conventional ring systolic system. In the figure, 20 is a processor element (PE). The PEs are connected by a circular bus 22, and 21 is a memory that stores the coefficients Wij. W11, W12, ..., W33 and so on are elements of the coefficient matrix; in general, Wij is the (i, j) component of the matrix. Multiplying this coefficient matrix W by the vector x = (X1, X2, X3) with the ring systolic system proceeds as follows.

FIG. 12(B) shows the internal structure of the i-th processor element. In the figure, 23 is a multiplier, 24 is an adder, 25 is an accumulator (ACC), and 21 is a group of registers that stores the coefficient elements Wij. This register group is a so-called FIFO, shown in the state where Wij, the element in the j-th column, is about to be output as the coefficient for the i-th row of the coefficient matrix. On the clock after it is output, the element circulates and is input again at the rear of the FIFO via the bus 22. As shown in the figure, therefore, Wi1, ..., Wi,j-1 have already circulated and are stored at the rear.

Each element of the vector, on the other hand, is input via the bus 22. At present, element Xj is being input. The accumulator 25 already holds the partial inner product Wi1×X1 + ... + Wi,j-1×Xj-1. This value is now output from the accumulator 25 and applied to one input of the adder 24.

Xj from outside and Wij output from the FIFO are multiplied by the multiplier 23. The result is applied to the other input of the adder 24, where it is added to the current content of the accumulator 25, and the sum is stored in the same accumulator 25 at the next clock. By repeating this, the inner product of the i-th row vector of the coefficient matrix W and the externally supplied vector X is computed. The switch selects whether data Xi is passed through to the outside or is taken in and set in the accumulator 25. When the matrix-vector product is computed with such PEs, as shown in FIG. 12(A), PE-1 first multiplies W11 by X1. In the next clock period X2 flows in from PE-2 on the right and W12 is output from the memory 21, so W12×X2 is computed. Similarly, at the next clock the product of W13 and X3 is computed, so that the product of the first row of the coefficient matrix and the vector x is obtained in PE-1. The product of the second row and the vector is computed in PE-2.

That is, W22 is multiplied by X2, in the next clock period W23 is multiplied by X3, and in the clock period after that W21 is multiplied by X1, which has returned cyclically. Similarly, the product of the third row and the vector is obtained by multiplying W33 by X3, W31 by the circulated X1, and W32 by the circulated X2, and accumulating the results as an inner product. In this operation, therefore, the products W11×X1, W22×X2, and W33×X3 can be computed at the same time. As the figure shows, however, achieving this simultaneity requires the elements of the coefficient matrix to be stored in a skewed order. In such a ring systolic array system, data transfer between the PEs and data processing in each PE are executed synchronously, so the data transfer paths are used effectively and a good number-of-units effect can be obtained.
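The skewed coefficient order and the circulating vector can be illustrated with a small software model. It is a sketch under assumptions, not the patent's circuit: the name `ring_systolic_matvec`, the restriction to a square n-by-n matrix, and the use of a list rotation to stand in for the circular bus 22 are all choices made for the example.

```python
def ring_systolic_matvec(W, x):
    """Model of the ring systolic scheme for a square n-by-n coefficient matrix W."""
    n = len(x)
    # Skewed storage: PE i emits W[i][i] first, then W[i][i+1], wrapping around,
    # so that each coefficient meets the vector element arriving at that clock.
    fifo = [[W[i][(i + t) % n] for t in range(n)] for i in range(n)]
    ring = list(x)                  # PE i initially holds x[i]
    acc = [0] * n                   # accumulator (ACC) of each PE
    for t in range(n):              # one multiply-accumulate per clock period
        for i in range(n):          # all PEs operate at once in hardware
            acc[i] += fifo[i][t] * ring[i]
        ring = ring[1:] + ring[:1]  # every element moves to the neighbouring PE
    return acc


if __name__ == "__main__":
    print(ring_systolic_matvec([[1, 2, 3], [4, 5, 6], [7, 8, 9]], [1, 2, 3]))
```

For the 3-by-3 case this reproduces the order given in the text: PE-2 multiplies W22 by X2, then W23 by X3, then W21 by the returning X1.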

FIG. 12(C) shows the ring systolic configuration of FIG. 12(A) combined in multiple stages. This configuration makes it possible to compute successive matrix-vector products. Because processing in such a systolic array system is regular, the capacity of the data transmission paths can be used to the full, and the number-of-units effect can therefore be improved.

[Problem to be solved by the invention]

In the conventional common-bus-coupled parallel system shown in FIG. 11(A), the processing elements (PEs) are coupled by a common bus, so only one piece of data can be transferred at a time. Coupling by a common bus also requires synchronization across the entire bus. In the conventional common-bus-coupled parallel system, therefore, only a few kinds of processing achieve a good number-of-units effect. Furthermore, as the number of coupled PEs increases, the common bus becomes longer and synchronization across the entire bus becomes difficult, so the system is not suited to large-scale parallelism.

In the conventional ring systolic array system shown in FIG. 12, a good number-of-units effect can be obtained by executing data transfer between the PEs and data processing in the PEs synchronously, but the timing of data transfer between the PEs and of data processing in each PE must be matched. Moreover, when the optimal number of data processing units differs from the optimal number of data holding units, as when computing the product of a rectangular matrix and a vector, this system requires PEs that take no part in the actual data processing. In other words, many PEs sit idle, and the number-of-units effect deteriorates. Put differently, the problems that can be solved efficiently are tied rigidly to the circuit configuration, and the number-of-units effect deteriorates when the size of the problem differs from the optimal value. Conversely, the problems for which a good number-of-units effect is obtained are narrowly restricted, so the system cannot be applied to a wide range of processing and lacks flexibility and versatility. As a result, it is difficult to realize a high-speed data processing system applicable to a reasonably wide range of processing.

The object of the present invention is to reduce the overhead caused by data transfer with a hardware configuration comparable to that of the ring systolic array system or the common-bus-coupled SIMD (Single Instruction Multi Data) system and, in particular, to make full use of the inherent degree of parallelism and obtain a good number-of-units effect even for processing such as computing the product of a rectangular matrix and a vector, thereby performing matrix operations or neurocomputer operations on analog signals.

[Means for Solving the Problems]

FIG. 1 illustrates the principle of the present invention. In the figure, 1 is a data processing unit, 2 is a tray that holds and transfers data, 3 is a shift register formed by interconnecting the trays, 11 is the first input of a data processing unit, 12 is the second input of a data processing unit, 21 is the first input of a tray, 22 is the first output of a tray, and 23 is the second output of a tray 2.

The data processing units 1 process data, and the trays 2 perform the transfer operation, forming the shift register 3 that shifts data cyclically. In the present invention, when the product of an m×n matrix A and a vector X of n elements is computed, the product can be obtained with m data processing units and n trays in a processing time proportional to n, whether the number of rows m of matrix A is smaller or larger than the number of columns n. A good number-of-units effect can therefore be obtained. That is, as shown in FIG. 1(A), the configuration consists of n trays 2 and m data processing units 1, each unit having two inputs, a function for multiplying those inputs, and a function for accumulating the products, in other words a unit that performs an inner product operation. With Y denoting the accumulation register in a unit, the data processing unit multiplies the input from 11 by the input from 12 and adds the product to Y, after which the elements of vector X are shifted between adjacent trays in the shift register 3. By repeating this operation n times, the multiplication of the m×n matrix A by the n-dimensional vector is carried out with m data processing units in a processing time proportional to n. Unlike the conventional systems, the present invention separates the data processing units 1 from the trays 2 that have the data holding function, so that a good number-of-units effect is obtained even when m and n differ, without any extra processing to match timing. In addition, data transfer between the trays 2 and data processing by the data processing units 1 are performed simultaneously in parallel. Since the data transfer time can generally be expected to be shorter than the time a data processing unit needs for data processing, the transfer time is hidden behind the processing time and becomes effectively zero, which shortens the overall processing time. In this way, matrix operations or neurocomputer operations are performed on analog signals.
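The multiply, accumulate, and shift cycle described above can be sketched as a short software model. This is an illustration under assumptions rather than the patented hardware: the name `tray_matvec` is hypothetical, the model covers only the case m ≤ n, and it assumes that unit i is attached to tray i and that its memory supplies row i starting from the diagonal element, as in the walkthrough of FIG. 1(C).

```python
def tray_matvec(A, x):
    """Model of m data processing units attached to the first m of n trays (m <= n)."""
    m, n = len(A), len(x)
    if m > n:
        raise ValueError("this sketch only covers m <= n")
    trays = list(x)                    # tray i initially holds x[i]
    y = [0] * m                        # accumulation register Y of each unit
    for t in range(n):                 # n multiply-accumulate steps
        for i in range(m):             # all units operate at once in hardware
            j = (i + t) % n            # column supplied to unit i at this step
            y[i] += A[i][j] * trays[i] # second input times first input, added to Y
        trays = trays[1:] + trays[:1]  # cyclic shift of the shift register
    return y


if __name__ == "__main__":
    print(tray_matvec([[1, 2, 3], [4, 5, 6]], [1, 0, 2]))
```

Only the m units exist as arithmetic hardware; the remaining n − m positions are plain trays, which is why no processing element sits idle when the matrix is rectangular.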

[Function]

Because the data processing units are separated from the trays that have the data holding function, the product of an m×n matrix A and a vector X of n elements can be computed with data transfer and data processing running simultaneously in parallel, whether or not the number m of data processing units equals the number n of trays.

[Embodiments]

Embodiments of the present invention are described below with reference to the drawings.

FIG. 1(B) is a flowchart of the operation of the system whose principle is shown in FIG. 1(A). As shown in FIG. 1(A), the present invention separates the data processing units 1 from the trays 2 that have the data holding function, and connects adjacent trays in a ring to form a systolic system. With m data processing units and n trays, the product of an m×n matrix A and a vector X of n elements is computed by the operation shown in the flowchart of FIG. 1(B).

Xi is set in the i-th tray 2. The value of Yi is set to 0, that is, the accumulation register of the i-th data processing unit is initialized. The i-th processing unit 1-i multiplies the input from 11-i by the input from 12-i and adds the product to the accumulator Yi. The shift register 3 is then shifted. This multiply-accumulate and shift operation is repeated n times.

This process forms the product of the rectangular matrix A and the vector X. Data transfer between the trays and data processing in the data processing units proceed simultaneously in parallel.

FIG. 1(C) is a conceptual diagram of the operation of the system of the present invention.

In the figure, the data X1 to Xn in the trays 2 are the elements of the vector X, and there are n of them.

There are m data processing units, each with an accumulator: Y1, Y2, ..., Ym. The m×n rectangular matrix has m×n elements, from A11 to Amn. Data processing unit 1-1 receives the first row of the coefficient matrix, A11, A12, ..., A1n, synchronously from the input bus 12-1. Data processing unit 1-2 is given A22, A23, ..., A21 in order, one at each timing of the systolic operation. Data processing unit 1-m is given Amm, Am,m+1, ..., Am,m-1 synchronously.

FIG. 1(D) is a timing chart of the operation of FIG. 1(C). The diagrams for times T1 to Tn in FIG. 1(C) correspond to the times T1, T2, ..., Tn in FIG. 1(D).

At time T1, as shown in FIG. 1(C), the trays 2-1, 2-2, ..., 2-n hold X1, X2, ..., Xn, and the units 1-1, 1-2, ..., 1-m receive the coefficient matrix elements A11, A22, ..., Amm respectively. At this timing, data processing unit 1-1 computes the product of A11 and the data X1 in tray 2-1, unit 1-2 computes the product of X2 in its corresponding tray 2-2 and A22 supplied from the memory, and likewise unit 1-m computes the product of Amm and Xm in tray 2-m. This takes place at timing T1 of FIG. 1(D). That is, at the synchronous clock for the product-sum operation, bus 11-1 carries X1 and bus 12-1 carries A11; bus 11-2 carries X2 and bus 12-2 carries A22; bus 11-3 carries X3 and bus 12-3 carries A33; and bus 11-m carries Xm and bus 12-m carries Amm. The multiply-accumulate operation is therefore performed as shown in the T1 diagram of FIG. 1(C). Since the accumulator Y holds 0 at this point, the product is simply added to 0.

When the product-sum operation ends, the shift operation begins. As shown in FIG. 1(D), the interval between T1 and T2 is the shift operation, in which data is shifted between adjacent trays; in this case it is a left shift.

Operation then moves to timing T2 of FIG. 1(C), which in FIG. 1(D) is the product-sum period T2. Because of the shift, tray 2-1 now holds X2, tray 2-2 holds X3, and tray 2-m holds Xm+1, while the coefficient matrix elements A12, A23, ..., Am,m+1 are input to units 1-1, 1-2, ..., 1-m respectively. The data on the buses at this point is shown at timing T2 of FIG. 1(D). At timing T2, the product of A12 and X2 is formed and added to the previous value of the accumulator Y. Unit 1-1 therefore adds the product of A11 and X1 obtained at T1 to the product of A12 and X2 obtained at T2 and stores the sum in its accumulator. Similarly, unit 1-2 stores A22×X2 + A23×X3 in its accumulator, and the same applies to unit 1-m.

The trays then shift again and operation moves to timing T3. Tray 2-1 holds X3, tray 2-2 holds X4, tray 2-m holds Xm+2, and tray 2-n holds X2, and the multiply-accumulate operation shown in the T3 diagram of FIG. 1(C) is executed.
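The pairing of coefficients and vector elements at T1, T2, T3 follows a simple modular pattern, which the short sketch below tabulates. The function name `schedule` and the 1-based (row, column) tuples are conventions chosen for this illustration only; the pattern itself is inferred from the walkthrough above, where unit i starts on the diagonal element and advances one column per step with wraparound.

```python
def schedule(m, n, steps):
    """For each step t (0-based, so t = 0 is T1), list the 1-based (row, column)
    index of the coefficient that each of the m units multiplies; the column is
    also the index of the X element sitting in that unit's tray."""
    table = []
    for t in range(steps):
        # unit i (0-based) sees column (i + t) mod n, reported here as 1-based
        table.append([(i + 1, (i + t) % n + 1) for i in range(m)])
    return table


if __name__ == "__main__":
    # Two units, four trays: print the pairs used at T1, T2 and T3.
    for t, row in enumerate(schedule(2, 4, 3), start=1):
        print(f"T{t}:", row)
```

With m = 2 and n = 4, the second entry of the table pairs A12 with X2 and A23 with X3, matching the description of timing T2.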

For time period T3, the timing chart of FIG. 1(D) shows the symbols of the inputs that enter the data processing units. The operation proceeds in this way up to time period Tn, where, as shown in the Tn diagram of FIG. 1(C), A1n×Xn is added to the previous accumulator value. In the data processing unit corresponding to tray 2-1, the products A11×X1 obtained at T1, A12×X2 obtained at T2, A13×X3 obtained at T3, and so on have been summed, so the accumulator Y holds the inner product result up to Tn-1. Adding A1n×Xn to it completes the inner product of the first row of matrix A and the vector X. Likewise, for tray 2-2 the inner product of the second row vector of matrix A and the vector X is computed in n clock periods, and the inner product of the m-th row vector and the vector X is computed in data processing unit 1-m.

By processing in this time sequence, the multiplication of an m×n rectangular matrix by an n-dimensional vector is carried out with m data processing units in a processing time proportional to n, and a good number-of-units effect is obtained. The key point is that the data processing units, which process data, are separated from the trays, which have the data holding function, and their numbers are matched to the rows and columns of the rectangular matrix, so that synchronous time-sequenced operation is possible even when the two dimensions differ. When n is smaller than m, m trays 2 are used; the processing time becomes longer, that is, proportional to m, but processing with a good number-of-units effect is still possible.
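The case n < m can be modelled in the same style. The text states only that m trays are used and that the time becomes proportional to m; how the extra trays are filled is not specified here, so the sketch below assumes, purely for illustration, that they carry zeros and that no coefficient is supplied while a padding tray is under a unit. The name `tray_matvec_padded` is hypothetical.

```python
def tray_matvec_padded(A, x):
    """Model of the tray scheme with max(m, n) trays, so that n < m is also covered."""
    m, n = len(A), len(x)
    k = max(m, n)                        # number of trays actually used
    trays = list(x) + [0] * (k - n)      # assumption: extra trays hold zeros
    y = [0] * m
    for t in range(k):                   # processing time proportional to max(m, n)
        for i in range(m):
            j = (i + t) % k              # index of the element now in tray i
            a = A[i][j] if j < n else 0  # assumption: padding positions contribute 0
            y[i] += a * trays[i]
        trays = trays[1:] + trays[:1]    # cyclic shift
    return y


if __name__ == "__main__":
    print(tray_matvec_padded([[1, 2], [3, 4], [5, 6]], [1, 1]))
```

Every one of the m units still finishes its inner product within k steps, at the cost of k − n steps in which a given unit has nothing useful to multiply.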

FIG. 2(A) is a detailed block diagram of the configuration of FIG. 1, which computes the product Y (m elements) of an m×n matrix A (n ≥ m ≥ 1) and a vector X of n elements. In the figure, parts identical to those shown in FIG. 1 carry the same reference symbols. 1a is the processing device of a data processing unit 1, implemented for example by a digital signal processor. 2a is the data holding circuit of a tray 2, implemented for example by a latch circuit; 2b is the data transfer circuit of the tray 2, implemented for example by a bus driver; and 2c is the control means of the tray 2, implemented for example by a logic circuit. 4 is a storage device that forms part of the means for supplying data to the data processing unit 1 and, at the same time, part of the means for controlling the data processing unit 1; it is implemented for example by RAM (random access memory). 5 is the means for synchronizing the operation of the data processing units 1 and the trays 2; 5a is a clock generation circuit, implemented for example by a crystal oscillator circuit, and 5b is a clock distribution circuit, implemented for example by a buffer circuit.

The operation of this embodiment is almost the same as the operation described with reference to the principle diagram of the present invention.

FIG. 2(B) is an operation flowchart of the system of the present invention shown in FIG. 2(A). As shown in FIG. 2(A), the present invention separates the data processing units 1 from the trays 2, which have the data holding function, and connects adjacent trays to one another in a ring, thereby forming a systolic system. With m data processing units and n trays, the product of an m×n matrix A and a vector X of n elements is obtained by the operation shown in the flowchart of FIG. 2(B). X_i is set in tray 2_i. Y_i is set to 0; that is, the accumulation register in the i-th data processing unit is initialized. The i-th processing unit 1_i multiplies the input from 11_i by the input from 12_i and adds the product to the accumulator Y_i. Shift register 3 is then shifted. This multiply-accumulate and shift operation is repeated n times, which forms the product of the rectangular matrix A and the vector X.
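The flowchart steps can be sketched in software as follows. This is only an illustrative simulation under stated assumptions, not the hardware described here: the function name ring_matvec, the use of Python lists for the trays, and 0-based indexing are all hypothetical choices made for the example.

```python
def ring_matvec(A, x):
    """Simulate m units reading from a ring of n trays (assumes n >= m)."""
    m, n = len(A), len(x)
    assert m <= n and all(len(row) == n for row in A)
    trays = list(x)        # tray i starts with X_i
    y = [0] * m            # accumulators Y_i are initialized to 0
    for t in range(n):     # n multiply-accumulate and shift steps
        for i in range(m):            # all units act in the same clock period
            j = (i + t) % n           # column index of the element now in tray i
            y[i] += A[i][j] * trays[i]
        trays = trays[1:] + trays[:1]  # cyclic left shift of the tray ring
    return y


if __name__ == "__main__":
    print(ring_matvec([[1, 2, 3], [4, 5, 6]], [1, 0, -1]))
```

Each unit only ever reads the tray directly attached to it; the rotation of the ring is what brings every X_j past every unit within n steps.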

In this case, the data transfer between trays and the data processing in the data processing units proceed simultaneously and in parallel.

FIG. 2(C) is a conceptual diagram of the operation of the system of the present invention.

In the figure, the data X_1 through X_n in the trays 2 are the elements of the vector X, n in number.

There are m data processing units, each with an accumulator: Y_1, Y_2, ..., Y_m. The m×n rectangular matrix has m×n elements, A_11 through A_mn. Data processing unit 1_1 receives the first row of the coefficient matrix, A_11, A_12, ..., A_1n, synchronously from input bus 12_1. Data processing unit 1_2 is given A_22, A_23, ..., A_21 in order at each timing of the systolic operation. Likewise, data processing unit 1_m is given A_mm, A_{m,m+1}, ..., A_{m,m-1} synchronously.
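The rotated order in which each unit receives its row can be written down compactly. The helper below is a hypothetical illustration (the name supply_order is not from the source); it returns the 1-based column indices of the elements fed to unit 1_i, one per clock period, with indices wrapping around after n.

```python
def supply_order(i, n):
    """1-based column indices of row i fed to unit 1_i over n clock periods."""
    return [(i - 1 + t) % n + 1 for t in range(n)]


if __name__ == "__main__":
    for unit in (1, 2, 3):
        print(unit, supply_order(unit, 4))
```

For unit 1_2 with n = 4 this gives columns 2, 3, 4, 1, that is, A_22, A_23, A_24, A_21.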

FIG. 2(D) is a timing chart of the operation of FIG. 2(C). The operations at times T_1 through T_n in the respective diagrams of FIG. 2(C) correspond to times T_1, T_2, ..., T_n in FIG. 2(D). At time T_1, as shown in FIG. 2(C), trays 2_1, 2_2, ..., 2_n hold X_1, X_2, ..., X_n, and units 1_1, 1_2, ..., 1_m receive the coefficient matrix elements A_11, A_22, ..., A_mm, respectively.

At this timing, therefore, data processing unit 1_1 computes the product of A_11 and the data X_1 in tray 2_1, data processing unit 1_2 computes the product of X_2 in tray 2_2 and A_22 supplied from memory, and likewise unit 1_m computes the product of A_mm and X_m. This takes place at timing T_1 of FIG. 2(D). That is, in the synchronous clock period in which the sum of products is formed, bus 11_1 carries X_1 and bus 12_1 carries A_11; bus 11_2 carries X_2 and 12_2 carries A_22; 11_3 carries X_3 and 12_3 carries A_33; and 11_m carries X_m and 12_m carries A_mm. The multiply-accumulate operation is therefore performed as shown in the diagram for time T_1 in FIG. 2(C). Because the accumulator Y is 0 at this point, the product is simply added to 0. When the multiply-accumulate operation ends, the shift operation begins. As shown in FIG. 2(D), the shift takes place between T_1 and T_2: data is shifted between adjacent trays, in this case to the left. Operation then proceeds to timing T_2 of FIG. 2(C), which in the timing chart of FIG. 2(D) is likewise the multiply-accumulate interval T_2.

Because of the shift, tray 2_1 now holds X_2, tray 2_2 holds X_3, and tray 2_m holds X_{m+1}, while the coefficient matrix elements A_12, A_23, ..., A_{m,m+1} are input to the units at trays 2_1, 2_2, ..., 2_m, respectively. The data on the buses at this point is shown at timing T_2 of FIG. 2(D). At timing T_2, the product of A_12 and X_2 is formed and added to the previous value of accumulator Y. Unit 1_1 thus adds the product of A_11 and X_1 obtained at T_1 to the product of A_12 and X_2 obtained at T_2 and stores the sum in its accumulator. Similarly, unit 1_2 stores A_22 × X_2 + A_23 × X_3 in its accumulator, and the same holds for unit 1_m. The register is then shifted again and operation proceeds to timing T_3. Tray 2_1 holds X_3, tray 2_2 holds X_4, tray 2_m holds X_{m+2}, and tray 2_n holds X_2, and the multiply-accumulate operation shown in the diagram for time T_3 in FIG. 2(C) is executed.
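The tray contents quoted for T_1, T_2 and T_3 follow from one cyclic left shift per clock period. As a rough check, the hypothetical helper below (tray_contents is an illustrative name, not from the source) lists which X index sits in each of trays 2_1 to 2_n during period T_t.

```python
def tray_contents(n, t):
    """1-based index of the X element held by trays 2_1..2_n during period T_t."""
    return [(p + t - 2) % n + 1 for p in range(1, n + 1)]


if __name__ == "__main__":
    for t in (1, 2, 3):
        print("T_%d:" % t, tray_contents(5, t))
```

With n = 5, period T_3 yields X_3, X_4, X_5, X_1, X_2, so the last tray 2_n holds X_2, in agreement with the description.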

In time interval T_3 of the timing chart of FIG. 2(D), the symbols of the inputs to the data processing units are shown. As the computation proceeds in this way up to time interval T_n, A_1n × X_n is added to the previous accumulator value, as shown for time interval T_n in FIG. 2(C). For the unit at tray 2_1, the sum of the products A_11 × X_1 obtained at T_1, A_12 × X_2 at T_2, A_13 × X_3 at T_3, and so on up to T_{n-1} is already stored in accumulator Y, so adding A_1n × X_n to it completes the inner product of the first row of matrix A and the vector X.

Likewise, for the unit at tray 2_2, the inner product of the second row vector of matrix A and the vector X is computed in n clock periods, and the inner product of the m-th row vector and the vector X is computed in data processing unit 1_m. By processing in this time sequence, multiplication of an m×n rectangular matrix by an n-dimensional vector can be performed with m data processing units in a processing time proportional to n, so a good scaling effect with the number of units is obtained.

FIG. 3 is an explanatory diagram of a second embodiment of the present invention.

FIG. 3 shows the configuration of the systolic system for the case where the product of the m×n matrix A and the vector X of n elements is subsequently multiplied from the left by a k×m matrix B. In FIG. 3(A), parts identical to those shown in FIG. 1 carry the same symbols. That is, 1a is the processing device of data processing unit 1, for example a digital signal processor. 2a is the data holding circuit of tray 2, for example a latch circuit; 2b is the data transfer circuit of tray 2, for example a bus driver; and 2c is the control means of tray 2, for example a logic circuit. 4 is a storage device that is part of the means for supplying data to data processing unit 1 and at the same time part of the means for controlling data processing unit 1, for example a RAM (random access memory). 5 is the means for synchronizing data processing unit 1 and tray 2; within it, 5a is a clock generation circuit, for example a crystal oscillator circuit, and 5b is a clock distribution circuit, for example a buffer circuit.

6 is a selection circuit that selects, as the data input to the tray, either the data returning around the systolic ring or external data, and 7 is a selection circuit that bypasses the circulating data partway along the ring.

This embodiment is identical to the first embodiment up to the point where the intermediate result Ax is obtained. Starting from the state in which each element of the intermediate result Ax has been obtained in the corresponding data processing unit, (a) the intermediate result is written to the trays 2, (b) the bypass selection circuit 7 is turned on to change the length of the shift register to m, and (c) from then on the operation is exactly the same as in the first embodiment of the present invention, with matrix A replaced by matrix B, n by m, and m by k.
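Steps (a) to (c) amount to running the same ring procedure twice, the second time on a shorter ring. The sketch below is a software illustration only; ring_stage and two_stage are hypothetical names, and the assumption k ≤ m ≤ n (so that every active unit has a tray) is made for the example.

```python
def ring_stage(M, v):
    """One pass: len(M) active units on a ring of len(v) trays."""
    rows, length = len(M), len(v)
    assert rows <= length          # assumption: each active unit has its own tray
    trays = list(v)
    acc = [0] * rows
    for t in range(length):
        for i in range(rows):
            acc[i] += M[i][(i + t) % length] * trays[i]
        trays = trays[1:] + trays[:1]   # cyclic left shift
    return acc


def two_stage(B, A, x):
    """Compute B(Ax): the first pass uses ring length n, the second length m."""
    y = ring_stage(A, x)    # intermediate result Ax held in the m units
    # (a) write y back to the trays, (b) shorten the ring to m, (c) repeat with B
    return ring_stage(B, y)


if __name__ == "__main__":
    print(two_stage([[2, -1]], [[1, 2, 3], [4, 5, 6]], [1, 1, 1]))
```

The second call models the bypassed ring simply by passing a vector of length m, so its loop runs m times rather than n.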

FIG. 3(B) is an operation flowchart of the second embodiment, FIG. 3(C) is a schematic diagram of the operation of the second embodiment, and FIG. 3(D) is an operation time chart of the second embodiment.

First, when the product of the m×n matrix A and the vector X of n elements is formed and then multiplied from the left by the k×m matrix B, the operation is as shown in the flowchart of FIG. 3(B). X_i is set in tray 2_i. Y_i is set to 0; that is, the accumulation register in the i-th data processing unit is initialized. The i-th processing unit 1_i multiplies the input from 11_i by the input from 12_i and adds the product to accumulator Y_i. Shift register 3 is then shifted. This multiply-accumulate and shift operation is repeated n times, which forms the product of the rectangular matrix A and the vector X.

Next, the length of the shift register is changed to m and Y_i is transferred to tray 2_i. Z_i (i = 1, ..., k) is then set to 0. To multiply by the matrix B, the i-th processing unit 1_i multiplies the input from 11_i by the input from 12_i and adds the product to accumulator Z_i, and shift register 3 is then shifted. This multiply-accumulate and shift operation is repeated m times.

FIG. 3(C) is a conceptual diagram of the above operation. In the figure, the data X_1 through X_n in the trays 2 are the elements of the vector X, initially n in number.

Initially, m data processing units are effective, each with an accumulator: Y_1, Y_2, ..., Y_m. The m×n rectangular matrix A has m×n elements, A_11 through A_mn. Data processing unit 1_1 receives the first row of the coefficient matrix, A_11, A_12, ..., A_1n, synchronously from input bus 12_1. Data processing unit 1_2 is given A_22, A_23, ..., A_21 in order at each timing of the systolic operation. Likewise, data processing unit 1_m is given A_mm, A_{m,m+1}, ..., A_{m,m-1} synchronously.

FIG. 3(D) is a timing chart of the operation of FIG. 3(C). The operations at times T_1 through T_n in the respective diagrams of FIG. 3(C) correspond to times T_1, T_2, ..., T_n in FIG. 3(D). At time T_1, as shown in FIG. 3(C), trays 1, 2, ..., k, ..., n hold X_1, X_2, ..., X_k, ..., X_n, and units 1, 2, ..., k, ..., m receive the coefficient matrix elements A_11, A_22, ..., A_kk, ..., A_mm, respectively. At this timing, therefore, data processing unit 1 computes the product of A_11 and the data X_1 in tray 1, data processing unit 2 computes the product of X_2 in tray 2 and A_22 supplied from memory, and likewise unit k computes the product of A_kk and X_k and unit m computes the product of A_mm and X_m. This takes place at timing T_1 of FIG. 3(D). That is, in the synchronous clock period in which the sum of products is formed, bus 11_1 carries X_1 and bus 12_1 carries A_11; bus 11_2 carries X_2 and 12_2 carries A_22; 11_k carries X_k and 12_k carries A_kk; and 11_m carries X_m and 12_m carries A_mm. The multiply-accumulate operation is therefore performed as shown in the diagram for time T_1 in FIG. 3(C). Because the accumulator Y is 0 at this point, the product is simply added to 0. When the multiply-accumulate operation ends, the shift operation begins. As shown in FIG. 3(D), the shift takes place between T_1 and T_2: data is shifted between adjacent trays, in this case to the left.

Operation then proceeds to timing T_2 of FIG. 3(C), which in the timing chart of FIG. 3(D) is likewise the multiply-accumulate interval T_2. Because of the shift, tray 1 now holds X_2, tray 2 holds X_3, tray k holds X_{k+1}, and tray m holds X_{m+1}, while the coefficient matrix elements A_12, A_23, ..., A_{k,k+1}, ..., A_{m,m+1} are input to the units at trays 1, 2, ..., k, ..., m, respectively. The data on the buses at this point is shown at timing T_2 of FIG. 3(D). At timing T_2, the product of A_12 and X_2 is formed and added to the previous value of accumulator Y. For tray 1, the product of A_11 and X_1 obtained at T_1 is thus added to the product of A_12 and X_2 obtained at T_2, and the sum is stored in the accumulator. Similarly, for tray 2, A_22 × X_2 + A_23 × X_3 is stored in the accumulator.

The same holds for trays k and m. The register is then shifted again and operation proceeds to timing T_3. Tray 1 holds X_3, tray 2 holds X_4, tray k holds X_{k+2}, tray m holds X_{m+2}, and tray n holds X_2, and the multiply-accumulate operation shown in the diagram for time T_3 in FIG. 3(C) is executed.

As the computation proceeds in this way up to time interval T_n, A_1n × X_n is added to the previous accumulator value, as shown for time interval T_n in FIG. 3(C). For tray 1, the sum of the products A_11 × X_1 obtained at T_1, A_12 × X_2 at T_2, A_1k × X_k at T_k, and so on up to T_{n-1} is already stored in accumulator Y, so adding A_1n × X_n to it completes the inner product of the first row of matrix A and the vector X. Likewise, for tray 2, the inner product of the second row vector of matrix A and the vector X is computed in n clock periods, and the inner product of the m-th row vector and the vector X is computed in data processing unit 1_m.

Next, with k effective data processing units and m effective trays, the operation obtains the product of the k×m matrix B and the vector y of m elements. Y_i is set in tray 2_i. Z_i is set to 0; that is, the accumulation register in the i-th data processing unit is initialized.

The i-th processing unit 1_i multiplies the input from 11_i by the input from 12_i and adds the product to accumulator Z_i. Shift register 3 is then shifted.

This multiply-accumulate and shift operation is repeated m times, which forms the product of the rectangular matrix B and the vector y.

In FIG. 3(C), the data Y_1 through Y_m in the trays 2 are the elements of the vector y, m in number. The effective number of data processing units is k, each with an accumulator: Z_1, Z_2, ..., Z_k. The k×m rectangular matrix B has k×m elements, B_11 through B_km. Data processing unit 1_1 receives the first row of coefficient matrix B, B_11, B_12, ..., B_1m, synchronously from input bus 12_1. Data processing unit 1_2 is given B_22, B_23, ..., B_21 in order at each timing of the systolic operation. Likewise, data processing unit 1_k is given B_kk, B_{k,k+1}, ..., B_{k,k-1} synchronously.

FIG. 3(D) is the timing chart for the operation of FIG. 3(C), and the same symbols are used in it. The operations at times T_{n+1} through T_{n+m+1} in the respective diagrams of FIG. 3(C) correspond to the times in FIG. 3(D).

At time T_{n+1}, as shown in FIG. 3(C), Y_1, Y_2, ..., Y_m are transferred to trays 1, 2, ..., m, and the elements B_11, B_22, ..., B_kk of coefficient matrix B are supplied to units 1, 2, ..., k, respectively. At the next timing, T_{n+2}, data processing unit 1 computes the product of B_11 and the data Y_1 in tray 1, data processing unit 2 computes the product of Y_2 in tray 2 and B_22 supplied from memory, and likewise unit k computes the product of B_kk and Y_k. This takes place at timing T_{n+2} of FIG. 3(D). That is, in the synchronous clock period in which the sum of products is formed, bus 11_1 carries Y_1 and bus 12_1 carries B_11; bus 11_2 carries Y_2 and 12_2 carries B_22; 11_3 carries Y_3 and 12_3 carries B_33; and 11_k carries Y_k and 12_k carries B_kk. The multiply-accumulate operation is therefore performed as shown in the diagram for T_{n+2} in FIG. 3(C). Because the accumulator Z is 0 at this point, the product is simply added to 0. When the multiply-accumulate operation ends, the shift operation begins. As shown in FIG. 3(D), the shift takes place between T_{n+2} and T_{n+3}: data is shifted between adjacent trays, in this case to the left.

Operation then proceeds to timing T_{n+3} of FIG. 3(C), which in the timing chart of FIG. 3(D) is likewise the multiply-accumulate interval T_{n+3}. Because of the shift, tray 1 now holds Y_2, tray 2 holds Y_3, and tray k holds Y_{k+1}, while the elements B_12, B_23, ..., B_{k,k+1} of coefficient matrix B are input to the units at trays 1, 2, ..., k, respectively. The data on the buses at this point is shown at timing T_{n+3} of FIG. 3(D). At timing T_{n+3}, the product of B_12 and Y_2 is formed and added to the previous value of accumulator Z. Unit 1 thus adds the product of B_11 and Y_1 obtained at T_{n+2} to the product of B_12 and Y_2 obtained at T_{n+3} and stores the sum in accumulator Z. Similarly, unit 2 stores B_22 × Y_2 + B_23 × Y_3 in accumulator Z, and the same holds for tray k. The register is then shifted again and operation proceeds to timing T_{n+4}.

As the computation proceeds in this way up to time interval T_{n+m+1}, B_1m × Y_m is added to the previous value of accumulator Z, as shown for time interval T_{n+m+1} in FIG. 3(C). In unit 1, the sum of the products B_11 × Y_1 obtained at T_{n+2}, B_12 × Y_2 at T_{n+3}, B_13 × Y_3 at T_{n+4}, and so on up to T_{n+m} is already stored in accumulator Z, so adding B_1m × Y_m to it completes the inner product of the first row of matrix B and the vector y. Likewise, unit 2 computes the inner product of the second row vector of matrix B and the vector y, and data processing unit 1_k computes the inner product of the k-th row vector and the vector y. By processing in this time sequence, the product with the k×m rectangular matrix B can be executed in a processing time proportional to m, so a good scaling effect with the number of units is obtained.

In this embodiment it is important that the length of shift register 3 can be changed and that the intermediate results can be written to the trays 2 and processed as new data. If the length of shift register 3 could not be changed, n unit times would be needed to circulate all of the data. In addition, because intermediate results can be processed as new data, a wider range of processing than the ring systolic array method can be executed with small-scale hardware. It is also important that the time required for the write is short and constant.
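The benefit of shortening the ring can be made concrete with a simple count of clock periods. The function below is a hypothetical back-of-the-envelope model, not a statement about the actual hardware timing: it assumes n periods for the first product, one period for writing the intermediate result to the trays (T_{n+1} in the timing chart), and then m periods with the bypass or n periods without it.

```python
def total_periods(n, m, bypass=True):
    """Clock periods for B(Ax) under the simple model stated in the lead-in."""
    first_stage = n                     # T_1 .. T_n
    write_back = 1                      # T_{n+1}: Y_i written to tray 2_i
    second_stage = m if bypass else n   # ring length m with bypass, else n
    return first_stage + write_back + second_stage


if __name__ == "__main__":
    print(total_periods(8, 3), total_periods(8, 3, bypass=False))
```

Under this model the saving from the bypass is n - m periods, which grows as the intermediate vector becomes shorter relative to the input vector.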

FIG. 4 is an explanatory diagram of a third embodiment of the present invention.

This system computes the product of the transpose A^T of an m×n rectangular matrix A, that is, an n×m matrix, and a vector X of m elements. In the figure, parts identical to those shown in FIG. 1 carry the same symbols.

To obtain the product of the transposed matrix A^T and the vector X, the partial row vectors that make up matrix A are stored in the storage devices 4 connected to the respective data processing units 1, and the partial sums produced during the computation are accumulated in the data holding circuits 2a in the trays while the data on shift register 3 is circulated.

FIG. 4(A) is a detailed block diagram of the configuration of the third embodiment, which obtains the product Y (n elements) of the n×m (n ≥ m ≥ 1) matrix A^T and a vector X of m elements. In the figure, parts identical to those shown in FIG. 1 carry the same symbols. 1a is the processing device of data processing unit 1, for example a digital signal processor; 2a is the data holding circuit of tray 2, for example a latch circuit; 2b is the data transfer circuit of tray 2, for example a bus driver; 2c is the control means of tray 2, for example a logic circuit; 4 is a storage device that is part of the means for supplying data to data processing unit 1 and at the same time part of the means for controlling data processing unit 1, for example a RAM (random access memory); 5 is the means for synchronizing data processing unit 1 and tray 2; 5a is a clock generation circuit, for example a crystal oscillator circuit; and 5b is a clock distribution circuit, for example a buffer circuit.

FIG. 4(B) is an operation flowchart of the third embodiment. X_i is set in unit 1_i (i = 1, ..., m), and Y_i (i = 1, ..., n) is set to 0. Each unit 1_i (i = 1, ..., m) multiplies A_ij by X_i and adds the product to the Y_j currently in its tray, after which the register is shifted. This operation is repeated n times, so that every Y_j (j = 1, ..., n) passes each unit once.
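The transposed product can be simulated in the same style as the first embodiment, except that the trays now carry the partial sums Y_j while each unit keeps its own X_i. This is an illustrative sketch under assumptions: the name ring_matvec_transposed, 0-based indexing, and list-based trays are hypothetical, and m ≤ n is assumed as in FIG. 4(A).

```python
def ring_matvec_transposed(A, x):
    """Simulate y = A^T x with partial sums Y_j circulating on n trays."""
    m, n = len(A), len(A[0])
    assert len(x) == m and m <= n and all(len(row) == n for row in A)
    trays = [0] * n                # tray p starts with Y_p = 0
    for t in range(n):             # n multiply-accumulate and shift steps
        for i in range(m):         # unit i holds X_i and row i of A
            j = (i + t) % n        # tray i currently carries Y_j
            trays[i] += A[i][j] * x[i]
        trays = trays[1:] + trays[:1]   # cyclic left shift of the Y data
    return trays                   # after n shifts, tray p holds Y_p again


if __name__ == "__main__":
    print(ring_matvec_transposed([[1, 2, 3], [4, 5, 6]], [1, -1]))
```

Note that matrix A is read row by row exactly as stored; no transposed copy is ever formed.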

Multiplication of a transposed matrix by a vector can thus be computed with each partial row vector of matrix A left as it is in the storage device 4. This is extremely important when executing backpropagation, one of the learning algorithms for neural networks, described later. Moreover, the network is a ring network, so the amount of interconnection is only of order n. The data transfer time is hidden behind the processing time, so transfer adds no overhead. The scheme is, in addition, SIMD.

FIG. 4(C) is a schematic diagram of the operation of the third embodiment. Unit 1_1 is given A_11 through A_1n in order. Unit 1_2 is given A_22, A_23, ..., A_21; the k-th unit is given A_kk, A_{k,k+1}, ..., A_{k,k-1} in order via its memory circuit; and the m-th unit is given A_mm, A_{m,m+1}, ..., A_{m,m-1} in order. What circulates on the trays is Y_1 through Y_n.
m. ,,... r As +, I-1 are given in order. Moreover, the items circulating on the tray are Y, to Y,,.

FIG. 4(D) is an operation time chart of the third embodiment. It shows the data on the buses in time intervals T_1 through T_n, which correspond to the diagrams for time intervals T_1 through T_n in FIG. 4(C).

In time interval T_1, Y_1 through Y_n are all 0. The product of A_11 and X_1 is formed in unit 1_1 and added to Y_1. At the same time, A_22 × X_2 is added to Y_2, A_kk × X_k to Y_k, and A_mm × X_m to Y_m. The shift operation then follows, leading to timing T_2.

That is, the Y data circulates. In the first unit, A_12 × X_1 is computed and added to Y_2; since Y_2 already holds the value A_22 × X_2 obtained at T_1, Y_2 becomes A_22 × X_2 + A_12 × X_1.

Similarly, in unit 2, A_23 × X_2 is added to the previous value of Y_3.

In the k-th unit, A_{k,k+1} × X_k is added to Y_{k+1}, and in the m-th unit, A_{m,m+1} × X_m is added to Y_{m+1}.

As the Y data circulates in this way, in time interval T_n the first unit 1_1, for example, adds A_1n × X_1 to the Y_n accumulated up to that point, and A_21 × X_2 is added to Y_1. Viewed as a whole: the first element X_1 of the vector X is multiplied by A_11 at T_1, giving A_11 × X_1, which is stored in Y_1. The second term of the first row of the transposed matrix A^T, A_21 × X_2, is actually computed in the last clock period T_n, and it is accumulated into the same Y_1. The product of A_m1, the last element of the first row of A^T, and X_m is computed in the m-th unit in clock period T_{n-m+2} of FIG. 4(C); that is, the product of A_m1 and X_m is added to Y_1. The same applies to the second row of A^T: the product of A_12 and X_1 is computed in unit 1 in clock period T_2.

Likewise, A22 × X2 is computed in the second unit in clock period T1. Y2 then comes around again and is next used for a product in time period T_{n-m+3}; from that period on, a multiplication and a shift are performed in each period. In time period Tn, Y2 is at the third unit, and the value added to it is A32 × X3. The inner product of the second row of the transposed matrix A^T and vector X is therefore complete at Tn. In general, since the data line from the k-th tray is 11_k, the k-th unit can be followed along 11_k in FIG. 4(D). It computes Yk + Akk × Xk in T1, Y_{k+1} + A_{k,k+1} × Xk in T2, Y_{k+2} + A_{k,k+2} × Xk in T3, and so on, then Y_{k-2} + A_{k,k-2} × Xk in T_{n-1} and Y_{k-1} + A_{k,k-1} × Xk in Tn. In this way the product of the transposed matrix A^T and the m-dimensional vector X is carried out. In other words, to obtain the product of A^T and X, the partial row vectors that make up matrix A are stored in the storage devices 4 connected to the respective data processing units 1, and the partial sums arising during the computation are accumulated in the data holding circuits of the trays 2 while circulating around the shift register.

With this method, when the product X of matrix A and vector U is followed by the product of the transpose A^T and a vector X, the partial row vectors of matrix A that were stored in the storage devices 4 connected to the data processing units 1 for the first product can be used as they are. The processing can therefore be performed without transferring partial matrices of the transposed matrix A^T to the data processing units 1, which saves the time needed for the transfer and shortens the processing time further.
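The circulation of partial sums described above can be sketched in software. The following is a minimal simulation, not the patented circuit: the function name `transposed_matvec_ring` and the list-based trays are illustrative assumptions, and the inner loop over units runs sequentially here although the units operate in parallel in hardware. Matrix A is kept row-wise, one row per unit, exactly as it would be for the product A × U.

```python
def transposed_matvec_ring(A, x):
    """Compute y = A^T x while A stays stored row-wise (one row per unit).

    A has m rows (one per data processing unit) and n columns (one per
    tray), with m <= n. Unit k is attached to tray k and keeps x[k]
    locally; the partial sums Y circulate on the ring of trays.
    """
    m, n = len(A), len(A[0])
    assert m <= n and len(x) == m
    trays = [0.0] * n                    # at step t, trays[p] holds Y_((p+t) mod n)
    for t in range(n):                   # time periods T1 .. Tn
        for k in range(m):               # sequential stand-in for parallel units
            j = (k + t) % n              # index of the Y currently in tray k
            trays[k] += A[k][j] * x[k]   # Y_j += A_kj * X_k
        trays = trays[1:] + trays[:1]    # cyclic shift: tray p takes tray p+1's data
    return trays                         # after n shifts, trays[p] holds Y_p


A = [[1, 2, 3],
     [4, 5, 6]]          # m = 2 units, n = 3 trays
print(transposed_matvec_ring(A, [1, 10]))
```

No row of A is moved between units; only the partial sums travel, which is the saving in transfer time that the text points out.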

FIG. 4(E) is a flowchart showing the repeated portion of FIG. 4(B) broken down in detail.

FIG. 5 shows a fourth embodiment of the present invention. This embodiment is a configuration diagram of a neurocomputer using the present invention.

In this figure, elements identical to those shown in FIG. 4 are denoted by the same symbols. 1a is the processing device of the data processing unit 1 and consists of, for example, a digital signal processor. 2a is the data holding circuit of the tray 2 and consists of, for example, a latch circuit. 2b is the data transfer circuit of the tray 2 and consists of, for example, a bus driver. 2c is the control means of the tray 2 and consists of, for example, a logic circuit. 4 is a storage device that forms part of the means for supplying data to the data processing unit 1 and, at the same time, part of the means for controlling the data processing unit 1; it consists of, for example, a RAM.

5 is the means for synchronizing the operation of the data processing units 1 and the trays 2. 5a is a clock generation circuit, for example a crystal oscillator circuit, and 5b is a clock distribution circuit, for example a buffer circuit. In addition, 101 is a sigmoid function unit that computes a monotonically non-decreasing continuous function called a sigmoid function, together with its derivative, and is realized by, for example, a polynomial approximation. 103 is the means for judging the end of learning. It consists of, for example, a host computer connected to the processing units 1 by communication means, means for notifying the host computer through the communication means of the output errors computed by the processing units 1, and means for judging the end of learning, generally on the basis of a plurality of such output error values, and stopping the neurocomputer. 102 denotes the neurocomputer as a whole.

FIG. 5(B) shows an example of the neuron model, which is the basic element of the computation in the neurocomputer of the present invention. The neuron model multiplies each of the inputs X1, X2, ..., Xn by the corresponding weight W1, W2, ..., Wn of its synaptic connection, takes the sum, and uses this sum as the internal value U. A nonlinear function f is applied to U to give the output Y. An S-shaped sigmoid function such as the one shown in the figure is generally used as the nonlinear function f.
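The neuron model above amounts to a weighted sum followed by a nonlinear function. A minimal sketch, assuming tanh as the sigmoid (the text later gives tanh only as an example) and using the hypothetical function name `neuron`:

```python
import math


def neuron(inputs, weights, f=math.tanh):
    """Return (U, Y) for one neuron: U = sum of W_i * X_i, Y = f(U)."""
    u = sum(w * x for w, x in zip(weights, inputs))   # internal value U
    return u, f(u)                                    # output Y = f(U)


u, y = neuron([1.0, 2.0], [0.5, -0.25])
print(u, y)
```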

FIG. 5(C) is a conceptual diagram of a hierarchical neural network that forms the neurocomputer from a plurality of the neuron models of FIG. 5(B), arranged in a three-layer structure of input layer, intermediate layer, and output layer. The input layer, which is the first layer, receives the input signals I1, I2, ..., I_N(1). In the intermediate layer, which is the second layer, each unit, that is, each neuron model, is connected to all the neuron models of the first layer; each connecting branch is a synaptic connection and is given a weight value Wij. In the output layer, which is the third layer, each unit is likewise connected to all the neuron models of the intermediate layer, and its output is sent to the outside. During learning in this neural network, the error between the teacher signal corresponding to the input pattern given to the input layer and the output signal of the output layer is obtained, and the weights between the intermediate layer and the output layer and between the first and second layers are determined so that this error becomes very small. This algorithm is called the backpropagation rule, that is, the backpropagation learning rule.

When the weight values determined by the backpropagation learning rule are retained and associative processing such as pattern recognition is performed, giving the first layer an incomplete pattern that deviates slightly from the pattern to be recognized causes the output layer to produce an output signal corresponding to that pattern. This signal closely resembles the teacher signal that was given for the pattern during learning. If the difference from the teacher signal is very small, the incomplete pattern has been recognized.

The operation of this neural network can be realized in engineering terms with the neurocomputer 102 of FIG. 5(A). This embodiment uses a three-layer network configuration as shown in FIG. 5(C), but, as the following description shows, the number of layers has no essential effect on the operation of the embodiment. In the figure, N(1) is the number of neurons in the first layer. Since the output of each neuron in the first layer, that is, the input layer, is normally taken to be equal to its input, no substantial processing is needed there. FIG. 5(D) shows normal operation, that is, the forward processing performed for pattern recognition.

FIG. 5(D) is the forward processing flowchart of the fourth embodiment. In forward processing, the weight coefficients on the connecting branches between the layers of the network shown in FIG. 5(C) are assumed to be already determined. When the network of FIG. 5(C) is realized with the neurocomputer of FIG. 5(A), the following processing is performed. The basic forward operation, in the neuron model of FIG. 5(B), is to multiply the inputs by the weights, take the sum as U, and apply the nonlinear function to U. This is done in each layer. First, in step 70, the input data, that is, the data I1 to I_N(1), are set on the shift register. Then, with L denoting the number of layers, all of the following processing is repeated once per layer. If L is 3, for example, it is repeated three times. What is repeated is the forward processing for one layer. The processing then ends.

The forward processing for one layer is shown in the lower part of the figure. Taking the intermediate layer as an example, l is 2. In step 72, the length of the shift register is set to N(l-1); since l = 2, this is N(1), the number of units in the input layer. Step 73 is the processing of the neuron model in the intermediate layer. The index j runs from 1 to N(1), the number of units in the input layer. Wij(l), with l = 2, is the weight coefficient of the connection between the input layer and the intermediate layer. Yj(l-1) is the output of the j-th unit of the input layer, and i denotes the i-th unit of the intermediate layer. The state Ui(2) of the i-th unit is computed as the sum over j of the input-layer outputs Yj multiplied by the weights Wij.

In step 74, the state Ui(2) of the i-th unit of the intermediate layer is input to the nonlinear function, that is, the sigmoid function, and the output is Yi(2). The inner product calculation of step 73 is performed inside the units of FIG. 5(A), whereas the sigmoid function is computed by the sigmoid function unit 101. In step 75, the output Yi(2) of the i-th unit of the intermediate layer, for example, is output to the i-th tray, and the processing ends. This forward processing is carried out for the input layer, the intermediate layer, and the output layer, which completes the forward processing of every layer.

In other words, the processing needed to simulate a single neuron is the operation given by the formula in FIG. 5(B): the inner product of the weights and the input vector, followed by the evaluation of the sigmoid function on the result, the latter being realized by the sigmoid function unit 101. The processing of one layer of the network, as shown in FIG. 5(C), therefore consists of performing this single-neuron operation for all the neurons in the layer. The inner product operations together form the vector U(l) = (Ui(l)), which is the product of the matrix W(l) = (Wij(l)), whose rows are the coupling coefficient vectors of the individual neurons i, and the vector x(l) = (Xj(l)) of the inputs to the layer. This can be executed by the method described for the third embodiment of the present invention. The sigmoid operation is performed by each sigmoid function unit 101 receiving an element Ui(l) of the product vector and outputting the corresponding function value Yi(l) = f(Ui(l)). If a following layer, the (l+1)-th layer, exists, each function value Yi(l) is written to its tray, and in the processing of the (l+1)-th layer the above process is repeated with these values as the input.
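The layer-by-layer forward processing can be summarized in a few lines. This is a sketch under stated assumptions: `forward_pass` is a hypothetical name, the weight matrices are plain nested lists, and the matrix-vector product is computed directly rather than on the tray ring. The input layer passes its input through unchanged, as stated earlier in the text.

```python
import math


def forward_pass(weight_matrices, inputs, f=math.tanh):
    """Run forward processing for layers 2..L.

    weight_matrices[k] is W(l) for l = k + 2; row i holds the weights W_ij(l)
    of neuron i. Returns (us, ys), where ys[0] is the input layer output Y(1)
    and us[0] is None because the input layer does no computation.
    """
    ys = [list(inputs)]
    us = [None]
    for W in weight_matrices:
        x = ys[-1]                                   # outputs of the previous layer
        u = [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]
        us.append(u)                                 # U(l) = W(l) x
        ys.append([f(u_i) for u_i in u])             # Y_i(l) = f(U_i(l))
    return us, ys


W2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 2 inputs -> 3 intermediate units
W3 = [[1.0, 1.0, 1.0]]                      # 3 intermediate units -> 1 output
us, ys = forward_pass([W2, W3], [0.1, 0.2])
print(ys[-1])
```

Passing `f=lambda u: u` makes the network linear, which is convenient for checking the matrix products by hand.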

Next, the learning operation, that is, the execution of the backpropagation algorithm on the neurocomputer of FIG. 5(A), will be described.

FIG. 5(E) is the learning processing flowchart of the fourth embodiment. Learning in a neurocomputer means correcting the weights of the neurons until the network satisfies the desired input-output relationship. In the learning method, a number of pairs of a desired input signal vector and a teacher signal vector are prepared, as many as there are elements in the set of teacher signals. One pair is selected, its input signal Ip is fed to the network to be trained, and the output of the network for that input is compared with the correct output signal, that is, the teacher signal Op corresponding to the input signal. The difference is called the error, and the weight of each neuron is corrected on the basis of this error and of the input and output signal values at that time.

This process is repeated over all the elements of the set of teacher signals until learning converges. All of the input patterns are thus stored in distributed form as weight values. In this weight correction process, called backward processing, the error obtained at the output layer is propagated toward the input layer, in the direction opposite to the normal signal flow, while being transformed along the way. This is the backpropagation algorithm.

First, the error D is defined recursively as follows. Di(l) is the error propagated backward from the i-th neuron of the l-th layer, and L is the number of layers in the network.

$$D_i(L) = f'(U_i(L))\,\bigl(Y_i(L) - O_{pi}\bigr) \quad \text{(final layer)} \qquad (1)$$

$$D_i(l-1) = f'(U_i(l-1)) \sum_{j=1}^{N(l)} W_{ji}(l)\,D_j(l) \quad (l = 2, \ldots, L) \qquad (2)$$

$$(i = 1, \ldots, N(l))$$

Here f'(U) is the value at X = U of the derivative f'(X) of the sigmoid function f(X) with respect to X. For example, if

$$f(X) = \tanh X \qquad (3)$$

then

$$f'(X) = \frac{d(\tanh X)}{dX} = 1 - \tanh^2 X = 1 - f^2(X) \qquad (4)$$

and therefore

$$f'(U_i) = 1 - f^2(U_i) = 1 - Y_i^2 \qquad (5)$$
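The recursive error definition can be written out directly. The sketch below assumes f = tanh, so that f'(U_i) = 1 - Y_i^2 as in equation (5); the function names `output_error` and `propagate_error` are hypothetical. Note that equation (2) reads the weight matrix by column (W_ji), which is the transposed product discussed later in the text.

```python
def output_error(y_out, teacher):
    """Equation (1) with f = tanh: D_i(L) = (1 - Y_i^2) * (Y_i - O_pi)."""
    return [(1.0 - y * y) * (y - o) for y, o in zip(y_out, teacher)]


def propagate_error(W, d_next, y_prev):
    """Equation (2) with f = tanh.

    W is W(l) with one row per neuron j of layer l; d_next is D(l);
    y_prev holds the outputs Y_i(l-1). Returns D(l-1).
    """
    n_prev = len(W[0])
    return [
        (1.0 - y_prev[i] ** 2) * sum(W[j][i] * d_next[j] for j in range(len(W)))
        for i in range(n_prev)
    ]


print(output_error([0.5], [1.0]))
print(propagate_error([[1.0, 2.0], [3.0, 4.0]], [1.0, -1.0], [0.0, 0.5]))
```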

Based on these Di and Yi, the weights are updated as follows.

Basically, the following formulas are used. Here η is the step size of the weight update. If it is small, learning converges stably but slowly; if it is too large, learning does not converge.

$$W_{ij}(l)^{(t+1)} = W_{ij}(l)^{(t)} + \Delta W_{ij}(l)^{(t)} \qquad (6)$$

$$\Delta W_{ij}(l)^{(t)} = \eta\,D_i(l)\,Y_j(l-1) \quad (l = 2, \ldots, L) \qquad (7)$$

However, the following formula is also widely used. It corresponds to passing ΔWij(l)(t) of the above formula through a first-order digital low-pass filter, and α is the parameter that determines the filter's time constant.

$$\Delta W_{ij}(l)^{(t+1)} = \eta\,D_i(l)\,Y_j(l-1) + \alpha\,\Delta W_{ij}(l)^{(t)} \qquad (8)$$

The operations needed in this backward processing are operations between vectors or between a matrix and a vector. The central one is the multiplication of the transposed matrix W^T of the weight matrix W, whose elements are the weights of the neurons of each layer, by the error vector D(l). In the general case where a layer contains several neurons, the error is a vector.
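The update formulas (6) to (8) can be sketched as two small helpers. The names `weight_step` and `apply_step` are hypothetical, and the sketch only performs the arithmetic of the formulas as written; the sign convention of η relative to the error D is taken from the text and is not examined here.

```python
def weight_step(d, y_prev, eta, alpha=0.0, prev_delta=None):
    """Return the correction matrix for one layer.

    With alpha = 0 this is equation (7): dW_ij = eta * D_i * Y_j.
    With alpha != 0 it is equation (8), which adds alpha times the
    previous correction (a first-order low-pass filter on the corrections).
    """
    rows, cols = len(d), len(y_prev)
    if prev_delta is None:
        prev_delta = [[0.0] * cols for _ in range(rows)]
    return [
        [eta * d[i] * y_prev[j] + alpha * prev_delta[i][j] for j in range(cols)]
        for i in range(rows)
    ]


def apply_step(W, delta):
    """Equation (6): W(t+1) = W(t) + dW(t), element by element."""
    return [[w + dw for w, dw in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]


first = weight_step([1.0, 2.0], [0.5, 0.25], eta=0.5)
second = weight_step([1.0, 2.0], [0.5, 0.25], eta=0.5, alpha=0.5, prev_delta=first)
print(first, second)
```

Both helpers use only scalar multiples and element-wise additions of matrices, which matches the later remark that such smoothing can be carried out on the same hardware.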

The flowchart on the left of FIG. 5(E) will now be described.

Forward processing and backward processing, each one layer at a time, are performed. First, the input data Ip are set on the shift register and the system performs forward processing for one layer. Since this is done for each layer, the forward processing is repeated as many times as there are layers. The output data Op are then output and set on the shift register. From step 79 onward, the following is executed in parallel for each unit of the output layer: the error Di(L) = Yi(L) - Op(i) is computed and set in the i-th tray. Backward processing is then performed layer by layer from the output layer toward the input layer.
i) and set this error to the i-th tray. Then, backward processing is performed for each layer from the output layer to the input layer.

この後向き処理は第５図（Ｅ）の右上側に示されている
。第Ｌ番目の層に関して、この層の数はＮ（ｊ２）であ
るからシフトレジスタ長をＮ（Ｉ！．）にする。そして
以下の動作をこの前の層のユニット数だけ並列に実行す
る。すなわち、上記（２）式を、ステップ８３において
実行する。ここで重要なのは重みはＷＪＩ（ｌ）となっ
ており、これは重み行列の転置行列ＷＴの要素になって
いる。そしてステップ８４において、上記（６），　（
７）あるいは（８）式を計算し、重みの更新を行う。ス
テップ８５で、求まった誤差Ｄ．（ｉ　　Ｎをトレイの
ｉ番目に出力する。これは次の誤差を計算するため、ス
テップ８４の動作に必要となる。第５図（Ａ）の右下は
第５図（Ｅ）の左のフローチャート、すなわち前向き処
理と後向き処理の連続処理を学習が習得するまで繰り返
すことを意味するフローチャートである。また、このよ
うな処理において重みの更新と学習を安定にするために
重みの修正量の平滑化等の処理があるが、これらはいず
れも行列のスカラ倍及び行列同士の加減算からなり、や
はり、本ニューロコンピュータにおいて行える。またシ
グモイド関数ユニット１０１はハードウエアで実現する
ものとしているが、ソフトウェアで実現してもよい。ま
た、学習の終了の反転千段１０３はホストコンピュータ
上のソフトウエアで実現してもよい。This backward processing is shown on the upper right side of FIG. 5(E). Regarding the Lth layer, since the number of layers is N(j2), the shift register length is set to N(I!.). Then, the following operations are executed in parallel for the number of units in the previous layer. That is, the above equation (2) is executed in step 83. What is important here is the weight WJI(l), which is an element of the transposed matrix WT of the weight matrix. Then, in step 84, the above (6), (
7) or (8) is calculated and the weights are updated. In step 85, the error D. (i N is output to the i-th tray. This is necessary for the operation of step 84 in order to calculate the next error. The lower right of FIG. 5(A) is the same as the left of FIG. 5(E). This is a flowchart that means that sequential processing of forward processing and backward processing is repeated until learning is mastered.In addition, in such processing, in order to update the weights and stabilize learning, the amount of weight correction is smoothed. These processes include scalar multiplication of matrices and addition and subtraction between matrices, which can also be performed on this neurocomputer.Furthermore, although the sigmoid function unit 101 is assumed to be realized by hardware, it can be realized by software. Furthermore, the inversion step 103 at the end of learning may be realized by software on the host computer.

The neurocomputer described above will be explained further with reference to FIG. 5(F). FIG. 5(F) is a processing flow diagram for error backpropagation learning, written in vector notation. In the figure, x(l) is the neuron vector of the l-th layer and W is the coupling coefficient matrix, that is, the weight matrix. f is the sigmoid function, e(l) is the error vector propagated backward from the output side of the l-th layer, and ΔW is the weight correction amount. When an input signal is given, in the three-layer case and with the input layer disregarded, the forward processing of the hidden layer is performed first.

This is u = Wx(l). Applying the nonlinear function to u gives the input to the next layer, the (l+1)-th layer. Since this is the input to the output layer, the forward processing of the output layer is performed next. The teacher signal is then input and backward processing begins. In the output layer, the error e between the teacher signal and the output signal is multiplied by the derivative of f and passed to backward processing. The error for the intermediate layers is obtained by multiplying the back-propagated error signal by the derivative and then multiplying the result by the transposed matrix W^T of the weight matrix. ΔW is obtained from the value formed by multiplying each element of the error vector by the derivative of the sigmoid, together with the elements of the preceding W^T, and W is then updated. In this way the backward processing of the output layer and of the hidden layer is carried out.

The operations performed in forward processing are the product of the weight matrix W and the input vector x, and the evaluation of the sigmoid function on each element of the resulting vector. These can be computed in parallel across the neurons. Backward processing consists of two main tasks. The first is to propagate the error between the teacher signal and the output signal backward, from back to front, transforming it step by step. The second is to correct the weights on the basis of that error. The backward calculation requires multiplication by the transposed matrix W^T of the weight matrix W, and the product of W^T and a vector was described in the preceding embodiment. The key to realizing backpropagation learning is therefore an efficient way of multiplying a vector by the transposed weight matrix W^T.

Examples of the forward product-sum calculation and the backward product-sum calculation will now be described with reference to FIGS. 5(G) and 5(H).

The forward product-sum operation is a matrix × vector calculation in which the matrix is the weight matrix W. In the present invention, when the matrix-vector product u = Wx is computed, for example for the following equation

$$\begin{pmatrix} u_1 \\ u_2 \\ u_3 \end{pmatrix} = \begin{pmatrix} W_{11} & W_{12} & W_{13} & W_{14} \\ W_{21} & W_{22} & W_{23} & W_{24} \\ W_{31} & W_{32} & W_{33} & W_{34} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix} \qquad (9)$$

the products of the rows of the weight matrix with the vector x are computed simultaneously. This processing scheme is explained with reference to FIG. 5(G). The weight matrix W is a rectangular matrix, for example a 3 × 4 matrix. The elements of vector x are placed on the trays.

At time T1, the products of X1 and W11, X2 and W22, and X3 and W33 are computed in the respective units. On moving to T2, the elements of vector x are cyclically shifted upward. At T2, the product of W12 and X2 is added to U1, so that U1 is now X1 × W11 + X2 × W12. At the same time, W23 is multiplied by X3 in the second unit and W34 by X4 in the third unit. At T3, W13 × X3 is added to U1, W24 × X4 is added to U2, and W31 × X1 is added to U3; X2 takes no part in the computation at this time. At T4, W14 × X4, W21 × X1, and W32 × X2 are computed simultaneously and added to U1, U2, and U3 respectively; X3 takes no part in this step. By allowing for these elements that take no part in a given step, the product of a rectangular matrix and a vector is carried out.

The partial vector Wi* of W is stored in the local memory of PE-i, skewed so that Wii comes first. Xi rides on a tray and makes one counterclockwise revolution around the ring. Ui is accumulated in a register inside PE-i.

Processing starts from the leftmost state, with Ui = 0.

PE-i multiplies the Xj in front of it by Wij and adds the result to Ui. At the same time, Xj is transferred to the neighboring tray (it circulates counterclockwise around the ring).

Repeating this four times yields all the Ui simultaneously.

The points to note are that Wi* is skewed, that processing starts with all the Xi already in the trays, and that all the Ui are obtained at the same time.
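The forward product-sum scheme with skewed local storage can be simulated as follows. This is a sketch, not the hardware: `skew_rows` and `forward_ring` are hypothetical names, the trays are a Python list, and the loop over PEs stands in for parallel operation. The 3 × 4 case from the text is used as the example, so the fourth tray has no PE attached and its element is idle in each step.

```python
def skew_rows(W):
    """Store row i so that W_ii comes first: local[i][t] = W[i][(i + t) mod n]."""
    n = len(W[0])
    return [[row[(i + t) % n] for t in range(n)] for i, row in enumerate(W)]


def forward_ring(W, x):
    """Compute u = W x with x circulating on n trays and m <= n PEs."""
    m, n = len(W), len(x)
    local = skew_rows(W)               # what each PE holds in its local memory
    trays = list(x)                    # all X_j start out in the trays
    u = [0.0] * m                      # U_i accumulates inside PE-i
    for t in range(n):                 # time periods T1 .. Tn
        for i in range(m):             # sequential stand-in for parallel PEs
            u[i] += local[i][t] * trays[i]
        trays = trays[1:] + trays[:1]  # every X_j moves to the neighboring tray
    return u


W = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12]]
print(skew_rows(W))
print(forward_ring(W, [1, 0, 2, 0]))
```

Because of the skew, each PE reads its local memory sequentially from the first entry, so no address computation depends on the PE's position in the ring.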

FIG. 5(H) illustrates the backward product-sum calculation.

It is a timing diagram for computing the product of the transposed matrix and a vector, e = W^T v. Here the vector v consists of the elements of the error vector of the preceding layer multiplied by the derivative of the nonlinear function, and e is the error vector to be obtained for backpropagation in the next layer. The important point of the present invention is that, even for the transposed matrix W^T, the computation can be performed with W left in memory in the same arrangement that is used for the forward product-sum calculation.

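That is, in the present invention this is achieved by cyclically shifting the vector e that is to be obtained. The product of the transposed matrix W^T and the vector v follows equation (10).

$$\begin{pmatrix} e_1 \\ e_2 \\ e_3 \\ e_4 \end{pmatrix} = \begin{pmatrix} W_{11} & W_{21} & W_{31} \\ W_{12} & W_{22} & W_{32} \\ W_{13} & W_{23} & W_{33} \\ W_{14} & W_{24} & W_{34} \end{pmatrix} \begin{pmatrix} v_1 \\ v_2 \\ v_3 \end{pmatrix} \qquad (10)$$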

As the above equation shows, the matrix W is transposed and is, moreover, a rectangular matrix. e1 is W11 × v1 + W21 × v2 + W31 × v3. To carry out this operation, in FIG. 5(H) the first unit (DSP) computes the product of W11 and v1 in time period T1 and adds it to e1, which is initially 0. After a cyclic shift the process moves to T2, but e1 is not the target of any operation at T2. At T3, e1 becomes the target of the third unit.

That is, W31 × v3 is added to the previous value W11 × v1, so that in time period T3 e1 holds W11 × v1 + W31 × v3. On moving to T4, e1 has been cyclically shifted to the second unit, where W21 × v2 is added to it. The inner product of the first row of the matrix in equation (10) with the vector v is thus complete, and the result is stored in e1.

Similarly, the product of the second row and the vector can be found by following e2.

At T1, W22 × v2 is computed; at T2, W12 × v1 is added; at T3, e2 is idle; and at T4, W32 × v3 is added, so that e2 is obtained as the sum of these products. The product of the third row of W^T and the vector v is found by following e3: W33 × v3 is computed at T1, W23 × v2 is added at T2, W13 × v1 is added at T3, and e3 is idle at T4. The product of the fourth row of W^T and the vector v is found by following e4: e4 is idle at T1, W34 × v3 is added at T2, W24 × v2 at T3, and W14 × v1 at T4, which completes the calculation. Thus, in the present invention, the partial vector Wi* of W is stored in the local memory of PE-i, skewed so that Wii comes first, exactly as before. What changes places compared with the forward case is ei and vi: ei is accumulated while circulating counterclockwise on the trays, and vi stays resident inside PE-i.

Processing starts from the leftmost state with ej = 0. PE-i multiplies vi by Wij and adds the result to the ej currently in front of it. At the same time, the updated ej is transferred to the adjacent tray (it circulates counterclockwise around the ring). Repeating this four times yields all ej simultaneously.
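The procedure above can be illustrated with a small software simulation. The following is a minimal sketch, not the patented hardware: it assumes three PEs attached to the first three of four trays, and the function name `ring_transposed_matvec`, the 0-based indexing, and the sample values are all hypothetical choices made for illustration.

```python
def ring_transposed_matvec(W, v):
    """Compute e = W^T v with the accumulators e_j circulating on a ring.

    W is a list of m rows (one row per PE) with n entries each (one per tray).
    Returns (e, trace), where trace[j][t] is the 0-based PE that updated e_j
    at step t, or None when e_j sat on a tray with no PE attached (idle).
    """
    m, n = len(W), len(W[0])
    assert len(v) == m and m <= n
    # Tray p starts with accumulator e_p = 0; PE i is attached to tray i.
    trays = [(j, 0.0) for j in range(n)]
    trace = {j: [] for j in range(n)}
    for _ in range(n):
        for p in range(n):
            j, acc = trays[p]
            if p < m:
                # PE p keeps v[p] resident and adds W[p][j] * v[p] to e_j.
                trays[p] = (j, acc + W[p][j] * v[p])
                trace[j].append(p)
            else:
                trace[j].append(None)
        # Cyclic shift: the content of tray p moves to tray p - 1 (mod n).
        trays = trays[1:] + trays[:1]
    e = [0.0] * n
    for j, acc in trays:
        e[j] = acc
    return e, trace


W = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12]]
v = [1, 10, 100]
e, trace = ring_transposed_matvec(W, v)
print(e)
print(trace[0])  # order in which e_1 visits the PEs
```

In this sketch the trace of the first accumulator is PE 1, idle, PE 3, PE 2 (0-based: 0, None, 2, 1), which matches the order T1 to T4 described in the text.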

Thus, the neurocomputer of the present invention can be realized with any number of layers. It is flexible, offering a high degree of freedom in the choice of learning algorithm; it can use the full speed of the DSPs; it incurs no overhead in the DSP computation and is therefore fast; and it allows SIMD execution by the DSPs.

FIG. 6 is an explanatory diagram of a fifth embodiment of the present invention, which computes a matrix product using analog data. In the figure, components identical to those shown in FIG. 2 are denoted by the same symbols. 1d is the processing device of the data processing unit 1 and consists of, for example, an analog multiplier 1e and an integrator 1f. 2d is the data holding circuit of the tray 2 and consists of, for example, a sample/hold circuit 2f. 2e is the data transfer circuit of the tray 2 and consists of, for example, an analog switch 2g and a buffer amplifier 2h. 6 is the means for setting data in the tray 2 and consists of, for example, an analog switch 6d.

The operation of this embodiment is the same as the operation described with reference to the principle diagram of the present invention (FIG. 1).

FIG. 7 is an explanatory diagram of a sixth embodiment of the present invention and shows the multiplication of a band matrix by a vector. In the figure, components identical to those shown in FIG. 2 are denoted by the same symbols.

The operation of this embodiment is described with reference to FIG. 7(B).

In the present invention, the product of an m × n (n ≥ m ≥ 1) band matrix A of width k and a vector x with n elements (the result being a vector y with m elements) is computed with the configuration shown in FIG. 7(A): m data processing units 1, each having two inputs, a multiplication function, and a function for accumulating the multiplication results; n trays 2; and input data supply means connected to each data processing unit 1. Processing follows the procedure shown in FIG. 7(B), in the time sequence shown in FIG. 7(C) and FIG. 7(D). Multiplication of a band matrix of width k by a vector can therefore be executed in a processing time proportional to k.
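The band-matrix procedure can be sketched in software as follows. This is an illustrative model only, under stated assumptions: row i of A is taken to be nonzero only in columns i + lo through i + lo + k - 1, PE i is attached to tray i, and the names `band_matvec_on_ring` and `lo` are hypothetical and do not come from the specification.

```python
def band_matvec_on_ring(A, x, k, lo=0):
    """Multiply an m x n band matrix A of width k by x in k shift steps.

    Assumes row i of A is nonzero only in columns i + lo .. i + lo + k - 1.
    """
    m, n = len(A), len(x)
    assert m <= n
    # Set x on the trays already shifted to the start of the band:
    # tray p holds x[p + lo] (indices wrap around the ring).
    trays = [x[(p + lo) % n] for p in range(n)]
    y = [0.0] * m
    for t in range(k):
        for i in range(m):
            j = i + lo + t  # column that PE i sees at step t
            if 0 <= j < n:
                y[i] += A[i][j] * trays[i]
        # Shift the vector one position in a single direction.
        trays = trays[1:] + trays[:1]
    return y


A = [[0, 1, 2, 3, 0, 0],
     [0, 0, 4, 5, 6, 0],
     [0, 0, 0, 7, 8, 9]]
x = [1, 2, 3, 4, 5, 6]
y = band_matvec_on_ring(A, x, k=3, lo=1)
print(y)
```

The loop runs k times regardless of n, which is the point of the embodiment: the vector is never rotated all the way around the ring.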

What matters in this embodiment is that the vector x is not rotated through a full cycle and that, unlike in the first embodiment and the others, the vector x is set on the shift register 3 already shifted to exactly the position where the band begins. That is, when processing starts from the starting position of the band, performing the multiply-accumulate operations while shifting in one direction completes the processing in a time proportional to k. Although not illustrated, if for some reason processing must start with the vector placed partway along the band, it is clear that the vector x need only be shifted to one end first; in that case, the ability of the shift register 3 to shift in both directions becomes significant.

For example, when processing starts from the center of the band, the vector is first shifted to the right by k/2 (fractions discarded) and the multiply-accumulate operations are then performed while shifting in the opposite direction (left in this case), so that processing completes in a total time proportional to (3/2)k.

If the shift register 3 could not shift in both directions, the vector x would have to be rotated through a full cycle, which would require a time proportional to the size n of the band matrix rather than to its width k. For a large band matrix this difference is very large, and an advantage of the system of the present invention is that the multiplication of a band matrix by a vector can be executed in a processing time proportional to the width k of the band matrix.
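The step counts discussed above can be compared with a simple cost model. The sketch below only counts shift and multiply-accumulate steps under an assumed model (one step per shift, k multiply-accumulate steps once the vector is aligned with the start of the band); the function `shift_steps` and its parameters are hypothetical and are not part of the specification.

```python
def shift_steps(n, k, start, bidirectional):
    """Count steps needed when x is initially placed `start` positions
    past the start of the band (0 <= start < k) on a ring of n trays."""
    if not 0 <= start < k:
        raise ValueError("start must lie inside the band")
    if start == 0:
        return k                 # already aligned: k multiply-accumulate steps
    if bidirectional:
        return start + k         # shift back by `start`, then k steps
    return (n - start) + k       # one-way ring: go around the long way first


n, k = 1000, 10
print(shift_steps(n, k, 0, True))        # start at the band start
print(shift_steps(n, k, k // 2, True))   # start at the band center
print(shift_steps(n, k, k // 2, False))  # same start, one-way register
```

With n = 1000 and k = 10, the model gives 10 steps from the band start, 15 steps (about 3k/2) from the center with a bidirectional register, and 1005 steps with a one-way register, which illustrates why the cost then scales with n rather than k.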

FIG. 8 shows the structure of the tray in detail.

A tray is basically just a one-word latch, but access from the DSP and transfer to the adjacent tray can be carried out in a single cycle (post-shift).

Function switching is selected by the low-order bits of the address lines and takes place at the same time as the data access, which improves speed.

One tray amounts to about 1200 basic cells in a gate array, and two to four trays can be placed on a single chip.

It is also possible to build several words of work registers into a tray.

FIG. 9 is a block diagram of a neurocomputer actually constructed using an embodiment of the present invention.

The basic configuration of Sandy is a SIMD multiprocessor in which DSPs are coupled in a one-dimensional torus (ring).

Its distinctive feature is that it operates as a SIMD machine even though its coupling topology and operation resemble those of a one-dimensional systolic array.

The "trays", each connected to a DSP by a bidirectional bus, are latches with a transfer function. They are connected to one another in a ring and together form a cyclic shift register. This shift register is hereinafter called the ring.

Each DSP has 2K words of internal memory and 64 words of external RAM; the internal memory can be accessed in one cycle and the external memory in one to two cycles.

The external RAM is connected through a common bus to the VME bus of the host computer for the initial loading of programs and data. External inputs are also connected to the host computer through a buffer memory.

FIG. 10 is a time-space chart for learning in the embodiment of the present invention; the vertical direction shows the number of processors and the horizontal direction shows time. I is the number of processors in the input layer, H is the number of processors in the hidden layer, and the remaining symbol corresponds to the multiply-accumulate time of a processor.

The time the hidden layer needs for the forward sum of products of the input signals is proportional to the product of the number of processors in the input layer, I, and the multiply-accumulate time of a single processor. The sigmoid is then calculated. The output layer likewise performs its forward sum of products (2HI) and sigmoid. Since the output layer has fewer processors than the hidden layer, the ring itself also becomes smaller here. Next, the teacher signal input is received, the error is calculated, and the error is back-propagated. This error calculation includes the error calculation through the sigmoid of the output layer; the backward sum of products of the output layer is performed, and the weights of the output layer are updated by way of gradient vector calculation and a low-pass filter. Then, after the error calculation through the sigmoid of the hidden layer, the hidden layer performs no backward sum of products and only its weights are updated.
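The statement that the forward sum of products takes a time proportional to the number of input-layer processors can be illustrated with a step-counting sketch. It is an assumption-laden model rather than the actual Sandy firmware: the inputs circulate on a ring with one tray per input, each PE holds one row of weights, and the names `ring_forward`, `W`, and `x` are hypothetical.

```python
import math


def ring_forward(W, x):
    """Forward pass of one layer on a ring: returns (outputs, shift steps).

    W has one row per PE (per output neuron); x has one entry per tray.
    Assumes the number of PEs does not exceed the number of trays.
    """
    n_in, n_out = len(x), len(W)
    assert n_out <= n_in
    ring = list(x)
    acc = [0.0] * n_out
    steps = 0
    for _ in range(n_in):
        for i in range(n_out):
            j = (i + steps) % n_in  # index of the input now in front of PE i
            acc[i] += W[i][j] * ring[i]
        ring = ring[1:] + ring[:1]  # cyclic shift of the inputs
        steps += 1
    # The sigmoid is applied after the sum of products is complete.
    return [1.0 / (1.0 + math.exp(-u)) for u in acc], steps


W = [[1, 2, 3],
     [4, 5, 6]]
x = [1, 0, -1]
out, steps = ring_forward(W, x)
print(out, steps)
```

In this model the number of shift steps equals the number of inputs on the ring, independent of how many PEs work in parallel.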

[Effect of the invention]

As explained above, the present invention makes it possible to process data in parallel, for a wider range of processing than conventional methods, without the overhead of the data transfers that accompany data processing. Because high-speed data processing proportional to the number of data processing units can be achieved, the invention contributes greatly to improving the performance of data processing devices that perform matrix operations or neurocomputer operations on analog signals.

[Brief description of the drawings]

FIG. 1(A) is a diagram of the principle configuration of the present invention, FIG. 1(B) is an operation flowchart of the present invention, FIG. 1(C) is an operation overview diagram of the present invention, and FIG. 1(D) is an operation time chart of the present invention.
FIG. 2(A) is a configuration diagram of the first embodiment, FIG. 2(B) is its operation flowchart, FIG. 2(C) is its operation overview diagram, and FIG. 2(D) is its operation time chart.
FIG. 3(A) is a configuration diagram of the second embodiment, FIG. 3(B) is its operation flowchart, FIG. 3(C) is its operation overview diagram, and FIG. 3(D) is its operation time chart.
FIG. 4(A) is a configuration diagram of the third embodiment, FIG. 4(B) is its operation flowchart, FIG. 4(C) is its operation overview diagram, FIG. 4(D) is its operation time chart, and FIG. 4(E) is its detailed operation flowchart.
FIG. 5(A) is a configuration diagram of the fourth embodiment, FIG. 5(B) is the neuron model of the fourth embodiment, FIG. 5(C) is the network of the fourth embodiment, FIG. 5(D) is the forward processing flowchart of the fourth embodiment, and FIG. 5(E) is the learning processing flowchart of the fourth embodiment.
FIG. 5(F) is a processing flowchart for error back-propagation learning on Sandy, FIG. 5(G) is a time chart for computing the matrix-vector product u = Wx on Sandy, and FIG. 5(H) is a time chart for computing the matrix-vector product e = W^T v with the transposed matrix.
FIG. 6(A) is a configuration diagram of the fifth embodiment, FIG. 6(B) is its operation flowchart, FIG. 6(C) is its operation overview diagram, and FIG. 6(D) is its operation time chart.
FIG. 7(A) is a configuration diagram of the sixth embodiment, FIG. 7(B) is its operation flowchart, FIG. 7(C) is its operation overview diagram, and FIG. 7(D) is its operation time chart.
FIG. 8 is a diagram showing the structure of the tray in detail.
FIG. 9 is a block diagram of a neurocomputer actually constructed using an embodiment of the present invention.
FIG. 10 is a time-space chart for learning in the embodiment of the present invention.
FIG. 11(A) is a diagram of the principle configuration of the common-bus SIMD system, and FIG. 11(B) is an operation flowchart of the matrix-vector product in the common-bus SIMD system.
FIG. 12(A) and FIG. 12(B) are diagrams of the operating principle of the matrix-vector product in the ring systolic system, and FIG. 12(C) is a diagram of the operating principle of the matrix-vector product in the ring systolic system.

1: data processing unit
2: tray
3: shift register
4: storage device
5: synchronization means
6: data setting means
7: length changing means
11: input of data processing unit 1
12: second input of data processing unit 1
21: first input of tray 2
22: first output of tray 2
23: second output of tray 2
24: second input of tray 2
82: first input of PE 91
83: first output of PE 91
84: second input of PE 91
85: second output of PE 91
91: PE
92: input/output of PE 91
93: common bus

Claims

1) A parallel data processing system comprising: a plurality of data processing units (1) each having at least one input (11); a plurality of trays (2) each having a first input (21) and an output (22) and each performing data holding and data transfer, all or some of the trays (2) each having a second output (23) connected to the first input (11) of a data processing unit (1); and shift means (3) formed by connecting the first inputs (21) and outputs (22) of the trays (2) to one another, wherein data transfer on the shift means (3), data transfer between the trays (2) and the data processing units (1), and data processing by the data processing units (1) are performed synchronously, thereby performing matrix operations or neurocomputer operations on analog signals.
2) The parallel data processing system according to claim 1, wherein the shift means (3) is a cyclic shift register.
3) The parallel data processing system according to claim 1 or 2, further comprising means for changing the length of the shift means (3).
4) The parallel data processing system according to claim 3, wherein the means for changing the length of the shift means (3) is input switching means.
5) The parallel data processing system according to claim 3, wherein the means for changing the length of the shift means (3) comprises external data supply means and input selection means.
6) The parallel data processing system according to any one of claims 1 to 5, wherein the data processing unit (1) has a first output (21), the tray (2) has a second input (24) connected to the first output (21), and means is provided for writing data from the data processing unit (1) to the tray (2).
7) The parallel data processing system according to claim 6, wherein the data transfer path between the data processing unit (1) and the tray (2) is a bus shared for input and output.
8) The parallel data processing system according to claim …, wherein, when the result of data processing is to be processed further, the result is transferred to the tray (2) using the writing means.
9) The parallel data processing system according to claim 1, wherein the trays (2) are provided with third inputs (25) and outputs (26) that are interconnected, and the shift means (3) is a bidirectional shift register.
10) The parallel data processing system according to claim 9, wherein the data transfer path between the trays (2) constituting the bidirectional shift register is a bus shared for input and output.
11) The parallel data processing system according to claim 9 or 10, wherein data is transferred in both directions on the bidirectional shift register.