JPS58159169A

JPS58159169A - Parallel processing system

Info

Publication number: JPS58159169A
Application number: JP4220482A
Authority: JP
Inventors: Hiroshi Hatsuda; 發田　弘
Original assignee: NEC Corp; Nippon Electric Co Ltd
Current assignee: NEC Corp
Priority date: 1982-03-17
Filing date: 1982-03-17
Publication date: 1983-09-21
Also published as: JPS6259345B2

Abstract

PURPOSE: To improve parallelism without enlarging the memory switch, by making each processor that shares the parallel processing a parallel processor consisting of plural processor elements. CONSTITUTION: Each of processors PP1' to PP16' reads from and writes to any one of data memories DM1 to DM32 via a memory switch MS. A control processor CP has control-dedicated memories CPM1 and CPM2 and can also access the memories DM1 to DM32 via the switch MS. Furthermore, the processor CP can communicate with each processor via an interface (a) and the control processor interface CPI' of each of processors PP1' to PP16'.

Description

DETAILED DESCRIPTION OF THE INVENTION

[Technical field to which the invention pertains]

The present invention relates to a parallel processing system, and in particular to a parallel processing system in a data processing apparatus.

In general, parallel processing is one way to speed up arithmetic processing.

In this approach, the portions of a program that can be executed in parallel are each executed on a different processor, with the aim of ideally obtaining N times the performance from N processors. (In practice, because of the portions that cannot be executed in parallel and the extra time needed to control parallel operation, that is, the overhead, the performance obtained is less than N times.)

[Prior art]

A conventional parallel processing system includes a control processor, a plurality of data memories each storing data, a plurality of processors connected in parallel to the control processor, and a memory switch for interconnecting the plurality of processors and the plurality of data memories in parallel. Each of the processors includes a processor element, a control processor interface for connecting the processor element to the control processor, and a memory switch interface for connecting the processor element to the memory switch.

Next, a conventional parallel processing system will be described in detail with reference to the drawings.

FIG. 1 is a system configuration diagram showing an example of a conventional parallel processing system, and FIG. 2 is a detailed block diagram showing an example of the processor shown in FIG. 1.

The parallel processing system shown in FIG. 1 includes a control processor CP, control-dedicated memories CPM1 and CPM2 reserved for the control processor CP, processors PP1 to PP16 connected in parallel to the control processor CP, memories MM1 to MM32 that store programs and data, and a memory switch MS having 16 × 32 = 512 connection points for interconnecting the 16 processors and the 32 memories in parallel.

The processors PP1 to PP16 all have the same configuration and, as shown in FIG. 2, each includes a processor element PE, a memory switch interface MSI, and a control processor interface CPI. The memory switch interface MSI passes access requests from the processor element PE for reading data or programs to the memories MM1 to MM32 via the memory switch MS, supplies the data read from the memories MM1 to MM32 to the processor element PE, and passes results computed in the processor element PE to the memories MM1 to MM32 for storage. The control processor interface CPI is connected to the control processor CP via the interface a. It receives the program execution start instruction START and the program execution stop instruction STOP from the control processor CP and supplies them to the processor element PE, and it supplies the processing end notification END from the processor element PE to the control processor CP.

That is, the 16 processors PP1 to PP16 can access the 32 memories MM1 to MM32 via the memory switch MS, and each of the processors PP1 to PP16 can execute a program independently. Through the interface a to the processors PP1 to PP16, the control processor CP supplies the program execution start instruction START and receives the processing end notification END when a processor has completed execution.

Under the control of the control processor CP, the processors PP1 to PP16 share and execute the parallel-processing portion of the program to be solved. For example, for the computation a1 + b1, a2 + b2, ..., an + bn, the i-th processor PPi computes ai + bi.

To improve the performance of such a conventional parallel processing system, it is necessary either to raise the performance of each processor or to increase the number of processors.

However, raising the performance of a processor increases its physical size, which makes it difficult to arrange many of them side by side. Moreover, when the number of processors is increased, the memories must also be expanded so that they can be used in parallel, and the memory switch grows with the product of the number of processors and the number of memories. It becomes complex and large, and again difficult to realize (for example, with a crossbar switch, doubling both the number of processors and the number of memories makes the switch 2 × 2 = 4 times larger). Because of these drawbacks, large-scale, ultra-high-performance parallel processing systems have hardly been put into practical use.

That is, the conventional parallel processing system has the drawback that it is difficult to increase the degree of parallelism.

[Purpose of the invention]

An object of the present invention is to provide a parallel processing system in which the degree of parallelism can be increased.

More specifically, an object of the present invention is to provide a large-scale, ultra-high-performance parallel processing system that overcomes the above drawback by making each processor that shares the parallel processing itself a parallel processor consisting of a plurality of processor elements, thereby raising the degree of parallelism without enlarging the memory switch.

[Structure of the invention]

The parallel processing system of the present invention includes a control processor, a plurality of data memories each storing data, a plurality of processors connected in parallel to the control processor, and a memory switch for interconnecting the plurality of processors and the plurality of data memories in parallel. Each of the plurality of processors includes a plurality of processor elements provided in parallel, a program memory that is provided in common to the processor elements and stores a program, a control processor interface for connecting the plurality of processor elements to the control processor, and a memory switch interface for connecting the plurality of processor elements to the memory switch.

That is, the parallel processing system of the present invention comprises: a plurality of arithmetic processing units, each consisting of a plurality of processor elements, a program memory shared by those processor elements, and a circuit that, at each data memory access timing, selects and processes one of the data memory access requests issued by those processor elements; a plurality of data memories; and a memory switch that enables access from any of the arithmetic processing units to any of the data memories.

Furthermore, in addition to the above configuration, the parallel processing system of the present invention includes a control processor, command means by which the control processor instructs all of the processor elements to start program execution, and means by which each processor element notifies the control processor that program execution has ended. The system is configured so that, under the control of the control processor, the parallel-processing portion of a single program is executed in parallel by all of the processor elements.

That is, by configuring each processor that shares the parallel processing from a plurality of processor elements operating in parallel, the parallel processing system of the present invention increases the effective number of parallel processors without enlarging the memory switch.

That is, the parallel processing system of the present invention includes n processors; m data memories, where m is n or more (for example n or 2n); and a memory switch having n × m connection points for connecting the n processors and the m data memories. Internally, each of the n processors includes ℓ processor elements; one program memory, shared by the ℓ processor elements, that stores the programs they are to execute; and a memory switch interface that receives and processes the access requests from each of the ℓ processor elements to the m data memories. At each memory access timing, the memory switch interface selects one of the access requests issued by one or more of the ℓ processor elements and sends the selected request to a data memory via the memory switch. If the access request is a read request, the data returned from the data memory is passed to the requesting processor element. Because the memory switch interface thus narrows the data memory access interfaces down to a single one, the scale of the memory switch (the number of interfaces for connecting processors) can be reduced to 1/ℓ.

In this case, accesses to the data memories contend among the ℓ processor elements, which could become a performance bottleneck, but the problem is mitigated by providing a program memory that is dedicated to programs and shared by the ℓ processor elements. In an ordinary computer, programs and data are stored in the same memory. In the processor used in the present invention, the program is stored in a dedicated program memory, so accesses to memory through the memory switch interface are limited to data accesses, and the access frequency is reduced to as little as about one half of that of an ordinary computer.

[Description of the embodiments]

Next, embodiments of the present invention will be described in detail with reference to the drawings.

FIG. 3 is a system configuration diagram showing an embodiment of the present invention, and FIG. 4 is a detailed block diagram of the processor shown in FIG. 3.

Each of the processors PP1' to PP16' is a parallel processor containing eight processor elements PE1 to PE8 and is capable of executing eight programs in parallel. The number of processors and the number of processor elements in each processor are not limited to this example.

Each of the processors PP1' to PP16' can read data from and write data to any of the data memories DM1 to DM32 via the memory switch MS. The number of data memories is 32 in FIG. 3, but it is determined by the number of processors, the performance of the data memories, how frequently the data memories are used, and so on, and is not limited to this example.

There are many ways to configure the memory switch MS, including a full crossbar, and the configuration is not limited to any one of them.

Here a full crossbar is assumed as an example, so that even when several processors issue access requests to the data memories simultaneously, no contention occurs unless they access the same data memory. Using a memory switch MS of a different configuration does not affect the benefit of the present invention.

The control processor CP has control-dedicated memories CPM1 and CPM2 and can also access the data memories DM1 to DM32 via the memory switch MS.

The number of control-dedicated memories is two in this example, but it is not limited to this. The control processor CP can communicate with each processor through the interface a and the control processor interface CPI' of each of the processors PP1' to PP16'.

FIG. 4 is a block diagram showing an example of the processor shown in FIG. 3.

Each of the processor elements PE1 to PE8 is capable of executing a program, and the program is stored in a dedicated program memory PM connected in common to the processor elements PE1 to PE8. The program memory controller PMC controls access to the program memory PM, for example by arbitrating among accesses from the processor elements PE1 to PE8.

The memory switch interface MSI' is a control circuit through which the processor elements PE1 to PE8 access the data memories DM1 to DM32 shown in FIG. 3. When several of the processor elements PE1 to PE8 issue access requests at the same time, it selects one of them according to a fixed algorithm and sends the selected access request through the memory switch MS to one of the data memories DM1 to DM32. For a read operation, it also controls the delivery of the data returned from the addressed data memory to the requesting processor element.
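The selection step performed by the memory switch interface can be sketched in software. The patent only says that one request is chosen "according to a fixed algorithm"; the round-robin policy below, the class name `RoundRobinArbiter`, and the 0-based element indices (index 0 standing for PE1) are assumptions made for illustration, not the circuit described in the source.

```python
from typing import Optional, Sequence


class RoundRobinArbiter:
    """Grants at most one of several simultaneous requests per access timing.

    Round-robin is an assumed policy; the source does not name the algorithm.
    """

    def __init__(self, n_elements: int = 8) -> None:
        self.n = n_elements
        self.last = self.n - 1  # so that element 0 (PE1) is checked first

    def select(self, requesting: Sequence[bool]) -> Optional[int]:
        """Return the index of the granted element, or None if nobody requests."""
        for offset in range(1, self.n + 1):
            candidate = (self.last + offset) % self.n
            if requesting[candidate]:
                self.last = candidate
                return candidate
        return None


if __name__ == "__main__":
    arbiter = RoundRobinArbiter()
    # PE1, PE3 and PE4 request access in the same access timing.
    pending = [True, False, True, True, False, False, False, False]
    order = []
    while any(pending):
        granted = arbiter.select(pending)
        pending[granted] = False  # the granted request goes to the memory switch
        order.append(granted)
    print(order)
```

Requests that are not granted simply stay pending and are considered again at the next access timing, which matches the statement that the other requesters are made to wait.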

The control processor interface CPI' is a circuit for communicating with the control processor CP. It controls communication between each of the processor elements PE1 to PE8 and the control processor CP, as well as communication between the processor PP1' to PP16' itself and the control processor CP. (In this system what software sees is the individual processor elements PE1 to PE8; the processors PP1' to PP16' are meaningful only as physical groupings, that is, as equipment units, so communication with the control processor CP is, logically, mainly between the processor elements and the control processor CP.)

Examples of this communication include the program execution start instruction START, which tells the processor elements PE1 to PE8 to begin executing a program, and the program execution stop instruction STOP. The processor elements PE1 to PE8 begin executing the program on receiving START and stop operating when a predetermined condition is satisfied or when STOP is received. The control processor interface CPI' also controls the transfer of information from the processor elements PE1 to PE8 to the control processor CP via the interface a. For example, after execution has started in response to START, when a certain condition is satisfied, such as a particular processor element among PE1 to PE8 having finished execution, it is the control processor interface CPI' that reports this to the control processor CP.

Each of the processor elements PE1 to PE8 differs in where it reads its instruction words from. In an ordinary computer, instruction words and data are stored in the same memory, but in a parallel processing system using the present invention, instruction words are stored in the program memory PM in order to reduce the load on the access paths to the data memories DM1 to DM32. This exploits the following property. Data must be passed among the processor elements PE1 to PE8 and also among the processors PP1' to PP16', so it has to be stored in common data memories. A program has no such requirement and can be kept in a dedicated memory that is shared by the processor elements PE1 to PE8 but provided separately for each of the processors PP1' to PP16'.

Each of the processor elements PE1 to PE8 repeatedly reads data from the data memories DM1 to DM32, processes it according to the program stored in the program memory PM, and returns the result to the data memories DM1 to DM32.

In the parallel processing system shown in FIG. 3, a program is executed as follows.

As an example, consider computing the sum of the products Ai × Bi for i = 1 to 128.

Before the computation starts, the control processor CP places the data Ai and Bi in the data memories DM1 to DM32. For example, data A1 to A8 are stored in data memory DM1, data A9 to A16 in data memory DM2, and so on, with data A121 to A128 stored in data memory DM16. Likewise, data B1 to B8 are stored in data memory DM17, data B9 to B16 in data memory DM18, and so on, with data B121 to B128 stored in data memory DM32.
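The data placement can be written as a small index calculation. This is a sketch that assumes the layout reads as eight consecutive operands per data memory, with the A operands in DM1 to DM16 and the B operands in DM17 to DM32; the function names `memory_for_a` and `memory_for_b` are hypothetical.

```python
ELEMENTS_PER_MEMORY = 8   # assumed: 128 operands spread evenly over 16 memories
A_MEMORIES = 16           # DM1..DM16 hold A, DM17..DM32 hold B


def memory_for_a(i: int) -> int:
    """Data memory number (1..16) holding A_i, for i in 1..128."""
    return (i - 1) // ELEMENTS_PER_MEMORY + 1


def memory_for_b(i: int) -> int:
    """Data memory number (17..32) holding B_i, for i in 1..128."""
    return memory_for_a(i) + A_MEMORIES


if __name__ == "__main__":
    for i in (1, 8, 9, 128):
        print(f"A{i} -> DM{memory_for_a(i)}, B{i} -> DM{memory_for_b(i)}")
```

With this layout the eight elements of one processor read their A operands from one data memory and their B operands from another, so their requests are serialized by that processor's memory switch interface rather than colliding with other processors in the crossbar.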

In this example the system contains 16 (the number of processors) × 8 (the number of processor elements in each processor) = 128 processor elements, and the i-th processor element PEi computes Ai × Bi and stores the result Ci in a data memory. The program for this computation is stored in the program memory PM common to the processor elements PE1 to PE8, and the instruction address register in each processor element PE1 to PE8 is set to the program memory PM address of the first instruction word that the processor element is to execute. This is done under the control of the control processor CP, either from the data memories DM1 to DM32 through the memory switch MS and the memory switch interface MSI', or through the interface a and the control processor interface CPI'.

The control processor CP carries out the above preparation and, when it is complete, sends a program execution start instruction START addressed to all 128 processor elements to the processors PP1' to PP16' through the interface a. In response, each of the processor elements PE1 to PE8 reads an instruction word from the program memory PM according to the value of its instruction address register, decodes it, and executes it.

Taking processor element PE1 in processor PP1' as an example, it computes A1 × B1 from data A1 read from data memory DM1 and data B1 read from data memory DM17, and stores the result C1 in a data memory.

Similarly, processor element PE2 computes A2 × B2 and stores the result C2, and so on, with processor element PE8 performing A8 × B8 → C8. The processor elements PE1 to PE8 execute these operations concurrently, in parallel.

Although the processor elements are described above as executing the same program, the programs may differ. Even with the same program, when conditional branches are involved, each processor element may end up executing a different instruction sequence partway through.

Here, the program stored in the program memory PM is explained briefly.

If the programs stored in the program memory PM differ for each of the processor elements PE1 to PE8, there is no particular problem. When a single program is shared by all of the processor elements PE1 to PE8, however, a special measure is needed to make this possible. This is because the arithmetic operations, such as addition and multiplication, and their order are common to the processor elements PE1 to PE8, whereas the data used, which is stored in the data memories DM1 to DM32, differs for each of the processor elements PE1 to PE8. One way to handle this is to use, for example, an index register to modify the operand address of the instruction words in the program. For instance, for the instruction "add the data at address A to the accumulator", each processor element PE1 to PE8 holds its own processor element number i in its index register and, when executing the instruction, modifies address A with the index register and adds the data at address A + i to the accumulator. In this way the processor elements PE1 to PE8 all perform the same addition but can use different data.
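The index-register idea can be illustrated with a toy model in which every processor element runs the same "add the data at address A to the accumulator" instruction but applies its own element number as an index. The function `add_indexed`, the dictionary standing in for a data memory, and the base address 100 are all hypothetical and chosen only for illustration.

```python
def add_indexed(memory, base_address, index_register, accumulator):
    """Execute 'add the data at address A to the accumulator' with index modification.

    The effective address is base_address + index_register, so the same
    instruction word touches different data in each processor element.
    """
    return accumulator + memory[base_address + index_register]


if __name__ == "__main__":
    A = 100  # operand address encoded in the shared instruction word
    # Toy data memory: address 100 + i holds the value 10 * i.
    memory = {A + i: 10 * i for i in range(1, 9)}
    # Each processor element PE_i keeps its own number i in its index register.
    results = [add_indexed(memory, A, i, accumulator=0) for i in range(1, 9)]
    print(results)
```

Only the index register contents differ between elements, so one copy of the program in the shared program memory is enough.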

Access requests from the processor elements PE1 to PE8 to the data memories DM1 to DM32 (requests to read Ai and Bi or to store Ci) are arbitrated by the memory switch interface MSI'. When requests conflict, only one is selected and the others are made to wait, so the instruction execution timing of the processor elements PE1 to PE8 may drift apart. Similarly, contention arises among the processor elements PE1 to PE8 for access to the program memory PM, and this is arbitrated by the program memory controller PMC, which is the control section of the program memory PM.

Therefore, even when they execute the same program, the processor elements PE1 to PE8 do not all perform the same operations at the same instant in perfect synchronization.

When the operation Ai × Bi → Ci has been completed, the control processor CP is notified through the control processor interface CPI' and the interface a. The control processor CP waits for completion notifications from all 128 processor elements and then computes ΣCi. Because the results Ci are stored in the data memories DM1 to DM32, the control processor CP accesses the data memories DM1 to DM32 via the memory switch MS, reads the results Ci, and adds them in the order read. This operation is the same as addition in an ordinary computer: a program in the control processor CP reads C1, C2, ..., C128 one at a time and adds them. When this addition is finished, the desired answer has been obtained.
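The overall flow (start all elements, wait for every completion notice, then sum in the control processor) can be mimicked in software. This is a rough analogy only: threads stand in for processor elements, submitting work stands in for START, and collecting results stands in for END. The names `processor_element` and `control_processor` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def processor_element(a_i, b_i):
    """Work done by one processor element after START: C_i = A_i * B_i."""
    return a_i * b_i


def control_processor(a, b, n_elements=128):
    """Start all elements, wait for every completion, then sum sequentially."""
    with ThreadPoolExecutor(max_workers=n_elements) as pool:
        # Submitting the tasks plays the role of the START instruction.
        futures = [pool.submit(processor_element, x, y) for x, y in zip(a, b)]
        # Collecting every result plays the role of waiting for all END notices.
        c = [f.result() for f in futures]
    total = 0
    for c_i in c:  # the control processor adds C_1, C_2, ... one at a time
        total += c_i
    return total


if __name__ == "__main__":
    a = list(range(1, 129))
    b = [2] * 128
    print(control_processor(a, b))
```

The final loop performs one addition per result, which is the sequential step that the later partial-sum variant in the description aims to shorten.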

Each of the processor elements PE1 to PE8 may notify the control processor CP individually as it finishes, as described above. Alternatively, the notifications could be collected within each of the processors PP1' to PP16' and sent together, which would reduce the volume of communication with the control processor CP.

Also, instead of the control processor CP performing the entire ΣCi computation as described above, the processor elements PE1 to PE8 may carry out part of it.

For example, C1 + C2 + ... + C8 can be computed inside processor PP1' as follows. The four sums (C1 + C2), (C3 + C4), (C5 + C6), and (C7 + C8) are computed in parallel using four of the processor elements PE1 to PE8, giving D1, D2, D3, and D4. Next, (D1 + D2) and (D3 + D4) are computed in parallel, giving E1 and E2. Finally, E1 + E2 is computed. If this is done inside each of the processors PP1' to PP16', the control processor CP only has to compute the sum of the 16 results left by the 16 processors PP1' to PP16' (in the previous example the control processor CP performs 127 additions, whereas with this method 15 additions suffice).
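The pairwise scheme described above is a tree reduction. The sketch below runs it sequentially in Python, so it shows only the order of the additions, not real parallel hardware; the helper name `pairwise_sum` and the sample values 1 to 128 are assumptions for illustration.

```python
def pairwise_sum(values):
    """Sum a non-empty list by repeatedly adding neighbouring pairs.

    Returns (total, steps). All additions within one step are independent,
    so separate processor elements could perform them at the same time.
    """
    level = list(values)
    steps = 0
    while len(level) > 1:
        nxt = [level[k] + level[k + 1] for k in range(0, len(level) - 1, 2)]
        if len(level) % 2:          # an odd element is carried to the next step
            nxt.append(level[-1])
        level = nxt
        steps += 1
    return level[0], steps


if __name__ == "__main__":
    c = list(range(1, 129))  # stand-ins for the results C1..C128
    # Each of the 16 processors reduces its own eight results in three steps.
    partials = [pairwise_sum(c[p * 8:(p + 1) * 8])[0] for p in range(16)]
    # The control processor then needs 16 - 1 = 15 additions instead of 127.
    total = sum(partials)
    print(total, len(partials) - 1, len(c) - 1)
```

For eight values the reduction takes three steps (four, two, then one addition), which corresponds to the D and E intermediate results in the text.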

Which method is used is controlled entirely by the program, so the choice is left to the user.

Thus, in the embodiment shown in FIG. 3, the 16 processors PP1' to PP16', each containing eight processor elements PE1 to PE8, can perform 128 operations in parallel. If 128 independent processors were actually installed, the memory switch MS would have to be 128 × 32 in scale, whereas in this example 16 × 32 suffices, which is advantageous for implementation (in terms of cost, equipment size, performance, and so on).
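The saving in switch size can be checked with a few lines of arithmetic, assuming a full crossbar whose size is the number of processor-side ports times the number of data memories. The function name `switch_points` is hypothetical.

```python
PROCESSORS = 16
ELEMENTS_PER_PROCESSOR = 8
DATA_MEMORIES = 32


def switch_points(ports: int, memories: int) -> int:
    """Connection points of a full crossbar with the given port counts."""
    return ports * memories


if __name__ == "__main__":
    # One switch port per processor element (128 independent processors).
    flat = switch_points(PROCESSORS * ELEMENTS_PER_PROCESSOR, DATA_MEMORIES)
    # One switch port per processor, shared by its eight elements.
    grouped = switch_points(PROCESSORS, DATA_MEMORIES)
    print(flat, grouped, flat // grouped)
```

The ratio equals the number of processor elements per processor, which is the reduction factor the description attributes to funnelling each processor's requests through one memory switch interface.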

[Effect of the invention]

In the parallel processing system of the present invention, each of the processors that are connected in parallel to the control processor and interconnected in parallel with a plurality of data memories via the memory switch contains, instead of a single processor element, a plurality of processor elements operating in parallel. Seen from the memory switch, each processor appears to have only a single processor element, yet a plurality of processor elements can be connected to the memory switch by time sharing, so the degree of parallelism can be increased.

That is, by arranging in parallel processors that each contain a plurality of processor elements and operating them in parallel under the control of a control processor, the parallel processing system of the present invention makes highly parallel computation easy to realize. Because the portions that cannot be computed in parallel are processed by the control processor, flexibility increases and the range of applications widens.

[Brief description of the drawings]

FIG. 1 is a system configuration diagram showing a conventional example, FIG. 2 is a detailed block diagram of the processor shown in FIG. 1, FIG. 3 is a system configuration diagram showing an embodiment of the present invention, and FIG. 4 is a detailed block diagram of the processor shown in FIG. 3.

CP: control processor; PP1 to PP16, PP1' to PP16': processors; CPM1, CPM2: control-dedicated memories; MS: memory switch; MM1 to MM32: memories; MSI, MSI': memory switch interfaces; CPI, CPI': control processor interfaces; PE, PE1 to PE8: processor elements; DM1 to DM32: data memories; PM: program memory; PMC: program memory controller; a: interface.

Claims

[Claims]

A parallel processing system comprising: a control processor; a plurality of data memories each storing data; a plurality of processors connected in parallel to the control processor; and a memory switch for interconnecting the plurality of processors and the plurality of data memories in parallel, wherein each of the plurality of processors includes a plurality of processor elements provided in parallel, a program memory that is provided in common to the processor elements and stores a program, a control processor interface for connecting the plurality of processor elements to the control processor, and a memory switch interface for connecting the plurality of processor elements to the memory switch.