JPS6313216B2

JPS6313216B2 -

Info

Publication number: JPS6313216B2
Application number: JP50228083A
Authority: JP
Inventors: Arufuretsudo Jon Desanteisu; Jozefu Shiigufuriido Shibingaa
Original assignee: YUNISHISU CORP
Current assignee: YUNISHISU CORP
Priority date: 1983-06-08
Filing date: 1983-06-08
Publication date: 1988-03-24
Also published as: JPS59501131A

Description

請求の範囲１主メモリを含み、オペレータおよび主メモリ
アドレスを含むオブジエクトコードからなる命令
コードを実行するためのデータ処理システムであ
つて、順次的な形態で前記命令コードを受けて、オペ
レータが前の処理結果を必要としない論理的非従
属性を示すときを決定して前記論理的非従属のオ
ペレータおよびそれに続く論理的に従属のオペレ
ータからなるストリングを論理的に非従属の待ち
行列にする手段と、前記形成された論理的非従属の待ち行列をスト
アする手段と、前記ストア手段に接続され、同時に実行するた
めに、前記待ち行列の別々のを個々に受ける複数
の処理手段と、前記処理手段の各々はローカルメ
モリを含み、前記命令コードに含まれる主メモリアドレスを
受けて、対応する主メモリ記憶位置からデータを
取出す主メモリアドレス手段と、前記主メモリに接続され、前記待ち行列の各ス
トリングに対応してローカルメモリアドレスを前
記主メモリへ与え、それにより前記主メモリから
取出されたデータを前記ローカルメモリの対応す
る記憶位置へ転送することができるようにするロ
ーカルメモリアドレス手段とを備える、データ処
理システム。２前記ローカルメモリアドレス手段は、特定の
論理的に非従属な待ち行列を構成するオペレータ
の対応するストリングに前記ローカルメモリアド
レスを付加えるようにされている、請求の範囲第
１項記載のデータ処理システム。３主メモリを含み、オペレータおよび主メモリ
アドレスを有するオブジエクトコードの順次的コ
ードを受けて実行するためのデータ処理システム
におけるデータ処理方法であつて、前記処理シス
テムは各々がローカルメモリを有する複数の処理
手段を含み、前記順次的なオブジエクトコードを受け、オペレータが前の処理の結果を必要としない論
理的非従属性を示すときを決定し、論理的に非従属のオペレータおよびそれに続く
論理的に従属のオペレータからなるストリングを
論理的に非従属の待ち行列にし、オブジエクトコードに含まれる主メモリアドレ
スを受けて前記主メモリの対応する記憶位置から
データを取出し、前記待ち行列の各ストリングに対応してローカ
ルメモリアドレスを前記主メモリへ与え、それに
より前記主メモリから、取出されたデータを対応
するローカルメモリの記憶位置に転送し、前記待ち行列の別々のものを、同時実行のた
め、異なつた個々の処理手段へ転送する、データ
処理方法。４前記ローカルメモリアドレスを、特定の論理
的非従属の待ち行列を構成するオペレータのうち
の対応するストリングに付加えることをさらに含
む、請求の範囲第３項に記載のデータ処理方法。５前記データ処理システムはジヨブ待ち行列を
含み、オペレータからなる前記論理的非従属の待
ち行列および対応するローカルメモリアドレス
を、続いて前記処理手段の別々のものに分配する
ために前記ジヨブ待ち行列へ転送することをさら
に含む、請求の範囲第４項記載のデータ処理方
法。明細書本願と直接または間接に関連する米国特許出願は
以下のとおりである。 1982年６月８日にAlfred J.De Santis等によつ
て出願された「多重処理エレメントのための従属
自由コードを発生する機構」という名称の出願番
号第386339号（特表昭59―501133号）。 1982年６月８日にAlfred J.De Santisによつて
出願された「従属自由コードのためのデータアイ
テムを再ネーミングするシステムおよび方法」と
いう名称の出願番号第386420号（特表昭59―
501132号）。発明の背景発明の分野この発明は、従属自由コードを発生するための
機構に関し、より特定的には複数の同時処理エレ
メントを用いるためのそのような機構に関する。先行技術の説明今日でもなほとんどのコンピユータは、本質的
に順次的である命令言語によつて駆動されまたそ
のような命令言語を実行するVon Neumanタイ
プのものである。さらに、そのような順次的な言
語は、個々の命令の間に多くの従属性を含み、そ
れによつて個々の命令は順序を無視して実行され
ることではできない。たとえば以下のようなシー
ケンスを考察する。Ｃ：＝Fn（Ａ、Ｂ）Ｄ：＝Fn＋ｉ（Ｃ、Ｅ）この２つの関数FnおよびFn＋ｉは、関数Fnの
結果が次の関数Fn＋ｉに対する入力として用い
られるので、論理的な従属性を有している。さらに、順次言語の欠点は、シーケンスまたは
ループが繰返されるときにメモリ取出しおよびコ
ード処理に冗長性が存在するということであり、
もしそれが改良されるならば、プロセツサの処理
高は増大されるであろう。処理システムの処理高を増大させる１つの方法
は、複数のプロセツサを多重処理モードで用いる
ことである。しかしながら、個々のプロセツサは
なお命令を順次的に実行しなければならず、唯一
の同時処理は、個々のプロセツサがプログラムの
別々のセグメントを実行しているときかまたは全
く別のプログラムを実行しているときにのみ存在
する。そのような多重処理システムは、たとえ
ば、Mott等の米国特許第3319226号および
Anderson等の米国特許第3419849号に開示されて
いる。処理高を増大させるさらに他の試みは、命令実
行の種々の副関数がオーバラツプするパイプライ
ニングを採用することである。これらのステツプ
を一連の命令とオーバラツプさせることによつ
て、命令実行は各クロツク時間行なわれることが
でき、それによつてプロセツサの処理高を増大さ
せる。処理高を増大させるためのこれらすべての方法
は、上述したような命令間の論理的従属性のゆえ
に、順次的な命令実行のために設計されている。
論理的な従属性のため、種々の命令が互いに従属
性なく実行されて一群のまたは多数の処理エレメ
ントによる処理を容易に適合させる真の同時処理
は達成されることができない。応用言語は、各ステートメントが本質的に互い
に無関係であり、したがつてそのような応用ステ
ートメントを縮小するように設計された処理エレ
メントの回路網によつて同時に実現されることが
できるという点において、命令言語とは異なつて
いる。そのような応用言語プロセツサの例は、
Bolton等の米国特許出願第281064号および
Hagenmaier等の米国特許出願第281065号におい
て与えられている。これらの両出願は、1981年７
月７日に出願され、本願の譲受人に譲渡された。
そのような応用言語は、それらが設計によつて
Von Neuman的意味における順次的でないとい
う点において、命令言語とは異なつている。しか
しながら、今日用いられるほとんどのプログラム
ライブラリは、命令言語で書かれており、またそ
れらのライブラリを用いるべきデータ処理システ
ムの更新またはさらに別の世代は、命令言語を実
行するようにされなければならない。処理高が増大され得る１つの方法は、前の処理
の結果に依存しないオブジエクトコードのセグメ
ントを認識し、それらのセグメントを複数の処理
エレメントによつて同時に処理され得る非従属シ
ーケンスまたは待ち行列に形成することである。
このことはもちろん、オペランドがメモリに存在
するときのそのもとの値を破壊することなく処理
がオペランドに対して実行され得るといつた方法
のオペランドの取扱いを必要とする。異なつた記
号名が、この目的のために任意のデータアイテム
を参照するのに指定され得る。コードまたは記号
のそのような待ち行列の配列は、処理装置による
同時処理をさらに適合させる。この発明の目的は、従属自由命令コードを発生
するための改良された機構を提供することであ
る。この発明の他の目的は、多重処理エレメントに
よる実行のため従属自由命令コードを提供するこ
とである。この発明のさらに他の目的は、従属自
由命令コードを複数の処理エレメントに同時的態
様で与えるための改良された機構を提供すること
である。この発明のさらに他の目的は、冗長なメモリ取
出しがなく、かつそのコードが一連のそのような
コードの処理のために再処理されなくてもよい特
性である命令コードを発生するための機構を提供
することである。発明の概要上述した目的を達成するために、この発明は、
オブジエクトコードのストリングを受けそれらを
高レベルのタスクに形成し、論理的に非従属であ
るそのようなタスクのシーケンスを決定し、それ
によつてそれらが別々に実行されるようにする、
データプロセツサのためのキヤツシユ機構に向け
られている。このキヤツシユ機構は、種々のタス
クによつて必要とされるすべてのメモリアクセス
を行ない、かつこれらのタスクを、種々のデータ
アイテムがストアされなかつたローカルメモリに
対する対応するポインタまたはリフアレンスとと
もにストアする。このキヤツシユ機構は、記号翻
訳テーブルを利用し、そこではタスクは、ローカ
ルメモリに対する種々のリフアレンスまたはポイ
ンタを表わす記号とともに待ち行列の態様でスト
アされる。この方法において、種々のデータアイ
テムは別々のタスクに用いるための別々の記号ま
たは記号名を割当てられることができ、したがつ
て種々のタスク間の依存性を限定しかつデータ変
更を制御する。この発明の特徴は、一群の処理エレメントに対
するキヤツシユ機構を与えることであり、そのキ
ヤツシユ機構は順次的なオブジエクトコードのス
トリングをタスクの待ち行列に形成し、各待ち行
列は他のものとは論理的に非従属である。Claim 1: A data processing system for executing an instruction code comprising a main memory and comprising an operator and an object code including a main memory address, the data processing system comprising: a main memory; means for determining when to indicate a logical non-dependence that does not require the processing result of a logically non-dependent operator and placing a string consisting of said logically non-dependent operator and a subsequent logically dependent operator in a logically non-dependent queue; means for storing said formed logically independent queues; a plurality of processing means connected to said storing means and each receiving a separate one of said queues for simultaneous execution; each of the means includes a local memory, main memory address means for receiving a main memory address included in said instruction code and retrieving data from a corresponding main memory storage location; local memory addressing means for providing a local memory address to said main memory corresponding to a string, thereby enabling data retrieved from said main memory to be transferred to a corresponding storage location in said local memory. , data processing systems. 2. Data processing according to claim 1, wherein said local memory address means are adapted to append said local memory address to a corresponding string of operators constituting a particular logically independent queue. system. 3. A data processing method in a data processing system for receiving and executing a sequential code of object code including a main memory and having an operator and a main memory address, the processing system comprising a plurality of main memory devices each having a local memory. processing means, receiving said sequential object code, determining when an operator exhibits a logical non-dependency that does not require the result of a previous processing, and processing means for processing logically non-dependent operators and subsequent logical make strings of operators subordinate to logically independent queues, retrieve data from corresponding storage locations in said main memory in response to a main memory address contained in an object code, and write each string in said queue to a logically independent queue; correspondingly providing a local memory address to said main memory, thereby transferring retrieved data from said main memory to a corresponding local memory storage location, and distributing separate ones of said queues for simultaneous execution; A data processing method that involves transferring data to different individual processing means. 4. The data processing method of claim 3, further comprising appending the local memory address to a corresponding string of operators configuring a particular logically independent queue. 5. said data processing system includes a job queue, said logically independent queue of operators and corresponding local memory addresses to said job queue for subsequent distribution to different ones of said processing means; 5. The data processing method according to claim 4, further comprising transferring. Specification The following U.S. patent applications are directly or indirectly related to this application: Application No. 386339 entitled "Mechanism for Generating Dependent Free Codes for Multiprocessing Elements" filed by Alfred J. De Santis et al. on June 8, 1982 (Japanese Patent Application No. 59-501133) ). Application No. 386420 entitled "System and Method for Renaming Data Items for Dependent Free Codes" filed by Alfred J. De Santis on June 8, 1982
No. 501132). BACKGROUND OF THE INVENTION Field of the Invention This invention relates to mechanisms for generating dependent free code, and more particularly to such mechanisms for using multiple simultaneous processing elements. Description of the Prior Art Even today, most computers are of the Von Neuman type, driven by and executing instruction languages that are sequential in nature. Furthermore, such sequential languages contain many dependencies between individual instructions, such that individual instructions cannot be executed out of order. For example, consider the following sequence. C:=Fn(A,B) D:=Fn+i(C,E) These two functions Fn and Fn+i have logical dependence because the result of the function Fn is used as the input to the next function Fn+i. are doing. Additionally, a drawback of sequential languages is that there is redundancy in memory fetching and code processing when a sequence or loop is repeated;
If it were improved, the processor throughput would be increased. One way to increase the throughput of a processing system is to use multiple processors in multiprocessing mode. However, the individual processors must still execute instructions sequentially, and the only simultaneous processing occurs when the individual processors are executing separate segments of the program, or when they are executing completely different programs. It exists only when it exists. Such multiprocessing systems are described, for example, in U.S. Pat. No. 3,319,226 to Mott et al.
As disclosed in Anderson et al., US Pat. No. 3,419,849. Yet another attempt to increase throughput is to employ pipelining, in which the various subfunctions of instruction execution overlap. By overlapping these steps with a series of instructions, instruction execution can occur every clock time, thereby increasing processor throughput. All these methods for increasing throughput are designed for sequential instruction execution due to the logical dependencies between instructions as described above.
Because of the logical dependencies, true simultaneous processing cannot be achieved, in which the various instructions are executed independently of each other to easily accommodate processing by a group or multiple processing elements. An application language is an application language in that each statement is essentially independent of each other and can therefore be simultaneously implemented by a network of processing elements designed to reduce such application statements. It is different from command language. An example of such an applied language processor is
U.S. Patent Application No. 281,064 to Bolton et al.
Given in U.S. Patent Application No. 281,065 to Hagenmaier et al. Both of these applications were filed in July 1981.
The application was filed on May 7th and assigned to the assignee of the present application.
Such application languages are
It differs from command languages in that it is not sequential in the Von Neuman sense. However, most program libraries used today are written in imperative languages, and updates or even further generations of data processing systems that use those libraries must be made to implement imperative languages. One way in which processing power can be increased is by recognizing segments of object code that are independent of the results of previous processing and placing those segments into non-dependent sequences or queues that can be processed simultaneously by multiple processing elements. It is to form.
This, of course, requires handling of the operand in such a way that operations can be performed on the operand without destroying its original value as it resides in memory. Different symbolic names may be specified to refer to any data item for this purpose. The arrangement of such a queue of codes or symbols further accommodates simultaneous processing by the processing unit. It is an object of this invention to provide an improved mechanism for generating dependent free instruction codes. Another object of the invention is to provide dependent-free instruction code for execution by multiple processing elements. Yet another object of the invention is to provide an improved mechanism for providing dependent free instruction codes to multiple processing elements in a simultaneous manner. Yet another object of the invention is to provide a mechanism for generating instruction codes which are characterized by the fact that there are no redundant memory fetches and that the code does not have to be reprocessed for the processing of a series of such codes. It is to provide. Summary of the invention In order to achieve the above-mentioned object, this invention
taking a string of object code and forming them into higher-level tasks and determining sequences of such tasks that are logically non-dependent so that they are executed separately;
It is directed to a cache mechanism for data processors. This cache mechanism performs all memory accesses required by the various tasks and stores these tasks with corresponding pointers or references to local memory where the various data items have not been stored. This cache mechanism utilizes a symbol translation table in which tasks are stored in a queue with symbols representing various references or pointers to local memory. In this way, different data items can be assigned different symbols or symbolic names for use in different tasks, thus limiting dependencies between different tasks and controlling data modification. A feature of the invention is to provide a cache mechanism for a group of processing elements that forms a string of sequential object codes into queues of tasks, each queue being logically distinct from the others. It is essentially non-dependent.

[Brief explanation of the drawing]

この発明の上述の目的およびその他の目的、効
果および特徴は、図面を参照して行なう以下の詳
細な説明から一層明らかとなろう。第１図は、この発明が設計されるためのオブジ
エクトコードのストリングおよびそのオブジエク
トコードから形成される対応する論理的非従属待
ち行列である。第２図は、この発明によるシステムの概略ブロ
ツク図である。第３図は、この発明により形成される待ち行列
のフオーマツトを示す。第４図は、この発明に利用される記号翻訳テー
ブルモジユールの概略ブロツク図である。第５図は、この発明に用いられる処理エレメン
トの概略ブロツク図である。第６図は、この発明を示すタイミング図であ
る。発明の概略説明上述の目的、効果および特徴を達成するため
に、この発明は３つの異なつた見地、すなわち多
重処理エレメントによる改良されたコード処理、
リフアレンス処理および並列的実行を有する。コ
ード処理において、この発明はまず連結によつて
命令ストリングを予備処理し、一連の連結された
命令の間の関係を調べて、それらの命令を互いに
つないで従属命令の待ち行列を形成する。連結さ
れた命令が互いにつながれるべきかどうかを決定
するために用いられる機構は、続く連結された命
令に対する出力を与える１つの連結された命令へ
の従属である。一旦非従属性が位置決めされる
と、待ち行列が形成される。一旦待ち行列が形成
されると、この発明による機構はその全待ち行列
を１つのステツプで処理することによつて効果的
である。連結された命令を通常的に再処理するた
め数サイクルを必要とするのが、今では１サイク
ルでなされ、また待ち行列は一連のシーケンスの
実行に対し再発生される必要がない。さらに、コードの処理の間、前に参照されかつ
処理エレメントに対しローカルであるオペランド
リフアレンスは認識され得る。このことは各リフ
アレンスを受けかつそのアイテムがプロセツサの
ローカルメモリにあるかどうかをみるために翻訳
テーブルをスキヤンすることによつて達成され
る。もしリフアレンスがプロセツサのローカルメ
モリに常駐しなければ、この発明はそのリフアレ
ンスに記号を割当て、任意の待ち行列に対応する
個々の記号は１つの処理エレメントに対する後続
の転送のためそこに付加される。一旦対応する待
ち行列が形成されると、それらは複数の処理エレ
メントによつて同時に実行されることができる。今日の処理システムの設計において、スタツク
配向プロセツサを用いる傾向が多く、そこにおい
ては、プツシユダウンスタツク、または先入れ後
出しスタツクが与えられて、特定の高レベルプロ
グラム言語によつて用いられる再帰的手順および
ネステイツド処理を適合させる。このようなスタ
ツク配向プロセツサが与えられると、親制御プロ
グラムおよび処理システムの一部を形成する他の
ルーチンは、アルゴル60のような本質的に再帰的
である特定の高レベル言語で書かれることができ
る。この形式の特定のプロセツサモジユールは、
Barton等の特許第3461423号、3546677号、およ
び3548384号に開示されている。スタツク機構の機能、先入れ後出し機構は、命
令および関連のパラメータを、特定の高レベル言
語のネストされた構造を反射する方法で操作する
ことである。そのようなスタツクは主メモリに概
念的に常駐し、プロセツサのスタツク機構はスタ
ツク内のトツプデータアイテムに対するリフアレ
ンスを含むようにされている。この方法におい
て、データアイテムの多くの種々のスタツクはメ
モリ内に常駐し、プロセツサはそれらをプロセツ
サ内に存在するスタツクレジスタのトツプに対す
るアドレスに従つてアクセスし、種々のスタツク
はそのレジスタの内容の変化によつて別々のとき
アクセスされることができる。もしプロセツサがそのようなスタツク機構を与
えられなければ、プロセツサは再帰タイプの言語
を、その一般目的のレジスタをそれらがハードウ
エアスタツク機構であるにもかかわらずアドレス
することによつて実行する。この発明の好ましい実施例は高レベル再帰的言
語で書かれたプログラムを実行するためのそのよ
うなスタツク配向プロセツサに向けられている
が、この発明の内容は再帰的なものとは別の高レ
ベル言語プログラムの形式を実行する設計された
他の形式のプロセツサにも用いることができる。一旦プログラムがこの高レベル言語で書かれる
と、それはプロセツサのコンパイラによつてオブ
ジエクトコードまたは機械言語コードのストリン
グにコンパイラされ、その形式は特定のプロセツ
サの設計に従つて設計されならびに制御される。
上述したように、今日設計されるほとんどのプロ
セツサはなおVon Neumanタイプのものであり、
それは本来的に順次的でありかつ多くの論理的従
属性を含む。この発明が「デコンパイル」された高レベル言
語コードの形式で従属自由コードをいかに与える
かということを一般的に示すために、ここで第１
図を参照する。第１図の左欄は、Ｃ［Ｉ、Ｊ］：＝
Ａ［Ｉ、Ｊ］＋Ｂ［Ｉ、Ｊ］の計算のための機械言
語コードのストリングを表わす。この計算は多く
のアドレスに対するものであるので、第１図の左
端に示された機械言語コードのストリングはルー
プの一連のシーケンスまたはシリーズにおいて実
行される。このコードのストリングは４つのコードのグル
ープまたはサブセツトに分割され、その各々のグ
ループは第１図の中央部分のブロツク図によつて
示されるように他のものと大部分論理的に非従属
である。一般的にこの発明の機構は、次の処理が
前の処理またはストアされた処理と非従属である
とき、論理的に非従属のストリングの端部を決定
する。この発明において、機構は、第１図の右欄に示
されるように、値呼出しまたはメモリ取出しを実
行しかつオペレータの待ち行列またはデータアイ
テム（またはデータアイテムに対するローカルア
ドレス）を形成する。これらのオペレータおよび
そのデータアイテムは互いに連結され、以下に説
明する方法で処理エレメントに転送され得る。こ
のような連結された命令は、以後タスクとして参
照される。第１図の例において、４つの別々の待ち行列は
従属連結命令の論理的に非従属なグループであ
り、以下に説明するように別々の処理エレメント
によつて同時に実行され得る。第１図の左欄にお
けるコードのストリングはループのシーケンスに
おいて実行されるべきであるので、第１図の右欄
における新しく発生された待ち行列は再発生され
る必要はない。各一連のループにとつて必要なこ
とは、新たな値およびアレイアイテムがメモリか
ら取出されるということである。また、新たなポ
インタ値は、ストアされる変数に割当てられなけ
ればならない。 The above objects and other objects, effects, and features of the present invention will become more apparent from the following detailed description with reference to the drawings. FIG. 1 is a string of object codes and the corresponding logically independent queues formed from the object codes for which the present invention is designed. FIG. 2 is a schematic block diagram of a system according to the invention. FIG. 3 shows the format of a queue created in accordance with the present invention. FIG. 4 is a schematic block diagram of a symbol translation table module used in the present invention. FIG. 5 is a schematic block diagram of processing elements used in the present invention. FIG. 6 is a timing diagram illustrating the invention. SUMMARY DESCRIPTION OF THE INVENTION To achieve the above objects, advantages and features, the present invention utilizes three different aspects: improved code processing with multiple processing elements;
Has reference processing and parallel execution. In code processing, the invention first preprocesses an instruction string by concatenation, examines the relationship between a series of concatenated instructions, and concatenates the instructions together to form a queue of dependent instructions. The mechanism used to determine whether concatenated instructions should be concatenated together is dependence on one concatenated instruction that provides output for subsequent concatenated instructions. Once a non-dependency is located, a queue is formed. Once a queue is formed, the mechanism according to the invention is effective by processing the entire queue in one step. What would normally require several cycles to reprocess concatenated instructions is now done in one cycle, and the queue does not need to be regenerated for the execution of a sequence. Furthermore, during processing of the code, operand references that have been previously referenced and are local to the processing element can be recognized. This is accomplished by receiving each reference and scanning the translation table to see if the item is in the processor's local memory. If a reference does not reside in the processor's local memory, the invention assigns a symbol to the reference, and the individual symbol corresponding to any queue is appended thereto for subsequent transfer to a processing element. Once corresponding queues are created, they can be executed simultaneously by multiple processing elements. In the design of today's processing systems, there is a tendency to use stack-oriented processors, where push-down stacks, or first-in, last-out stacks are given, and the recursion used by certain high-level programming languages is used. Adapt procedures and nested processing. Given such a stack-oriented processor, the parent control program and other routines that form part of the processing system can be written in certain high-level languages that are recursive in nature, such as Algol. can. A particular processor of this form is
As disclosed in Barton et al. patents 3461423, 3546677, and 3548384. The function of the stack mechanism, a first-in, last-out mechanism, is to manipulate instructions and associated parameters in a manner that reflects the nested structure of a particular high-level language. Such a stack resides conceptually in main memory, and the processor's stack mechanism is adapted to contain a reference to the top data item in the stack. In this way, many different stacks of data items reside in memory, and the processor accesses them according to their addresses to the tops of stack registers present in the processor, with the various stacks containing the contents of those registers. Can be accessed at different times depending on the change. If the processor is not provided with such a stacking mechanism, it implements a recursive type of language by addressing its general purpose registers even though they are a hardware stacking mechanism. Although the preferred embodiment of this invention is directed to such a stack-oriented processor for executing programs written in high-level recursive languages, the subject matter of this invention is to Other types of processors designed to execute types of language programs may also be used. Once a program is written in this high-level language, it is compiled by the processor's compiler into a string of object code or machine language code, the format of which is designed and controlled according to the design of the particular processor.
As mentioned above, most processors designed today are still of the Von Neuman type;
It is sequential in nature and contains many logical dependencies. To generally demonstrate how the invention provides dependent-free code in the form of "decompiled" high-level language code, the first
See diagram. The left column of Figure 1 shows C[I, J]:=
Represents a string of machine language code for the calculation of A[I,J]+B[I,J]. Since this calculation is for many addresses, the string of machine language code shown at the far left of FIG. 1 is executed in a sequence or series of loops. This string of codes is divided into four groups or subsets of codes, each group being largely logically independent of the others as shown by the block diagram in the center portion of FIG. . In general, the mechanism of the present invention determines the ends of a logically independent string when the next process is independent of the previous process or stored process. In this invention, the mechanism performs value calls or memory fetches and creates operator queues or data items (or local addresses for data items), as shown in the right column of FIG. These operators and their data items may be concatenated with each other and transferred to processing elements in the manner described below. Such concatenated instructions are hereinafter referred to as tasks. In the example of FIG. 1, the four separate queues are logically non-dependent groups of dependent concatenated instructions that can be executed simultaneously by separate processing elements as described below. Since the string of code in the left column of FIG. 1 should be executed in a sequence of loops, the newly generated queue in the right column of FIG. 1 does not need to be regenerated. All that is required for each series of loops is that new values and array items are retrieved from memory. Also, a new pointer value must be assigned to the variable being stored.

[Detailed description of the invention]

この発明によるプロセツサシステムは第２図に
示されており、キヤツシユ機構１０はオペレータ
の個々の待ち行列および複数の小さい処理エレメ
ント１１ａ，ｂおよびｃに対するデータリフアレ
ンス、ならびに待ち行列処理エレメント１３ａを
供給するための機構であり、それらの各々はそれ
自身のローカルメモリ１２ａ，ｂおよびｃならび
にローカルメモリ１３ｂにそれぞれ与えられる。
キヤツシユ機構１０は、主メモリ（図示せず）と
直接に通信し、個々の処理エレメントはまた直接
ストレージモジユール１４によつて主メモリと通
信する。機構１０は４つのユニツトから構成されてお
り、その４つのユニツトは待ち行列タスクモジユ
ール１０ａ、命令リフアレンスモジユール10b、
記号翻訳モジユール１０ｃ、およびジヨブ待ち行
列１０ｄを含む。ここで、これらの個々のユニツ
トの機能を概略的に説明する。オブジエクトコー
ドまたは機械言語コードの個々のストリングは待
機タスクモジユール１０ａによつてメモリから受
取られ、待機タスクモジユール１０ａは、各命令
を直列的に受けてそれらをタスクの待ち行列にア
センブルするバツフアまたはキヤツシユメモリで
あり、タスクの待ち行列の長さは一連の連結され
たキヤラクタの間の論理的依存性による。待機タ
スクモジユール１０ａは、命令のつながれたグル
ープが以前の計算の結果を必要としないときを決
定するのに充分なデコード回路を含む。つながれ
たタスクのそのような待ち行列がアセンブルされ
てしまうと、そのオペランドリフアレンスは命令
リフアレンスモジユール１０ｂに転送され、命令
リフアレンスモジユール１０ｂは個々の命令およ
び割当記号によつて要求される任意のメモリ取出
しを実行する。待機タスクモジユール１０ａはま
た、記号翻訳モジユール１０ｃに待ち行列番号を
割当てる。命令リフアレンスモジユール１０ｂは絶対メモ
リアドレスが論理的に保持されているかどうかを
決定する連想メモリであり、もし保持されていな
ければ命令基準モジユール１０ｂはそのアドレス
を主メモリに送り、そのアドレスをストアし、そ
こに記号を割当てることによつてそのメモリアク
セスを行なう。この連想メモリは次に、記号翻訳
モジユール１０ｃに対応するタスクとともにその
記号を転送する。記号翻訳モジユール１０ｃはそ
の記号に対しポインタ（ローカルメモリアドレ
ス）を割当て、そのポインタを主メモリに転送
し、それによつて主メモリはローカルメモリ内に
データアイテムをストアすることができる。オブ
ジエクトコードのストリングの最初の実行の間、
一連の命令に対する待ち行列は記号翻訳モジユー
ル１０ｃにおいて形成される。これらの待ち行列
が形成される一方、個々のタスクおよびポインタ
はジヨブ待ち行列１０ｄに転送される。記号翻訳モジユール１０ｃは、待機タスクモジ
ユール１０ａによつて参照され得る種々の待ち行
列記憶位置を有するテーブルルツアツプメモリで
ある。これらの記憶位置は、処理エレメントのロ
ーカルメモリに保持されたつながれた命令および
アイテムの記号のリストを含む。各待ち行列が読
出されるとき、待ち行列に対する記号は、以下に
詳細に説明するように、記号によつて参照される
アイテムの実際の記憶位置に対するポインタを含
むルツクアツプテーブルに対する読出しアドレス
として用いられる。第１図のオブジエクトコード
ストリングの最初の処理の終わりに、ジヨブ待ち
行列１０ｅは個々の作り出された待ち行列を含
み、その作り出された待ち行列はタスクおよびポ
インタによつて同時実行のため各処理エレメント
１１ａ，１１ｂ、および１１ｃに直列的に送られ
得る。一方、実行のため必要とされる個々のデー
タアイテムは主メモリから取出されてローカルメ
モリ１２ａ，１２ｂおよび１２ｃの適当な記憶位
置でストアされており、その記憶位置はジヨブ待
ち行列１０ｄにおけるポインタによつアクセスさ
れる。オブジエクトコードの最初のループまたは実行
の完了により、すべてのタスク処理が完了されて
しまうまで記号翻訳モジユール１０ｃからジヨブ
待ち行列１０ｄへ以前に作り出された待ち行列を
供給することによつて、一連のループが実行され
得る。第２図のジヨブ待ち行列１０ｄに待ち行列が常
駐するときそのフオーマツトは第３図に示されて
いる。左から右へ読出される各フイールドは、乗
算命令、加算命令、減算命令、およびＩ、Ｊおよ
びＣフイールドに対するポインタが続くインデツ
クス命令である。これらは第１図における第１の
待ち行列（Q₀）に対応し、そこでは８ビツトリ
テラルは各乗算および加算命令の一部となる。このようにして形成された待ち行列は、将来の
実行のため命令を保持するのみならず、スタツク
環境ならびにそのアドレスおよび実行されるべき
次の待ち行列の記憶位置を識別する。ステツブご
とに１つの待ち行列を利用可能処理エレメントに
与えること以外、コード処理のため他のいかなる
処理ステツプも必要とされない。第２図の記号翻訳モジユール１０ｃは、第４図
に詳細に示されている。第４図に示されるよう
に、このモジユールはテーブルルツクアツプ機構
であり、待ち行列記号テーブル１６の列はつなが
れたタスクに対する記憶位置ならびに第２図の命
令リフアレンスモジユール１０ｂによつて割当て
られる記号名を表わし、また対応する行は第２図
の待機タスクモジユール１０ａによつて割当てら
れる各待ち行列番号を表わす。上述したように、
記号翻訳モジユールにおけるこのようにして形成
された待ち行列は、行なわれるべき計算の各一連
のループに対し、第２図のジヨブ待ち行列１０ｄ
に対する転送のため、ポインタテーブル１７にお
けるポインタをアクセスする準備ができている。第４図において、種々の記号は間接ローカルメ
モリリフアレンスであり、したがつてそこにスト
アされたアイテムは異なつたポインタを与えられ
得るということに注意されたい。このことは、２
つの利点をもたらす。まず第１は、異なつたポイ
ンタを再ネーミングまたは割当てしてそれを表わ
すことによつて、任意のデータアイテムがローカ
ルメモリ内の１つ以上の記憶位置にストアされる
ということである。第２の利点は、任意の変数が
１つの記憶位置にストアされてそのポインタを変
化することなくそこから出ていくことができる一
方、その変数に対して行なわれる処理の結果が同
一の記号名であるが異なつたポインタを有する別
の記憶位置にストアされ得るということである。第２図の個々の処理エレメントは、第５図に示
されている。要約すれば、それらは複数のマイク
ロプログラム化されたマイクロプロセツサから形
成されており、マイクロプロセツサはインテル
8086のような商業的に利用可能なものであり、ま
たはそれらはFaber等の米国特許第3983539号に
開示されたカストマイズされたマイクロプログラ
ム化されたプロセツサであつてもよい。個々のプ
ロセツサは異なつた関数を実行するようにされる
ので、それらはその各関数を実行するのに必要と
されるのみの論理回路を含む特別目的のマイクロ
プロセツサであつてもよい。各回路１８は、演算
論理ユニツト、シフトユニツト、乗算ユニツト、
インデキシングユニツト、ストリングプロセツ
サ、およびデコードユニツトである。さらに、シ
ーケンシングユニツト１９は第２図のジヨブ待ち
行列１０ｄから命令を受けて、制御ストア２０に
ストアされたマイクロ命令をアクセスする。制御
ストアからのマイクロ命令は命令バスIBを介し
て各ユニツトに供給され、ユニツトによつて発生
された任意の状態信号は状態バスCBを介して転
送される。対応するローカルメモリからのデータ
は、ＡバスABで受けられ、実行された結果はＢ
バスBBに供給される。第１図に戻つて、第２図の待機タスクモジユー
ル１０ａによつて受けられているコードストリン
グにおける種々の命令およびそのモジユールによ
つて形成されるより高レベルの命令またはタスク
のより詳細な説明をここで行なう。第１図の左欄
に示されるように、コードストリングの最初の３
つの命令はデータアイテムＩの値呼出しまたはメ
モリ取出し、８ビツト値、および乗算命令であ
る。それらは次のタスク、すなわち第１図の右欄
の第１のタスクによつて示されるリテラル値をＩ
に乗算するタスク、につながれている。処理は、
加算タスクおよび減算タスクに対し続けられる。
名前呼出し命令は、データアイテムアドレスをス
タツクのトツプにおく命令であり、インデツクス
命令はメモリ内にあるデイスクリプタにおけるポ
インタを挿入する結果となる。このようにして、
第１の待ち行列Q₀が形成される。 Q₁の配列は、名前呼出し命令の後命令NXLV
が実行されてインデツクス処理およびデータの取
出しを生じるということ以外、同様である。この
ようにして、第２の待ち行列O₁が形成される。
第３の待ち行列Q₂の配列において、加算命令が
存在し、この加算命令によつて、スタツクのトツ
プにおける値を破壊するメモリの破壊記憶
（STOD）の前にＡおよびＢに対しこのように計
算された値が加算される。第１図の中央のブロツク図からわかるように、
Q₂の最後の２つのタスクまたはつながれた命令
の実行は、その値がローカルメモリにストアされ
る計算Q₀およびQ₁の結果を必要とする。その記
憶位置およびそれらの各ローカルメモリは、リフ
アレンスが実際にそこにストアされているかどう
かを示すためのインデツクスフラグに与えられ
る。この方法において、処理エレメントが同時的
な方法で処理されるとき、Q₂のルーチンは必要
な値が計算されかつローカルメモリにストアされ
てしまう前に第２のまたは最終の加算タスクに達
することが可能である。対応する処理エレメント
はそれらの値がまだ利用可能でないということを
検出し、その値が利用可能になるまでそれらの記
憶位置をアクセスし続ける。第４の待ち行列またはQ₃は、値Ｊを取出しそ
れに１を加算し、スタツクのトツプにおける値を
残したままメモリ内の非破壊ストアを行なう前
に、スタツクのトツプにそのアドレスを挿入す
る。最後の４つの命令は、メモリから値Ｋを取出
し、それを値Ｊ（LSEQ）と比較し、もし値Ｋが
値Ｊよりも大きければ次の命令、すなわち偽への
分枝により、プログラムカウンタが再ロードさ
れ、そのルーチンが繰返される。一方、コードス
トリングにおける最後の命令は、ルーチンの終了
を生じる無条件分枝である。第６図は個々の待ち行列に対する待ち行列実行
時間のタイミング図であり、特定のタスクに対す
る各クロツク時間は２つの番号によつて表わされ
ている。第１の番号は実行されている特定のルー
プまたはシーケンスを表わし、第２の番号は命令
を実行している特定の処理エレメントを表わす。
待ち行列の配列の結果となるコードストリングの
最初の伝送ならびにタスクの実行はほぼ17クロツ
ク時間を必要とし、一方一連のループは実行のた
め５クロツク時間のみを必要とする。というの
は、タスクはQTMおよびIRMにおいて完全に再
処理される必要がないので、個々の従属自由待ち
行列が同時的に実行されるからである。一般的に、待機タスクモジユールはタスク、こ
れらのタスクの待機、待ち行列実行、タグ予測お
よび分枝訂正に対する命令の同時的なステツプを
実行する。命令基準モジユールは、再ネーミン
グ、記号マネージメントおよび取替の機能を実行
する。記号翻訳モジユールは、並列アクセス、ポ
インタ割当およびタスク割当を与える。小さな処
理エレメントは、独特な処理エレメントが頻繁で
ないタスク実行およびストリングの関数部分に対
し用いられている間、頻繁なタスク実行のため設
けられる。第２図の直接リフアレンスモジユール
１５は、ノンスタツクリフアレンスの評価のため
設けられる。最後にコンパイルされたオブジエクトコードを受けて
そのコードのシーケンスをより高レベルのタスク
に形成し、かつそれがオブジエクトコードストリ
ングの前の実行の結果を必要としないという意味
において他の待ち行列と論理的に無関係であるよ
うなタスクの待ち行列を形成する、データプロセ
ツサのための機構が説明されてきた。この方法に
おいて、そのような待ち行列のシーケンスは、同
時実行のため、非従属処理エレメントに供給され
得る。記号翻訳テーブルが設けられて、それによつて
データアイテムが参照され、その記号はローカル
メモリに対する任意のポインタを割当てられ、そ
のポインタは変化されて、それによつてデータア
イテムは１つ以上のメモリ記憶位置に常駐するこ
とができ、またデータアイテムはそのアイテムに
対する処理の結果が他の記憶位置にストアされ得
る一方でメモリ内に残ることができる。この発明の一実施例が説明されたが、請求の範
囲に記載された発明の精神から逸脱することなく
変更および修正が可能であるということは当業者
によつて明らかであろう。 A processor system according to the invention is shown in FIG. 2, in which a cache mechanism 10 provides data references for individual queues of operators and a plurality of small processing elements 11a, b and c, as well as a queue processing element 13a. each of which is provided with its own local memories 12a, b and c and local memory 13b, respectively.
The cache mechanism 10 communicates directly with main memory (not shown), and the individual processing elements also communicate with the main memory by direct storage modules 14. The mechanism 10 consists of four units: a queue task module 10a, an instruction reference module 10b,
It includes a symbol translation module 10c and a job queue 10d. The functions of these individual units will now be briefly explained. The individual strings of object code or machine language code are received from memory by the standby task module 10a, which has a buffer that serially receives each instruction and assembles them into a queue of tasks. or cache memory, where the length of a task's queue depends on logical dependencies between a series of linked characters. Waiting task module 10a includes sufficient decoding circuitry to determine when a concatenated group of instructions does not require the results of a previous computation. Once such a queue of linked tasks has been assembled, its operand references are transferred to the instruction reference module 10b, which is requested by the individual instructions and assignment symbols. Perform arbitrary memory fetches. The waiting task module 10a also assigns a queue number to the symbol translation module 10c. The instruction reference module 10b is an associative memory that determines whether an absolute memory address is logically held; if not, the instruction reference module 10b sends the address to main memory and stores the address. The memory is accessed by assigning a symbol to it. This associative memory then transfers the symbol along with the corresponding task to the symbol translation module 10c. The symbol translation module 10c allocates a pointer (local memory address) to the symbol and transfers the pointer to main memory so that main memory can store the data item therein. During the first execution of the string of object code,
A queue for a series of instructions is formed in the symbol translation module 10c. While these queues are formed, individual tasks and pointers are transferred to job queue 10d. The symbol translation module 10c is a table top up memory having various queue storage locations that can be referenced by the waiting task module 10a. These storage locations contain a list of linked instructions and item symbols maintained in the local memory of the processing element. When each queue is read, the symbol for the queue is used as a read address into a lookup table containing a pointer to the actual storage location of the item referenced by the symbol, as described in more detail below. . At the end of the first processing of the object code string of FIG. It can be fed serially to elements 11a, 11b, and 11c. Meanwhile, individual data items required for execution are retrieved from main memory and stored at appropriate locations in local memories 12a, 12b and 12c, which locations are indicated by pointers in job queue 10d. Accessed once. Completion of the first loop or execution of the object code causes the sequence of operations by feeding previously created queues from symbol translation module 10c to job queue 10d until all task processing has been completed. A loop may be executed. The format in which a queue resides in job queue 10d of FIG. 2 is shown in FIG. Each field read from left to right is a multiply instruction, an add instruction, a subtract instruction, and an index instruction followed by pointers to the I, J, and C fields. These correspond to the first queue (Q ₀ ) in FIG. 1, where an 8-bit literal becomes part of each multiply and add instruction. The queue thus formed not only holds the instruction for future execution, but also identifies the stack environment and its address and storage location of the next queue to be executed. No other processing steps are required for code processing other than providing a queue of available processing elements, one per step. The symbol translation module 10c of FIG. 2 is shown in more detail in FIG. As shown in FIG. 4, this module is a table lookup mechanism, and the columns of the queue symbol table 16 are storage locations for chained tasks as well as symbols assigned by the instruction reference module 10b of FIG. 2, and the corresponding row represents each queue number assigned by the waiting task module 10a of FIG. As mentioned above,
The queue thus formed in the symbolic translation module is similar to the job queue 10d of FIG.
The pointer in pointer table 17 is ready for access for transfer to. Note in FIG. 4 that the various symbols are indirect local memory references, so the items stored therein can be given different pointers. This means that 2
brings two advantages. First, any data item may be stored in one or more locations in local memory by renaming or assigning different pointers to represent it. A second advantage is that any variable can be stored in one storage location and leave it without changing its pointer, while the result of any operation performed on that variable has the same symbolic name. However, it can be stored in another storage location with a different pointer. The individual processing elements of FIG. 2 are illustrated in FIG. In summary, they are formed from multiple microprogrammed microprocessors, and the microprocessors are
8086, or they may be customized microprogrammed processors as disclosed in Faber et al. US Pat. No. 3,983,539. Since the individual processors are adapted to perform different functions, they may be special purpose microprocessors containing only as much logic circuitry as is needed to perform each function. Each circuit 18 includes an arithmetic logic unit, a shift unit, a multiplication unit,
These are an indexing unit, a string processor, and a decoding unit. Additionally, sequencing unit 19 receives instructions from job queue 10d of FIG. 2 and accesses microinstructions stored in control store 20. Microinstructions from the control store are provided to each unit via instruction bus IB, and any status signals generated by the units are transferred via status bus CB. The data from the corresponding local memory is received on the A bus AB, and the executed result is on the B bus.
Supplied to bus BB. Returning to FIG. 1, a more detailed description of the various instructions in the code string being received by the wait task module 10a of FIG. 2 and the higher level instructions or tasks formed by that module. will be performed here. As shown in the left column of Figure 1, the first 3
The three instructions are a value recall or memory fetch of data item I, an 8-bit value, and a multiply instruction. They input the literal value indicated by the next task, the first task in the right column of FIG.
The task of multiplying by, is connected to. The processing is
Continued for addition and subtraction tasks.
A name call instruction is an instruction that places a data item address at the top of the stack, and an index instruction results in the insertion of a pointer in a descriptor located in memory. In this way,
A first queue Q ₀ is formed. The array of Q ₁ is the instruction NXLV after the name call instruction.
is executed to result in index processing and data retrieval. In this way, a second queue O ₁ is formed.
In the array of the third queue _Q2 , there is an add instruction which causes A and B to be stored in this way before the storage of memory (STOD) which destroys the value at the top of the stack. The calculated values are added. As can be seen from the block diagram in the center of Figure 1,
Execution of the last two tasks or chained instructions of Q ₂ requires the results of calculations Q ₀ and Q ₁ whose values are stored in local memory. Its storage location and their respective local memory are given an index flag to indicate whether the reference is actually stored there. In this way, when the processing elements are processed in a concurrent manner, the routine in Q ₂ can reach the second or final addition task before the required values have been calculated and stored in local memory. It is possible. The corresponding processing elements detect that their values are not yet available and continue to access their storage locations until the values become available. The fourth queue, or _Q3 , takes the value J, adds 1 to it, and inserts its address at the top of the stack before performing a non-destructive store in memory, leaving the value at the top of the stack. The last four instructions retrieve the value K from memory, compare it with the value J (LSEQ), and if the value K is greater than the value J, a branch to the next instruction, i.e. false, sets the program counter to It will be reloaded and the routine will repeat. On the other hand, the last instruction in the code string is an unconditional branch that causes the routine to terminate. FIG. 6 is a timing diagram of queue execution times for individual queues, with each clock time for a particular task represented by two numbers. The first number represents the particular loop or sequence that is being executed, and the second number represents the particular processing element that is executing the instructions.
The initial transmission of the code string resulting in the queue arrangement and execution of the task requires approximately 17 clock hours, while the series of loops requires only 5 clock hours to execute. This is because the tasks do not have to be completely reprocessed in QTM and IRM, and the individual dependent free queues are executed concurrently. In general, the wait task module executes concurrent steps of instructions for tasks, waiting for these tasks, queuing execution, tag prediction, and branch correction. The command reference module performs renaming, symbol management, and replacement functions. The symbol translation module provides parallel access, pointer allocation and task allocation. Small processing elements are provided for frequent task execution while unique processing elements are used for infrequent task execution and the functional portion of the string. The direct reference module 15 of FIG. 2 is provided for evaluating non-stack reference. Finally, it takes the compiled object code and forms that sequence of code into a higher-level task, and is not a queue of other queues in the sense that it does not require the results of previous executions of the object code string. A mechanism has been described for a data processor to form a queue of tasks that are logically unrelated. In this manner, sequences of such queues may be provided to non-dependent processing elements for concurrent execution. A symbol translation table is provided by which a data item is referenced, the symbol is assigned an arbitrary pointer to local memory, and the pointer is changed such that the data item is transferred to one or more memory storage locations. The data item may reside in memory, and the data item may remain in memory while the results of operations on that item may be stored in other storage locations. Although one embodiment of the invention has been described, it will be apparent to those skilled in the art that changes and modifications can be made without departing from the spirit of the invention as claimed.