JP3723311B2

JP3723311B2 - Parallel processor

Info

Publication number: JP3723311B2
Application number: JP03669097A
Authority: JP
Inventors: 士朗小林
Original assignee: Asahi Kasei Corp
Current assignee: Asahi Kasei Corp
Priority date: 1997-02-20
Filing date: 1997-02-20
Publication date: 2005-12-07
Anticipated expiration: 2017-02-20
Also published as: JPH10232777A

Description

【０００１】
【発明の属する技術分野】
本発明は、プロセッサの構成に関し、特に、複数の演算器を有し、その複数の演算器を用いて並列の演算を行うことができる並列演算プロセッサに関するものである。
【０００２】
【従来の技術】
従来から、コンピュータ・アーキテクチャにおいて、複数の演算器を有し、その演算器を並列に動作することにより、並列演算することは行われている。このような並列動作する例としては、たとえば、種類の異なる演算器（たとえば乗累算器と算術論理演算器）を有するデジタル信号処理プロセッサ（ＤＳＰ）がある。
【０００３】
このような複数の演算器を有するプロセッサの場合は、並列演算を行うために命令語に複数の演算器に対応したフィールドを設け、このフィールドにより同時に複数の演算器の動作を制御している。
【０００４】
【発明が解決しようとする課題】
このような構成のプロセッサにおいて、並列演算のネックとなるのは、メモリからデータや命令を取り出すためのメモリ・バスである。このメモリ・バスを複数設けることにより、ネックを少なくすることは行われている。しかし、このメモリ・バスを設けることは、データ長（たとえば１６ビット）の導線を設けることであり、チップ上に大きな面積を占めることになり、また外部接続のためのピン等を設ける必要がある。このため、メモリ・バスを増設することは、プロセッサ・チップの面積を増大させ、また、価格を増加することを意味する。
【０００５】
また、命令語に並列動作させるために、複数のフィールドを設けることは、命令語を長くし、命令語メモリの効率を下げることを意味する。
【０００６】
本発明の目的は、メモリ・バスを増設することなく、並列動作することができるプロセッサを提供することである。
【０００７】
また、本発明の目的は、各演算器ごとのフィールドを設けることなく、並列演算を行うことのできるプロセッサを提供することである。
【０００８】
【課題を解決するための手段】
上記目的を達成するために、本発明は、相互にバスの競合を起こさない少なくとも２種類の演算命令を有する並列演算プロセッサであって、一方の種類の命令と他方の種類の命令を格納したプログラム・メモリと、前記２種類の命令のうちの前記一方の種類の命令を格納する命令キャッシュと、前記プログラム・メモリから読み出された前記一方の種類の命令と前記他方の種類の命令を選択するデコーダと、前記命令キャッシュから読み出された前記一方の種類の命令と、前記デコーダにより選択された前記一方の種類の命令を選択するセレクタと、該セレクタにより選択された前記一方の種類の命令を実行する第１の演算部と、前記デコーダにより選択された前記他方の種類の命令を実行する第２の演算部とを有し、前記命令キャッシュから読み出された前記一方の種類の命令を実行する間、前記プログラム・メモリからの前記他方の種類の命令を独立に実行することを特徴とする。
【０００９】
この発明では、命令キャッシュからの命令と、プログラム・メモリからの命令とを全く独立に実行することができるため、１つのプロセッサのなかに、あたかも２つの独立のプロセッサが存在するような処理が可能になる。
【００１０】
また、前記バスは少なくとも２本あり、前記一方の種類の演算命令は、前記２本のバスのうち片方のみを用いる命令であるとすることもできる。
【００１１】
前記命令キャッシュに格納される命令は、繰り返して用いるループ・プログラムであるとすることもできる。
【００１２】
そのうえ、第１の演算部は乗累算部であり、その乗累算部は、複数の乗累算器と、前記乗累算器間に挿入した遅延部と、ローカル・データ・メモリとを有し、前記バスからのデータと前記ローカル・データ・メモリからのデータとを演算することもできる。
【００１３】
乗累算部を有するプロセッサにおいては、フィルタの演算に多く利用されており、並列に動作することのできる機会が多く、命令キャッシュを用いることにより並列に処理できることが多く、本発明を有効に利用できる。
【００１４】
【発明の実施の形態】
本発明の実施形態を、図面を参照して詳細に説明する。
【００１５】
図１は、本発明の並列演算プロセッサの演算部の実施形態の一例を示すブロック図である。図１に示した並列演算プロセッサは、積和を計算することができる乗累算部および算術論理演部を有する信号処理プロセッサ（ＤＳＰ）を示している。
【００１６】
図１において、１０１および１０２はバンク構成のデータ・メモリ１およびデータ・メモリ２であり、それぞれメモリ・バス１１０３およびメモリバス２１０４に接続されて、データ・メモリ１１０１およびデータ・メモリ２１０２とは、独立にアクセスできる構成となっている。１１０は複数の乗累算器を有する乗累算部、１２０は算術論理演算器を有する算術論理演算部である。乗累算部１１０の構成については後で詳しく説明する。算術論理演算部１２０は、通常のプロセッサが有する演算器の機能を備えている。
【００１７】
１３０は乗累算部１１０に対する命令デコーダで、１４０は算術演算部１２０に対する命令デコーダである。１５０は命令語を一時的に格納する命令キャッシュである。１７０は、命令語が格納されており、上記データ・メモリとは独立に読み出すことができるプログラム・メモリである。１６０は、乗累算部用デコーダに入力する命令語を、命令キャッシュ１５０から入力するか、プログラム・メモリ１７０から入力するかを選択するセレクタである。１０５は、命令語が格納されているプログラム・メモリ１７０から読み出された命令語を乗累算部用命令デコーダ１３０、算術論理演算部用命令デコーダ１４０または命令キャッシュ１５０のどれに入力するかを選択するためのデコーダである。
【００１８】
さて、乗累算部１１０の構成および動作を詳しく説明する。
【００１９】
乗累算部１１０は、すくなくても２つ以上の乗累算器１〜ｎの１１５〜１１７を備えている。各乗累算器は、ａとｂの入力に対して、ａｂ＋ｃの積和を計算することができる（ｃは乗累算器中のレジスタに記憶している値である）。ローカル・データ・メモリ１１１は、１０個程度のデータワード分を記憶できる容量を有するローカル・メモリで、各乗累算器の入力の一方に接続されている。また、各乗累算器間には１サイクルの遅延ができる遅延回路１１２〜１１３が挿入されており、ローカルメモリからのデータを遅延している。
【００２０】
乗累算部１１０の動作を説明する。デジタル信号処理でよく利用されているフィルタの場合を例にして説明する。
【００２１】
フィルタに用いられる計算式は、ｙ_t を出力、ｘ_t を入力、αを係数とするとき、
【００２２】
【数１】

で表される。この計算式を、上述の乗累算部１１０で行うことを説明する。なお、ｋは、正の整数である。
【００２３】
さて、計算式の係数α₀ ，α₁ ，α₂ ，α₃ ，α₄ ，・・・α_k をローカルメモリにまず格納しておく。これは、データ・メモリ１またはデータ・メモリ２からローカルメモリへの転送命令を用意しておき、この転送命令を用いることにより行われる。
【００２４】
入力データであるｘ_t ，ｘ_t+1 ，ｘ_t+2 ，ｘ_t+3 ，ｘ_t+4 ，・・・ｘ_t+k は、データ・メモリ１１０１からメモリ・バス１１０３を介して順次読み出され、乗累算部１１０に入力される。乗累算部１１０に入力したデータは、乗累算器１１１５，乗累算器２１１６，乗累算器ｎ１１７に１サイクル遅れて入力される。また、係数α₀ ，α₁ ，α₂ ，α₃ ，α₄ ，・・・α_k は、ローカルメモリ１１１から順次読み出されて、乗累算器１，乗累算器２，・・・乗累算器ｎに、同時に入力される。
【００２５】
このように、入力されるデータを各乗累算器で計算すると、ｔのときからｋサイクル後に、乗累算器１，２，…，ｎには、それぞれｙ_t ，ｙ_t-1 ，…，ｙ_t-n として、
【００２６】
【数２】

【００２７】
【数３】

【００２８】
【数４】

が計算される。なお、ｘ_t-n ，…，ｘ_t-2 ，ｘ_t-1 は、以前に入力したデータが各遅延回路１１２〜１１３に残っていたものである。
【００２９】
このように、ローカル・メモリおよび遅延回路を用意することにより、データの読み出しは、２本用意されているメモリ・バスの一方のみを利用することで、２入力の演算をｎ重の並列で行うことができる。しかも、例えば同じフィルタの演算を繰り返し行うときは、最初にフィルタの演算に用いる係数をローカル・メモリに転送すれば、後はその転送された係数を用いることができるので、ローカル・メモリへの転送は、大したオーバーヘッドにはならない。
【００３０】
さて、命令キャッシュ１５０の動作について説明する。この命令キャッシュ１５０は、セレクタ１６０により切り替えて、プログラム・メモリ１７０に代わって、乗累算部用の命令デコーダ１３０に対して命令を供給できるような構成である。この命令キャッシュには、乗累算部１１０を用いて、データについては１つのデータ・メモリすなわち１つのメモリ・バスのみを用いる演算する命令を格納する。この様な命令は、例えば、前に説明したようなローカル・メモリを用いた演算を行う命令である。
【００３１】
乗累算部１１０が命令キャッシュ１５０からの命令により、メモリ・バスの１つを用いて演算を行っているのに並行して、プログラム・メモリ１７０から他のメモリ・バスを用いて、算術論理演算部で行う演算例えばビット処理やシステム制御処理を行う命令を読み出し、実行することができる。
【００３２】
このように、命令キャッシュ１５０からの命令と、プログラム・メモリ１７０からの命令とを全く独立に実行することができるため、１つのプロセッサのなかに、あたかも２つの独立のプロセッサが存在するような処理が可能になる。
【００３３】
このローカル・メモリに対して格納される命令としては、例えば、乗累算部１１０を用いて上述の計算式を計算するようなループのプログラムの命令がよい。この様な場合、ループのプログラムを制御するためのリピート（繰り返し）命令により、命令キャッシュを用いて繰り返しを行うかを指定することが多い。
【００３４】
図２（ａ）は、そのリピート命令のフォーマットの１例を示している。
【００３５】
リピート命令は、例えば、図２（ａ）で示すように、命令を識別する命令コード、リピートを行う範囲を示すプログラム・メモリのアドレス（Ａｄｄ）、リピート回数（ｃｏｕｎｔ）、命令キャッシュを用いるか否かを示すフラグ（Ｆ）で構成されている。
【００３６】
図２（ｂ）を用いて、どのように図２（ａ）に示したリピート命令と上述の命令キャッシュを用いて、並列に演算を行うかを説明する。プログラム・メモリ１７０から読み出した命令がリピート命令であり、命令キャッシュを用いてリピートを行うフラグが立っているとする。このリピート命令で指定されたリピートの範囲が（Ａ）である。プロセッサの制御部は、引き続き命令語をプログラム・メモリから読み出して実行するが、それとともに、読み出したリピートの範囲（Ａ）の命令語を命令キャッシュ１５０に格納する。そして、リピート範囲（Ａ）のプログラムの読み出しが終了すると、フラグが立っている場合は、そのまま、引き続きプログラム・メモリからの次のアドレスの命令語を読み出して実行する。
【００３７】
一方、命令キャッシュからも、リピート命令で指定された回数から１回少ない回数、繰り返し命令語が読み出されて実行される。この命令キャッシュから読み出されている時間の間、プロセッサは、命令キャッシュからの命令とプログラム・メモリからの命令により、並列に動作している。
【００３８】
したがって、リピート命令のフラグで命令キャッシュを利用して並列動作を行うことを指定した場合は、ループのプログラムの次のプログラムは、ループを行っている時間の間、並列動作していることを意識して作成する必要がある。図２（ｂ）において、プログラム・メモリ内の（Ｂ）と示した部分がその並列動作部分のプログラムに対応している。
【００３９】
なお、上記の例では、リピート命令によりフラグを用いて明示的に命令キャッシュを利用することを指定した。しかし、例えば乗累算部を用いるループの場合に必ず命令キャッシュを利用するときは、リピート範囲の乗累算部を用いる命令語を命令キャッシュに必ず転送することにすると、上記で説明したフラグは必要なくなる。
【００４０】
上記では、乗累算部を有するデジタル信号処理プロセッサ（ＤＳＰ）で説明した。これは、デジタル信号処理プロセッサ（ＤＳＰ）においては、乗累算部１１０の説明で例示したフィルタの演算に多く利用されており、並列に動作することのできる機会が多く、命令キャッシュ１５０を用いることにより並列に処理できることが多くなるからである。
【００４１】
しかし、上述の命令キャッシュの構成は、複数の演算部を有する汎用のプロセッサにも応用できる。この場合、少なくても命令キャッシュを設けた演算部と他の１つの演算部において、互いにバスの競合がない演算が可能であることが必要である。
【００４２】
【発明の効果】
上記の説明のように、本発明は、命令キャッシュからの命令と、プログラム・メモリからの命令とを全く独立に実行することができるため、１つのプロセッサのなかに、あたかも２つの独立のプロセッサが存在するような処理が可能になる。
【００４３】
乗累算部を有するプロセッサにおいては、フィルタの演算に多く利用されており、並列に動作することのできる機会が多く、命令キャッシュを用いることにより並列に処理できることが多く、本発明を有効に利用できる。
【図面の簡単な説明】
【図１】本発明の実施形態を示すブロック図である。
【図２】本発明のプロセッサの動作を説明する図である。
【符号の説明】
１０１，１０２データ・メモリ
１０３，１０４メモリ・バス
１０５デコーダ
１１０乗累算部
１１１ローカル・データ・メモリ
１１２，１１３遅延回路
１１５〜１１７乗累算器
１２０算術論理演算部
１３０乗累算部用命令デコーダ
１４０算術論理演算用命令デコーダ
１５０命令キャッシュ
１６０セレクタ
１７０プログラム・メモリ

[0001]
[Technical Field of the Invention]
The present invention relates to a configuration of a processor, and more particularly to a parallel arithmetic processor that has a plurality of arithmetic units and can perform parallel arithmetic using the arithmetic units.
[0002]
[Prior Art]
Computer architectures have long performed parallel computation by providing a plurality of arithmetic units and operating them in parallel. One example of such parallel operation is a digital signal processor (DSP) having arithmetic units of different types (for example, a multiplier-accumulator and an arithmetic logic unit).
[0003]
In a processor having a plurality of arithmetic units of this kind, the instruction word is given fields corresponding to the respective arithmetic units in order to perform parallel computation, and these fields control the operation of the plurality of arithmetic units simultaneously.
[0004]
[Problems to be solved by the invention]
In a processor of this configuration, the bottleneck for parallel operation is the memory bus used to fetch data and instructions from memory. Providing a plurality of memory buses has been used to reduce this bottleneck. However, providing a memory bus means providing conductors for the full data length (for example, 16 bits), which occupy a large area on the chip, and pins and the like must also be provided for external connection. Adding a memory bus therefore increases both the area of the processor chip and its cost.
[0005]
In addition, providing a plurality of fields in the instruction word for parallel operation lengthens the instruction word and lowers the efficiency of the instruction memory.
[0006]
An object of the present invention is to provide a processor capable of operating in parallel without adding a memory bus.
[0007]
It is another object of the present invention to provide a processor that can perform parallel operations without providing a field for each arithmetic unit.
[0008]
[Means for Solving the Problems]
To achieve the above objects, the present invention is a parallel arithmetic processor having at least two types of arithmetic instructions that do not cause bus contention with each other, comprising: a program memory storing instructions of one type and instructions of the other type; an instruction cache storing instructions of the one type of the two types; a decoder that selects between the one type of instruction and the other type of instruction read from the program memory; a selector that selects between the one type of instruction read from the instruction cache and the one type of instruction selected by the decoder; a first arithmetic unit that executes the one type of instruction selected by the selector; and a second arithmetic unit that executes the other type of instruction selected by the decoder, wherein, while the one type of instruction read from the instruction cache is being executed, the other type of instruction from the program memory is executed independently.
[0009]
In the present invention, instructions from the instruction cache and instructions from the program memory can be executed completely independently, so processing becomes possible as if two independent processors existed within one processor.
[0010]
Further, there may be at least two buses, and the one type of operation instruction may be an instruction using only one of the two buses.
[0011]
The instruction stored in the instruction cache may be a loop program used repeatedly.
[0012]
In addition, the first arithmetic unit may be a multiply-accumulate unit that includes a plurality of multiplier-accumulators, delay units inserted between the multiplier-accumulators, and a local data memory, and that operates on data from the bus and data from the local data memory.
[0013]
Processors having a multiply-accumulate unit are widely used for filter operations, which offer many opportunities for parallel operation. Using an instruction cache often allows such processing to proceed in parallel, so the present invention can be used effectively.
[0014]
[Embodiments of the Invention]
Embodiments of the present invention will be described in detail with reference to the drawings.
[0015]
FIG. 1 is a block diagram showing an example of an embodiment of the arithmetic section of the parallel arithmetic processor of the present invention. The parallel arithmetic processor shown in FIG. 1 is a digital signal processor (DSP) having a multiply-accumulate unit capable of calculating sums of products and an arithmetic logic operation unit.
[0016]
In FIG. 1, reference numerals 101 and 102 denote data memory 1 and data memory 2, organized as banks. They are connected to memory bus 1 (103) and memory bus 2 (104), respectively, so that data memory 1 (101) and data memory 2 (102) can be accessed independently. 110 is a multiply-accumulate unit having a plurality of multiplier-accumulators, and 120 is an arithmetic logic operation unit having an arithmetic logic unit. The configuration of the multiply-accumulate unit 110 is described in detail later. The arithmetic logic operation unit 120 provides the functions of the arithmetic units found in ordinary processors.
[0017]
130 is an instruction decoder for the multiply-accumulate unit 110, and 140 is an instruction decoder for the arithmetic logic operation unit 120. 150 is an instruction cache that temporarily stores instruction words. 170 is a program memory that stores instruction words and can be read independently of the data memories. 160 is a selector that selects whether the instruction word input to the multiply-accumulate unit decoder comes from the instruction cache 150 or from the program memory 170. 105 is a decoder that selects whether an instruction word read from the program memory 170 is input to the instruction decoder 130 for the multiply-accumulate unit, the instruction decoder 140 for the arithmetic logic operation unit, or the instruction cache 150.
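The routing performed by decoder 105 and selector 160 can be illustrated with a small software model. This is only a sketch: the class name `Frontend`, the `("mac" | "alu", mnemonic)` instruction tuples, and the `capture` argument are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass, field


@dataclass
class Frontend:
    """Toy model of decoder 105, instruction cache 150, and selector 160."""

    cache: list = field(default_factory=list)  # stands in for instruction cache 150
    use_cache: bool = False                    # position of selector 160

    def route(self, word, capture=False):
        """Decoder 105: choose where a word read from program memory 170 goes."""
        kind, _mnemonic = word
        if capture and kind == "mac":
            self.cache.append(word)            # keep a copy in the instruction cache
        return "decoder_130" if kind == "mac" else "decoder_140"

    def mac_source(self, from_program, cache_index):
        """Selector 160: choose what instruction decoder 130 receives."""
        return self.cache[cache_index] if self.use_cache else from_program


fe = Frontend()
dest_mac = fe.route(("mac", "madd"), capture=True)
dest_alu = fe.route(("alu", "and"))
fe.use_cache = True
replayed = fe.mac_source(None, 0)
```

In this model the multiply-accumulate decoder can keep receiving `replayed` words from the cache while `route` continues to hand newly fetched words to the other decoder.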
[0018]
The configuration and operation of the multiply-accumulate unit 110 are now described in detail.
[0019]
The multiply-accumulate unit 110 includes at least two multiplier-accumulators 1 to n (115 to 117). For inputs a and b, each multiplier-accumulator can calculate the sum of products ab + c, where c is a value held in a register inside the multiplier-accumulator. The local data memory 111 is a local memory with a capacity of about 10 data words and is connected to one input of each multiplier-accumulator. Delay circuits 112 to 113, each providing a one-cycle delay, are inserted between the multiplier-accumulators and delay the data from the local memory.
[0020]
The operation of the multiply-accumulate unit 110 is described next, taking as an example a filter of the kind often used in digital signal processing.
[0021]
With y_t as the output, x_t as the input, and α as the coefficients, the calculation formula used for the filter is expressed as
[0022]
[Expression 1]

The following describes how this formula is computed by the multiply-accumulate unit 110 described above. Note that k is a positive integer.
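The image for Expression 1 is not reproduced in this text. One form consistent with the surrounding description, in which the inputs x_t through x_{t+k} are paired with the coefficients α_0 through α_k, would be the following. The index convention is an assumption, not a transcription of the original figure.

```latex
y_t = \sum_{i=0}^{k} \alpha_i \, x_{t+i}
```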
[0023]
First, the coefficients α₀, α₁, α₂, α₃, α₄, ..., α_k of the formula are stored in the local memory. This is done by providing a transfer instruction from data memory 1 or data memory 2 to the local memory and using that instruction.
[0024]
The input data x_t, x_{t+1}, x_{t+2}, x_{t+3}, x_{t+4}, ..., x_{t+k} are read sequentially from data memory 1 (101) via memory bus 1 (103) and input to the multiply-accumulate unit 110. The data input to the multiply-accumulate unit 110 reach multiplier-accumulator 1 (115), multiplier-accumulator 2 (116), and multiplier-accumulator n (117) with successive one-cycle delays. The coefficients α₀, α₁, α₂, α₃, α₄, ..., α_k are read sequentially from the local memory 111 and input simultaneously to multiplier-accumulator 1, multiplier-accumulator 2, ..., multiplier-accumulator n.
[0025]
When the input data are processed by the multiplier-accumulators in this way, k cycles after time t the multiplier-accumulators 1, 2, ..., n hold y_t, y_{t-1}, ..., y_{t-n}, respectively, computed as
[0026]
[Expression 2]

[0027]
[Expression 3]

[0028]
[Expression 4]

Note that x_{t-n}, ..., x_{t-2}, x_{t-1} are previously input data that remain in the delay circuits 112 to 113.
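The images for Expressions 2 to 4 are not reproduced in this text. Assuming the filter has the form y_t = α_0 x_t + ... + α_k x_{t+k}, and that each successive multiplier-accumulator sees the input stream one cycle later than its neighbor, the accumulated results would take roughly the following form. This is a reconstruction under those assumptions, not the original figures.

```latex
\begin{aligned}
y_t     &= \alpha_0 x_t     + \alpha_1 x_{t+1}   + \cdots + \alpha_k x_{t+k} \\
y_{t-1} &= \alpha_0 x_{t-1} + \alpha_1 x_t       + \cdots + \alpha_k x_{t+k-1} \\
        &\;\;\vdots \\
y_{t-n} &= \alpha_0 x_{t-n} + \alpha_1 x_{t-n+1} + \cdots + \alpha_k x_{t+k-n}
\end{aligned}
```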
[0029]
As described above, by providing the local memory and the delay circuits, data can be read using only one of the two memory buses while two-input operations are performed n-fold in parallel. Moreover, when the same filter operation is performed repeatedly, for example, the coefficients need only be transferred to the local memory once at the start and can then be reused, so the transfer to the local memory adds little overhead.
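The n-fold parallel operation over a single data stream can be sketched in software. The model below assumes that the delay registers sit on the input-data path (as paragraph [0024] suggests) and that coefficients are broadcast to all multiplier-accumulators in the same cycle. The function name `mac_array_fir` and its arguments are hypothetical.

```python
def mac_array_fir(x_stream, history, alpha, n):
    """Simulate n multiplier-accumulators fed by one data stream.

    x_stream: samples x_t, ..., x_{t+k}, one per cycle over a single bus.
    history:  earlier samples still held in the delay registers,
              most recent first: [x_{t-1}, x_{t-2}, ...].
    alpha:    coefficients alpha_0, ..., alpha_k from the local memory.
    Returns the accumulators [y_t, y_{t-1}, ..., y_{t-n+1}].
    """
    # taps[m] is the sample seen by multiplier-accumulator m+1.
    taps = (list(history) + [0] * n)[:n]
    acc = [0] * n
    for a, x in zip(alpha, x_stream):
        taps = [x] + taps[:-1]          # new sample enters, older ones shift by one cycle
        for m in range(n):
            acc[m] += a * taps[m]       # every unit uses the same coefficient this cycle
    return acc


result = mac_array_fir(x_stream=[4, 5, 6], history=[3, 2], alpha=[1, 2, 3], n=3)
```

With three coefficients and three units, `result` holds three filter outputs after three cycles, although only one sample was read from data memory per cycle.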
[0030]
The operation of the instruction cache 150 is described next. The instruction cache 150 is configured so that, when switched in by the selector 160, it can supply instructions to the instruction decoder 130 for the multiply-accumulate unit in place of the program memory 170. The instruction cache stores instructions that use the multiply-accumulate unit 110 and that, for data, use only one data memory, that is, only one memory bus. An example of such an instruction is one that performs an operation using the local memory, as described above.
[0031]
While the multiply-accumulate unit 110 performs operations over one of the memory buses under instructions from the instruction cache 150, instructions for operations performed by the arithmetic logic operation unit, such as bit processing or system control processing, can be read from the program memory 170 and executed using the other memory bus.
[0032]
Because instructions from the instruction cache 150 and instructions from the program memory 170 can thus be executed completely independently, processing becomes possible as if two independent processors existed within one processor.
[0033]
Instructions well suited to storage in the instruction cache are, for example, the instructions of a loop program that computes the above formula using the multiply-accumulate unit 110. In such cases, whether the repetition is performed from the instruction cache is often specified by the repeat instruction that controls the loop program.
[0034]
FIG. 2(a) shows an example of the format of the repeat instruction.
[0035]
As shown in FIG. 2(a), for example, the repeat instruction consists of an instruction code identifying the instruction, a program memory address (Add) indicating the range to be repeated, a repeat count (count), and a flag (F) indicating whether the instruction cache is used.
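The four fields of the repeat instruction can be illustrated by packing them into a single word. The patent does not give field widths, field order, or opcode values, so everything numeric here is an assumption: a 16-bit word with a 4-bit instruction code, an 8-bit Add field, a 3-bit count, and a 1-bit F flag. The names `encode_repeat` and `decode_repeat` are likewise hypothetical.

```python
OPCODE_REPEAT = 0b1010  # hypothetical instruction code for "repeat"


def encode_repeat(add, count, use_cache):
    """Pack (instruction code, Add, count, F) into an assumed 16-bit layout."""
    if not (0 <= add < 256 and 0 <= count < 8):
        raise ValueError("field out of range for the assumed widths")
    return (OPCODE_REPEAT << 12) | (add << 4) | (count << 1) | int(use_cache)


def decode_repeat(word):
    """Unpack a word produced by encode_repeat."""
    if word >> 12 != OPCODE_REPEAT:
        raise ValueError("not a repeat instruction")
    return {
        "add": (word >> 4) & 0xFF,     # program memory address of the repeat range
        "count": (word >> 1) & 0x7,    # number of repetitions
        "use_cache": bool(word & 1),   # flag F: repeat from the instruction cache
    }


word = encode_repeat(add=0x2A, count=5, use_cache=True)
fields = decode_repeat(word)
```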
[0036]
FIG. 2(b) is used to explain how operations are performed in parallel using the repeat instruction shown in FIG. 2(a) and the instruction cache described above. Suppose that the instruction read from the program memory 170 is a repeat instruction and that its flag for repeating from the instruction cache is set. The repeat range specified by this repeat instruction is (A). The control unit of the processor continues to read instruction words from the program memory and execute them, and at the same time stores the instruction words of the repeat range (A) that it reads in the instruction cache 150. When reading of the program in the repeat range (A) is complete and the flag is set, the control unit simply continues by reading and executing the instruction word at the next address in the program memory.
[0037]
Meanwhile, instruction words are also read repeatedly from the instruction cache and executed, one time fewer than the count specified by the repeat instruction. While instructions are being read from the instruction cache, the processor operates in parallel on instructions from the instruction cache and instructions from the program memory.
[0038]
Therefore, when the flag of the repeat instruction specifies parallel operation using the instruction cache, the program that follows the loop program must be written with the awareness that it runs in parallel with the loop for as long as the loop is executing. In FIG. 2(b), the portion marked (B) in the program memory corresponds to the program of this parallel-operation portion.
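The overlap between the cached loop (A) and the following program (B) can be sketched as a per-cycle issue trace. The model assumes each stream issues one instruction per cycle and ignores pipeline details. The function name `run_repeat` and the instruction labels are hypothetical.

```python
def run_repeat(loop_a, count, section_b):
    """Return a list of (from_cache, from_program) pairs, one per cycle."""
    cache = []
    trace = []
    # First pass: range (A) is fetched from program memory and copied into the cache.
    for word in loop_a:
        cache.append(word)
        trace.append((None, word))
    # Remaining count - 1 passes replay from the cache while
    # program memory supplies the following section (B).
    replay = [w for _ in range(count - 1) for w in cache]
    for i in range(max(len(replay), len(section_b))):
        from_cache = replay[i] if i < len(replay) else None
        from_prog = section_b[i] if i < len(section_b) else None
        trace.append((from_cache, from_prog))
    return trace


trace = run_repeat(["mac0", "mac1"], count=3, section_b=["alu0", "alu1", "alu2"])
```

In `trace`, the first two cycles issue only from program memory. After that, every cycle with two non-empty entries is one in which both arithmetic units are busy, which is why section (B) has to be written with the concurrent loop in mind.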
[0039]
In the above example, use of the instruction cache is specified explicitly by a flag in the repeat instruction. However, if the instruction cache is always used for loops that use the multiply-accumulate unit, for example, the instruction words in the repeat range that use the multiply-accumulate unit can always be transferred to the instruction cache, and the flag described above is no longer needed.
[0040]
The above description used a digital signal processor (DSP) having a multiply-accumulate unit. This is because DSPs are widely used for filter operations such as the one given as an example for the multiply-accumulate unit 110, so there are many opportunities for parallel operation, and using the instruction cache 150 increases the amount of processing that can be done in parallel.
[0041]
However, the instruction cache configuration described above can also be applied to a general-purpose processor having a plurality of arithmetic units. In that case, at least the arithmetic unit provided with the instruction cache and one other arithmetic unit must be able to perform operations that do not contend with each other for a bus.
[0042]
[Effects of the Invention]
As described above, the present invention can execute instructions from the instruction cache and instructions from the program memory completely independently, so processing becomes possible as if two independent processors existed within one processor.
[0043]
Processors having a multiply-accumulate unit are widely used for filter operations, which offer many opportunities for parallel operation. Using an instruction cache often allows such processing to proceed in parallel, so the present invention can be used effectively.
[Brief description of the drawings]
FIG. 1 is a block diagram showing an embodiment of the present invention.
FIG. 2 is a diagram for explaining the operation of the processor of the present invention.
[Explanation of symbols]
101, 102 Data memory
103, 104 Memory bus
105 Decoder
110 Multiply-accumulate unit
111 Local data memory
112, 113 Delay circuit
115 to 117 Multiplier-accumulator
120 Arithmetic logic operation unit
130 Instruction decoder for the multiply-accumulate unit
140 Instruction decoder for the arithmetic logic operation unit
150 Instruction cache
160 Selector
170 Program memory

Claims

1. A parallel arithmetic processor having at least two types of arithmetic instructions that do not cause bus contention with each other, comprising:
a program memory storing instructions of one type and instructions of the other type;
an instruction cache storing instructions of the one type of the two types;
a decoder that selects between the one type of instruction and the other type of instruction read from the program memory;
a selector that selects between the one type of instruction read from the instruction cache and the one type of instruction selected by the decoder;
a first arithmetic unit that executes the one type of instruction selected by the selector; and
a second arithmetic unit that executes the other type of instruction selected by the decoder,
wherein, while the one type of instruction read from the instruction cache is being executed, the other type of instruction from the program memory is executed independently.

2. The parallel arithmetic processor according to claim 1, wherein there are at least two buses, and the one type of arithmetic instruction is an instruction using only one of the two buses.

3. The parallel arithmetic processor according to claim 1, wherein the instruction stored in the instruction cache is a loop program used repeatedly.

4. The parallel arithmetic processor according to claim 1, wherein the first arithmetic unit is a multiply-accumulate unit.

5. The parallel arithmetic processor according to claim 4, wherein the multiply-accumulate unit includes a plurality of multiplier-accumulators, delay units inserted between the multiplier-accumulators, and a local data memory, and operates on data from the bus and data from the local data memory.