JP6060853B2 - Processor and processor processing method

Info

Publication number: JP6060853B2
Application number: JP2013171051A
Authority: JP
Inventors: 修作内堀
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2013-08-21
Filing date: 2013-08-21
Publication date: 2017-01-18
Anticipated expiration: 2033-08-21
Also published as: JP2015041176A

Description

The present invention relates to a processor and to a processing method for a processor.

Pipelining is a well-known technique for shortening a processor's cycle time and raising its instruction processing rate with a small circuit scale. Pipelines are used in many processors, including those for supercomputers, mainframes, and embedded systems.

For example, FIG. 14 shows a typical processor that employs a pipeline. The processor 900 shown in FIG. 14 includes n logic circuits 901 (901_1 to 901_n), each of which executes one stage of an instruction consisting of n stages. In each clock cycle, the logic circuit 901_1 acquires data and executes the processing of stage 1. In the following clock cycle, the logic circuit 901_2 executes the processing of stage 2 based on the stage 1 result sent from the logic circuit 901_1. In this way, each logic circuit 901 executes its stage based on the result of the preceding stage and, on finishing, repeats the same processing for the next result sent from the preceding stage. To synchronize the timing of the stages, the processor 900 generally has registers (not shown) between the logic circuits 901. As a result, the processor 900 can feed data into the logic circuits 901 one item after another at an interval set by the processing time of the slowest stage, which is faster than starting the next instruction only after all stages of the current instruction have been processed. Furthermore, by increasing the number of stages and shortening the processing time of each stage, such a processor 900 can further shorten its cycle time and raise its instruction processing rate with only a small increase in circuit scale.
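The speed-up described above can be illustrated with a small cycle count. This is only a sketch under simplifying assumptions that are not in the source: every stage takes exactly one clock cycle, there are no hazards or stalls, and the function names are hypothetical.

```python
def cycles_without_pipeline(num_items: int, num_stages: int) -> int:
    """Each item passes through every stage before the next item starts."""
    return num_items * num_stages


def cycles_with_pipeline(num_items: int, num_stages: int) -> int:
    """A new item enters stage 1 in every cycle; the last item still needs
    num_stages cycles to drain through the pipeline."""
    if num_items == 0:
        return 0
    return num_stages + (num_items - 1)


if __name__ == "__main__":
    # Eight items through a four-stage pipeline (illustrative numbers only).
    print(cycles_without_pipeline(8, 4), cycles_with_pipeline(8, 4))
```

For long streams of data the pipelined count approaches one item per cycle, which is the throughput that the embodiments below aim to match by other means.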

Patent Document 1 describes an example of a processor that processes vector data by such pipeline operation. In the processor described in Patent Document 1, a pipeline control unit aligns data read from the main storage device and writes it to vector registers, and an arithmetic unit performs a predetermined vector operation by pipeline processing on the data read out one after another from the vector registers.

Patent Document 1: JP-A-8-305685

However, the processor described with reference to FIG. 14 and the processor described in Patent Document 1 have the following problems.

In the processors described above, shorter cycle times and advances in process technology in recent years have increased the number of register stages between pipeline stages, so overheads such as setup time and clock skew have grown and further speed-up has become difficult. In addition, making the pipeline deeper and the delay of each stage smaller requires attention to the balance between stages and to flush processing when a pipeline hazard occurs, so circuit design has become more complex. Power consumption has also been rising as pipelines are made faster. Meanwhile, as advances in process technology increase the circuit scale that can be integrated on an LSI (large-scale integration) chip, the advantage of a pipeline, namely achieving high speed with a small circuit scale, is diminishing.

The present invention has been made to solve the problems described above, and its object is to provide a processor that achieves a speed-up equal to or greater than that of a pipeline while suppressing the inter-stage overhead, circuit design complexity, and increased power consumption associated with pipelines.

The processor of the present invention includes: a data holding unit that holds each item of data read from a main storage device; and an operation unit that executes a series of processes based on each item of data in parallel, using a plurality of arithmetic units each of which starts the series of processes based on the next item of data only after finishing the series of processes based on the current item of data.

The processing method of the processor of the present invention executes, in parallel, a series of processes based on each item of data read from a main storage device and held in a data holding unit, using a plurality of arithmetic units each of which starts the series of processes based on the next item of data only after finishing the series of processes based on the current item of data.

The present invention can provide a processor that achieves a speed-up equal to or greater than that of a pipeline while suppressing the inter-stage overhead, circuit design complexity, and increased power consumption associated with pipelines.

FIG. 1 is a block diagram showing the configuration of a processor according to a first embodiment of the present invention.
FIG. 2 is a block diagram showing the configuration of a processor according to a second embodiment of the present invention.
FIG. 3 is a block diagram showing the configuration of a processor according to a third embodiment of the present invention.
FIG. 4 is a block diagram showing the configuration of a processor according to a fourth embodiment of the present invention.
FIG. 5 is a block diagram showing the configuration of a processor according to a fifth embodiment of the present invention.
FIG. 6 is a block diagram showing the configuration of a processor according to a sixth embodiment of the present invention.
FIG. 7 is a block diagram showing the configuration of a processor according to a seventh embodiment of the present invention.
FIG. 8 is a block diagram showing the configuration of a processor according to an eighth embodiment of the present invention.
FIG. 9 is a block diagram showing the configuration of a processor according to a ninth embodiment of the present invention.
FIG. 10 is a block diagram showing the configuration of another aspect of the processor according to the ninth embodiment of the present invention.
FIG. 11 is a block diagram showing the configuration of a processor according to a tenth embodiment of the present invention.
FIG. 12 is a block diagram showing the configuration of a processor according to an eleventh embodiment of the present invention.
FIG. 13 is a block diagram showing the configuration of a processor according to a twelfth embodiment of the present invention.
FIG. 14 is a block diagram showing the configuration of a processor of the related art.

Embodiments of the present invention are described in detail below with reference to the drawings.

(First Embodiment)
FIG. 1 shows the configuration of a processor 1 according to a first embodiment of the present invention.

In FIG. 1, the processor 1 includes a data holding unit 11 and an operation unit 12.

The data holding unit 11 is configured by, for example, registers, and holds each item of data read from a main storage device 800. Data is read sequentially from the main storage device 800 into the data holding unit 11 and held there so that data can be supplied continuously to the operation unit 12.

The operation unit 12 has a plurality of arithmetic units 13. Using these arithmetic units 13, the operation unit 12 executes, in parallel, a series of processes based on each item of data held in the data holding unit 11. Although FIG. 1 shows four arithmetic units 13, this does not limit the number of arithmetic units that the operation unit of the present invention may have.

Each arithmetic unit 13 finishes executing the series of processes for one item of data before starting the series of processes for the next item of data.

In the processor 1 configured in this way, while one arithmetic unit 13 is partway through a series of processes based on one item of data held in the data holding unit 11, another arithmetic unit 13 can start a series of processes based on another item of data. Each arithmetic unit 13 finishes executing the series of processes for one item of data before starting the series of processes for the next item.

As a result, the processor 1 according to the first embodiment of the present invention can achieve a speed-up equal to or greater than that of a pipeline while suppressing the inter-stage overhead, circuit design complexity, and increased power consumption associated with pipelines.

The reason is that the arithmetic units of the operation unit operate in parallel, each starting the series of processes for the next item of data only after finishing the series of processes for the current one. Because each arithmetic unit in this embodiment executes the series of processes as a single stage, the operating frequency can be lower than when the processes are executed in multiple stages, which suppresses the increase in power consumption. For the same reason, no registers are needed between stages, so the overhead problems of setup time, clock skew, and the like that accompany an increase in the number of inter-stage register stages do not arise. There is also no need to balance stages against one another or to consider flush processing on a pipeline hazard, so the circuit design of the processor is easy. In this way, this embodiment avoids the problems of pipelines by increasing the number of arithmetic units in the operation unit instead of increasing the number of pipeline stages. Because advances in process technology have increased the circuit scale that can be integrated on an LSI, increasing the number of arithmetic units is relatively easy. As a result, this embodiment can increase the number of series of processes that proceed in parallel by increasing the number of arithmetic units, and can thus achieve a speed-up while avoiding the problems of pipelines.
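The effect of adding arithmetic units can be sketched with a small scheduling model. The model is an illustration only, with assumptions that are not stated in the source: data items become available one per cycle, every series of processes takes the same fixed number of cycles, and the function and parameter names are hypothetical.

```python
def total_cycles(num_items: int, num_units: int, latency: int) -> int:
    """Cycle in which the last result is delivered.

    Item k (1-based) is available from cycle k. A unit accepts its next
    item only after finishing the current one, and delivers a result
    `latency` cycles after accepting the item.
    """
    free_at = [1] * num_units  # first cycle in which each unit can accept an item
    last_done = 0
    for item in range(num_items):
        # Pick the unit that becomes free earliest.
        unit = min(range(num_units), key=lambda u: free_at[u])
        start = max(item + 1, free_at[unit])
        free_at[unit] = start + latency
        last_done = max(last_done, free_at[unit])
    return last_done


if __name__ == "__main__":
    for units in (1, 2, 4):
        print(units, total_cycles(num_items=8, num_units=units, latency=4))
```

In this model, once the number of units equals the latency in cycles, an item can be accepted in every cycle, so adding further units no longer shortens the total.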

(Second Embodiment)
Next, a second embodiment of the present invention is described in detail with reference to the drawings. This embodiment describes an example in which the processor of the present invention is applied to a vector processor.

First, FIG. 2 shows the configuration of a vector processor 2 according to the second embodiment of the present invention.

In FIG. 2, the vector processor 2 includes a vector register unit 21 and an operation unit 22.

The vector register unit 21 constitutes one embodiment of the data holding unit of the present invention. The vector register unit 21 holds vector data read from the main storage device 800 and is configured by, for example, a plurality of vector registers.

Vector data is read sequentially from the main storage device 800 into the vector register unit 21 and held there so that the data elements of the vector data can be supplied continuously to the operation unit 22. For example, as shown in FIG. 2, the main storage device 800 may have a plurality of memories 801 and a memory interleave unit 802. In that case, the memory interleave unit 802 may interleave the vector data stored in the plurality of memories 801 and place it in the vector register unit 21. The memory interleave unit 802 may also write the operation results that the operation unit 22 has written back to the vector register unit 21 back to the memories 801. Various known techniques for transferring data to and from the vector register unit 21 can be applied to such a main storage device 800. Although FIG. 2 shows four memories 801, this does not limit the number of memories in a main storage device connected to the processor of the present invention.

The operation unit 22 has n arithmetic units 23 (23_1 to 23_n) and an interleave unit 24. Here, n is an integer of 1 or more and is equal to the number of stages that would result if the operation performed on one item of data by one arithmetic unit 23 were pipelined. Although FIG. 2 shows four arithmetic units 23, this does not limit the number of stages of the operation or the number of arithmetic units in the operation unit of the present invention.

Each arithmetic unit 23 executes the operation on each data element of the vector data as a single stage. That is, each arithmetic unit 23 finishes the operation on one data element before starting the operation on the next. Data elements are supplied to each arithmetic unit 23 by the interleave unit 24 described below, and each arithmetic unit 23 inputs its operation result to the interleave unit 24.

In the example of FIG. 2, each arithmetic unit 23 executes, as a single stage, the exponent comparison, mantissa shift, mantissa addition, and normalization of a floating-point addition on each data element. Because this series of processes (floating-point addition) consists of four steps, the operation unit 22 of the vector processor 2 in the example of FIG. 2 has four arithmetic units 23.
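The four steps named above can be sketched on a toy number format. The format here is an assumption made for illustration and is not taken from the source: operands are non-negative, given as hypothetical (mantissa, exponent) pairs with value mantissa * 2**exponent, the mantissa is 8 bits wide, and shifted-out bits are simply truncated. Real floating-point hardware also handles signs, rounding, and special values.

```python
from typing import Tuple

PRECISION = 8  # mantissa width in bits (hypothetical toy format)


def fp_add(a: Tuple[int, int], b: Tuple[int, int]) -> Tuple[int, int]:
    """Add two non-negative toy floats given as (mantissa, exponent)."""
    (ma, ea), (mb, eb) = a, b

    # 1. Exponent comparison: make (ma, ea) the operand with the larger exponent.
    if ea < eb:
        (ma, ea), (mb, eb) = (mb, eb), (ma, ea)

    # 2. Mantissa shift: align the other operand (low bits are truncated).
    mb >>= ea - eb

    # 3. Mantissa addition.
    m, e = ma + mb, ea

    # 4. Normalization: bring a non-zero result to exactly PRECISION bits.
    while m >= (1 << PRECISION):
        m >>= 1
        e += 1
    while m and m < (1 << (PRECISION - 1)):
        m <<= 1
        e -= 1
    return m, e
```

In the arrangement described in the text, one arithmetic unit would carry out all four steps for a data element before accepting the next element, rather than handing intermediate values from stage to stage.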

The interleave unit 24 interleaves data elements from the vector register unit 21 across the arithmetic units 23. That is, in one clock cycle the interleave unit 24 inputs the first data element from the vector register unit 21 to the arithmetic unit 23_1, and in the next clock cycle it inputs the second data element to the arithmetic unit 23_2. It then inputs the third to n-th data elements, one per subsequent clock cycle, to the arithmetic units 23_3 to 23_n. From the (n+1)-th data element onward, the interleave unit 24 inputs each element in turn to the arithmetic units that have finished the operation on their previous data element, starting again from the arithmetic unit 23_1.
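The round-robin distribution performed by the interleave unit can be sketched as a generator. This is a minimal illustration; the function name and the tuple layout are hypothetical, and clock cycles and unit indices are numbered from 1 to match the text.

```python
from itertools import cycle


def interleave(elements, num_units):
    """Yield (clock_cycle, unit_index, element), one element per clock cycle,
    handing elements to units 1..num_units in rotation."""
    units = cycle(range(1, num_units + 1))
    for clock, (unit, element) in enumerate(zip(units, elements), start=1):
        yield clock, unit, element


if __name__ == "__main__":
    for row in interleave(["e1", "e2", "e3", "e4", "e5"], num_units=4):
        print(row)
```

With four units, the fifth element returns to unit 1, which corresponds to the (n+1)-th element being handed to the arithmetic unit 23_1.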

In the floating-point addition described above, each arithmetic unit 23 requires two operands. In this case, the interleave unit 24 reads data elements, each containing two operands, one after another from the vector register unit 21 and inputs them to the arithmetic units 23. In practice, the interleave unit 24 may input one element of one vector register in the vector register unit 21 and one element of another vector register to an arithmetic unit 23 as the two operands. In the following, to simplify the description, each data element input from the vector register unit 21 to an arithmetic unit 23 is assumed to contain the number of operands required for the operation.

The interleave unit 24 also writes the operation results input from the arithmetic units 23 back to the vector register unit 21. For example, the interleave unit 24 may write the result data back to a vector register different from the vector register that holds the operand data elements.

The operation of the vector processor 2 configured as described above is explained below. In the following description, each arithmetic unit 23 executes a floating-point addition on the input data element, treating the four steps of exponent comparison, mantissa shift, mantissa addition, and normalization as a single stage. The operation unit 22 has four arithmetic units 23_1 to 23_4, equal in number to the four steps of the floating-point addition.

In the first clock cycle, the interleave unit 24 inputs the first data element read from the vector register unit 21 to the arithmetic unit 23_1. The arithmetic unit 23_1 performs exponent comparison, mantissa shift, mantissa addition, and normalization on the input data element. Four clock cycles later, in the fifth clock cycle, the arithmetic unit 23_1 inputs the result to the interleave unit 24. The interleave unit 24 writes the input result data back to the vector register unit 21.

In the second clock cycle, the interleave unit 24 inputs the second data element read from the vector register unit 21 to the arithmetic unit 23_2. The arithmetic unit 23_2 performs exponent comparison, mantissa shift, mantissa addition, and normalization on the input data element. Four clock cycles later, in the sixth clock cycle, the arithmetic unit 23_2 inputs the result to the interleave unit 24. The interleave unit 24 writes the input result data back to the vector register unit 21.

In the third clock cycle, the interleave unit 24 inputs the third data element read from the vector register unit 21 to the arithmetic unit 23_3. The arithmetic unit 23_3 performs exponent comparison, mantissa shift, mantissa addition, and normalization on the input data element. Four clock cycles later, in the seventh clock cycle, the arithmetic unit 23_3 inputs the result to the interleave unit 24. The interleave unit 24 writes the input result data back to the vector register unit 21.

In the fourth clock cycle, the interleave unit 24 inputs the fourth data element read from the vector register unit 21 to the arithmetic unit 23_4. The arithmetic unit 23_4 performs exponent comparison, mantissa shift, mantissa addition, and normalization on the input data element. Four clock cycles later, in the eighth clock cycle, the arithmetic unit 23_4 inputs the result to the interleave unit 24. The interleave unit 24 writes the input result data back to the vector register unit 21.

In the fifth clock cycle, the interleave unit 24 inputs the fifth data element read from the vector register unit 21 to the arithmetic unit 23_1. By this time, the arithmetic unit 23_1 has completed the floating-point addition of the data element input in the first clock cycle. The arithmetic unit 23_1 therefore performs exponent comparison, mantissa shift, mantissa addition, and normalization on the newly input data element. Four clock cycles later, in the ninth clock cycle, the arithmetic unit 23_1 inputs the result to the interleave unit 24. The interleave unit 24 writes the input result data back to the vector register unit 21.

Thereafter, the vector processor 2 repeats the same operation.
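The walkthrough above follows a regular timetable, which the following sketch reproduces. The constants mirror the example in the text (four units, four clock cycles per addition); the function name and row layout are hypothetical.

```python
LATENCY = 4    # clock cycles per floating-point addition in the walkthrough
NUM_UNITS = 4  # arithmetic units 23_1 to 23_4


def schedule(num_elements):
    """Return rows of (element, unit, input_cycle, result_cycle)."""
    rows = []
    result_due = {}  # unit -> cycle in which its pending result is delivered
    for k in range(1, num_elements + 1):
        unit = (k - 1) % NUM_UNITS + 1
        input_cycle = k
        # A unit takes a new element only once its previous operation is done.
        assert result_due.get(unit, 0) <= input_cycle
        result_cycle = input_cycle + LATENCY
        result_due[unit] = result_cycle
        rows.append((k, unit, input_cycle, result_cycle))
    return rows


if __name__ == "__main__":
    for row in schedule(5):
        print(row)
```

From the fifth clock cycle onward, one result is written back in every cycle, which is the same steady-state rate as a four-stage pipeline.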

Next, the effects of the second embodiment of the present invention are described.

The vector processor according to the second embodiment of the present invention can achieve a speed-up equivalent to that of an n-stage pipeline while suppressing the inter-stage overhead, circuit design complexity, and increased power consumption associated with pipelines.

The reason is that the operation unit has as many arithmetic units as there would be stages if the operation executed by each arithmetic unit were pipelined, the interleave unit supplies an element of the vector data to one arithmetic unit after another in each clock cycle, and each arithmetic unit executes the series of processes of the operation on one element as a single stage. As a result, the vector processor of this embodiment starts the operations on successive data elements one after another and executes them in parallel, just as when the vector operation is pipelined. Moreover, unlike pipelined vector processing, the vector processor of this embodiment needs no registers between stages, so there is no inter-stage overhead such as setup time or clock skew. Nor does its circuit design need to take into account the balance between stages or pipeline hazards.

Compared with processing an equivalent vector operation in an n-stage pipeline using a single arithmetic unit, the vector processor of this embodiment has n times the circuit scale but 1/n the operating frequency. Assuming that the power consumption of a processor is proportional to operating frequency × circuit scale × voltage × voltage, the power consumption of this embodiment is almost the same as that of pipeline processing.
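The power argument above can be written out as a one-line derivation. The symbols are introduced here for illustration and do not appear in the source: f is the operating frequency and S the circuit scale of the n-stage pipelined design, and V is the supply voltage, assumed to be the same in both designs.

```latex
P_{\text{pipeline}} \propto f \cdot S \cdot V^{2},
\qquad
P_{\text{parallel}} \propto \frac{f}{n} \cdot (nS) \cdot V^{2} = f \cdot S \cdot V^{2}
```

The factor n from the larger circuit scale cancels the factor 1/n from the lower operating frequency, which is why the two designs come out roughly equal under this proportionality.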

In this way, by increasing the number of arithmetic units that can run in parallel instead of increasing the number of pipeline stages, this embodiment avoids the problems of pipelines while achieving equivalent high throughput at almost the same power consumption as a pipeline.

(Third Embodiment)
Next, a third embodiment of the present invention is described in detail with reference to the drawings. This embodiment differs from the second embodiment of the present invention in that the type of operation can be switched for each element of the vector data. In the description of this embodiment and in the drawings, components identical to those of the second embodiment are given the same reference numerals, and their detailed description is omitted.

FIG. 3 shows the configuration of a vector processor 3 according to the third embodiment of the present invention. In FIG. 3, the vector processor 3 includes a vector register unit 31 and an operation unit 32.

The vector register unit 31 constitutes one embodiment of the data holding unit of the present invention. The vector register unit 31 is configured by, for example, a plurality of vector registers and holds a vector instruction and vector data. The vector instruction and the vector data are read from the main storage device 800.

The vector data consists of data elements, each containing the number of operands required for the operation.

The vector instruction consists of instruction elements, each indicating the type of operation to be performed on the corresponding element of the vector data.

The operation unit 32 has n arithmetic units 33 (33_1 to 33_n) and an interleave unit 34. Here, n is an integer of 1 or more and is equal to the number of stages that would result if the operation performed by one arithmetic unit 33 were pipelined. Although FIG. 3 shows four arithmetic units 33, this does not limit the number of stages of the operation or the number of arithmetic units in the operation unit of the present invention.

Each arithmetic unit 33 executes, as a single stage, a series of processes in which it decodes an instruction element to identify the type of operation and performs the identified operation on a data element. For example, each arithmetic unit 33 may be able to execute floating-point addition and floating-point multiplication selectively.
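The decode-then-execute behaviour of one arithmetic unit can be sketched as follows. The opcode values are hypothetical, since the source does not specify an instruction encoding, and Python's built-in floating-point arithmetic stands in for the hardware adder and multiplier.

```python
import operator

# Hypothetical encoding: 0 selects addition, 1 selects multiplication.
OPCODES = {0: operator.add, 1: operator.mul}


def execute(instruction_element, data_element):
    """Decode one instruction element and apply it to one data element,
    where the data element holds the two operands."""
    operation = OPCODES[instruction_element]  # decode
    a, b = data_element
    return operation(a, b)                    # execute


def run_vector(instructions, data):
    """Apply each instruction element to the matching data element."""
    return [execute(i, d) for i, d in zip(instructions, data)]


if __name__ == "__main__":
    print(run_vector([0, 1, 0], [(1.5, 2.5), (2.0, 4.0), (0.5, 0.25)]))
```

Because the operation is chosen per element, a single vector instruction in this scheme can mix additions and multiplications across the elements of one vector.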

Data elements and instruction elements are supplied to each arithmetic unit 33 by the interleave unit 34 described below, and each arithmetic unit 33 inputs its operation result to the interleave unit 34.

The interleave unit 34 supplies data elements and instruction elements from the vector register unit 31 to the arithmetic units 33 one after another. That is, in one clock cycle the interleave unit 34 inputs the first instruction element and the first data element to the arithmetic unit 33_1. It then inputs the second to n-th instruction elements and data elements, one pair per subsequent clock cycle, to the arithmetic units 33_2 to 33_n. After that, the interleave unit 34 supplies each subsequent instruction element and data element in turn to the arithmetic units that have finished the series of processes based on their previous instruction element and data element, starting again from the arithmetic unit 33_1.

なお、浮動小数点加算または浮動小数点乗算の場合、各演算器３３は、２つのオペランドを必要とする。そこで、インタリーブ部３４は、ベクトルレジスタ部３１から、命令要素、および、２つのオペランドを含むデータ要素を読み出して各演算器３３に入力するものとする。 In the case of floating point addition or floating point multiplication, each arithmetic unit 33 requires two operands. Therefore, the interleave unit 34 reads out an instruction element and a data element including two operands from the vector register unit 31 and inputs them to each computing unit 33.
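The decode-then-execute behavior of a single arithmetic unit 33 described above can be sketched as follows. This is only an illustrative model: the opcode values in `Op` and the function name `execute` are hypothetical, since the text does not specify how an instruction element is encoded.

```python
from enum import Enum


class Op(Enum):
    # Hypothetical encodings; the source does not define opcode values.
    FADD = 0  # floating point addition
    FMUL = 1  # floating point multiplication


def execute(instruction_element: int, data_element: tuple[float, float]) -> float:
    """Model one arithmetic unit: decode the instruction element, then apply
    the selected operation to the two operands held in the data element."""
    op = Op(instruction_element)  # decode step
    a, b = data_element           # the data element carries two operands
    if op is Op.FADD:
        return a + b
    return a * b
```

In the text both steps complete within a single stage of the arithmetic unit; the sketch mirrors that by doing them in one function call.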

また、インタリーブ部３４は、各演算器３３から入力される演算の結果をベクトルレジスタ部３１に書き戻す。なお、インタリーブ部３４は、ベクトルレジスタ部３１において、ベクトル命令およびオペランドとなるベクトルデータが保持されたベクトルレジスタとは異なるベクトルレジスタに演算の結果を書き戻す。 Further, the interleave unit 34 writes back the result of the calculation input from each calculator 33 to the vector register unit 31. The interleaving unit 34 writes back the operation result in a vector register different from the vector register in which the vector instruction and the vector data serving as the operand are stored in the vector register unit 31.

以上のように構成されたベクトルプロセッサ３の動作を以下に説明する。以下の説明では、各演算器３３は、命令要素のデコードおよびデータ要素に対する浮動小数点乗算または浮動小数点加算を１ステージで実行するものとする。また、演算部３２は、演算器３３が行う処理の段階数４に等しい４つの演算器３３＿１〜３３＿４を有するものとする。 The operation of the vector processor 3 configured as described above is described below. In the following description, each arithmetic unit 33 is assumed to execute the decoding of an instruction element and a floating point multiplication or floating point addition on a data element in one stage. The computing unit 32 is assumed to have four arithmetic units 33_1 to 33_4, equal in number to the four stages of the processing performed by an arithmetic unit 33.

最初のクロックサイクルで、インタリーブ部３４は、ベクトルレジスタ部３１から読みだされた１つ目の命令要素および１つ目のデータ要素を、演算器３３＿１に入力する。演算器３３＿１は、入力された命令要素をデコードし、デコード結果にしたがって、データ要素に対する浮動小数点加算または浮動小数点乗算を実行する。そして、演算器３３＿１は、４クロックサイクル後の５クロックサイクル目に、結果をインタリーブ部３４に入力する。インタリーブ部３４は、入力された結果データをベクトルレジスタ部３１へ書き戻す。 In the first clock cycle, the interleave unit 34 inputs the first instruction element and the first data element read from the vector register unit 31 to the arithmetic unit 33_1. The arithmetic unit 33_1 decodes the input instruction element and, according to the decoding result, performs floating point addition or floating point multiplication on the data element. Four clock cycles later, in the fifth clock cycle, the arithmetic unit 33_1 inputs the result to the interleave unit 34. The interleave unit 34 writes the input result data back to the vector register unit 31.

２クロックサイクル目で、インタリーブ部３４は、ベクトルレジスタ部３１から読みだされた２つ目の命令要素および２つ目のデータ要素を、演算器３３＿２に入力する。演算器３３＿２は、入力された命令要素をデコードし、デコード結果にしたがって、データ要素に対する浮動小数点加算または浮動小数点乗算を実行する。そして、演算器３３＿２は、４クロックサイクル後の６クロックサイクル目に、結果をインタリーブ部３４に入力する。インタリーブ部３４は、入力された結果データをベクトルレジスタ部３１へ書き戻す。 In the second clock cycle, the interleave unit 34 inputs the second instruction element and the second data element read from the vector register unit 31 to the arithmetic unit 33_2. The arithmetic unit 33_2 decodes the input instruction element and, according to the decoding result, performs floating point addition or floating point multiplication on the data element. Four clock cycles later, in the sixth clock cycle, the arithmetic unit 33_2 inputs the result to the interleave unit 34. The interleave unit 34 writes the input result data back to the vector register unit 31.

３クロックサイクル目で、インタリーブ部３４は、ベクトルレジスタ部３１から読みだされた３つ目の命令要素および３つ目のデータ要素を、演算器３３＿３に入力する。演算器３３＿３は、入力された命令要素をデコードし、デコード結果にしたがって、データ要素に対する浮動小数点加算または浮動小数点乗算を実行する。そして、演算器３３＿３は、４クロックサイクル後の７クロックサイクル目に、結果をインタリーブ部３４に入力する。インタリーブ部３４は、入力された結果データをベクトルレジスタ部３１へ書き戻す。 In the third clock cycle, the interleave unit 34 inputs the third instruction element and the third data element read from the vector register unit 31 to the arithmetic unit 33_3. The arithmetic unit 33_3 decodes the input instruction element and, according to the decoding result, performs floating point addition or floating point multiplication on the data element. Four clock cycles later, in the seventh clock cycle, the arithmetic unit 33_3 inputs the result to the interleave unit 34. The interleave unit 34 writes the input result data back to the vector register unit 31.

４クロックサイクル目で、インタリーブ部３４は、ベクトルレジスタ部３１から読みだされた４つ目の命令要素および４つ目のデータ要素を、演算器３３＿４に入力する。演算器３３＿４は、入力された命令要素をデコードし、デコード結果にしたがって、データ要素に対する浮動小数点加算または浮動小数点乗算を実行する。そして、演算器３３＿４は、４クロックサイクル後の８クロックサイクル目に、結果をインタリーブ部３４に入力する。インタリーブ部３４は、入力された結果データをベクトルレジスタ部３１へ書き戻す。 In the fourth clock cycle, the interleave unit 34 inputs the fourth instruction element and the fourth data element read from the vector register unit 31 to the arithmetic unit 33_4. The arithmetic unit 33_4 decodes the input instruction element and, according to the decoding result, performs floating point addition or floating point multiplication on the data element. Four clock cycles later, in the eighth clock cycle, the arithmetic unit 33_4 inputs the result to the interleave unit 34. The interleave unit 34 writes the input result data back to the vector register unit 31.

５クロックサイクル目で、インタリーブ部３４は、ベクトルレジスタ部３１から読みだされた５つ目の命令要素および５つ目のデータ要素を、演算器３３＿１に入力する。この時、演算器３３＿１は、１クロックサイクル目で入力されたデータ要素に対する演算を完了している。そこで、演算器３３＿１は、入力された命令要素をデコードし、デコード結果にしたがって、データ要素に対する浮動小数点加算または浮動小数点乗算を実行する。そして、演算器３３＿１は、４クロックサイクル後の９クロックサイクル目に、結果をインタリーブ部３４に入力する。インタリーブ部３４は、入力された結果データをベクトルレジスタ部３１へ書き戻す。 In the fifth clock cycle, the interleave unit 34 inputs the fifth instruction element and the fifth data element read from the vector register unit 31 to the arithmetic unit 33_1. By this time, the arithmetic unit 33_1 has completed the operation on the data element input in the first clock cycle. The arithmetic unit 33_1 therefore decodes the input instruction element and, according to the decoding result, performs floating point addition or floating point multiplication on the data element. Four clock cycles later, in the ninth clock cycle, the arithmetic unit 33_1 inputs the result to the interleave unit 34. The interleave unit 34 writes the input result data back to the vector register unit 31.

以降、ベクトルプロセッサ３は、同様の動作を繰り返す。 Thereafter, the vector processor 3 repeats the same operation.
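The cycle-by-cycle walk-through above follows a simple round-robin rule, which can be sketched as a small schedule generator. The function name `interleave_schedule` and the tuple layout are illustrative assumptions; the timing rule (element k is issued in cycle k to unit ((k - 1) mod n) + 1, and its result appears n cycles later) is read directly from the description for n = 4.

```python
def interleave_schedule(num_elements: int, n: int) -> list[tuple[int, int, int, int]]:
    """Return (element, unit, issue_cycle, result_cycle) tuples, all 1-indexed,
    for round-robin interleaving over n arithmetic units."""
    schedule = []
    for k in range(1, num_elements + 1):
        unit = (k - 1) % n + 1   # 33_1, 33_2, ..., 33_n, then back to 33_1
        issue_cycle = k          # one element is issued per clock cycle
        result_cycle = k + n     # each unit needs n clock cycles per element
        schedule.append((k, unit, issue_cycle, result_cycle))
    return schedule


for row in interleave_schedule(5, 4):
    print(row)
```

With n = 4, the fifth element returns to unit 1 in cycle 5, which is exactly the cycle in which unit 1 delivers the result of the first element.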

次に、本発明の第３の実施の形態の効果について述べる。 Next, effects of the third exemplary embodiment of the present invention will be described.

本発明の第３の実施の形態としてのベクトルプロセッサは、パイプラインにおける課題を回避しながら、ｎ段階のパイプラインと同等の高速化を実現する際に、ベクトルデータの要素ごとに異なる演算を可能として柔軟性を増大させる。 The vector processor according to the third embodiment of the present invention achieves a speed-up equivalent to that of an n-stage pipeline while avoiding the problems of pipelining and, in doing so, increases flexibility by allowing a different operation for each element of the vector data.

その理由は、ベクトルレジスタ部が、ベクトルデータに加えてベクトル命令を保持し、インタリーブ部が、データ要素および命令要素をｎ個の演算器に逐次供給し、各演算器が、命令要素が示す種類の演算を選択してデータ要素に対して実行する一連の処理を1ステージで実行するからである。 The reason is that the vector register unit holds vector instructions in addition to vector data, the interleave unit sequentially supplies data elements and instruction elements to the n arithmetic units, and each arithmetic unit executes, in one stage, the series of processes of selecting the type of operation indicated by the instruction element and performing it on the data element.

（第４の実施の形態） (Fourth embodiment)
次に、本発明の第４の実施の形態について図面を参照して詳細に説明する。本実施の形態は、本発明の第３の実施の形態に対して、演算部が有する演算器数をＮ（Ｎは１以上の整数）倍とする点が異なる。なお、本実施の形態の説明および各図面において、本発明の第３の実施の形態と同一の構成には同一の符号を付して本実施の形態における詳細な説明を省略する。 Next, a fourth embodiment of the present invention will be described in detail with reference to the drawings. This embodiment differs from the third embodiment of the present invention in that the number of arithmetic units included in the computing unit is multiplied by N (N is an integer of 1 or more). In the description of this embodiment and in each drawing, components identical to those of the third embodiment of the present invention are given the same reference numerals, and their detailed description is omitted here.

本発明の第４の実施の形態としてのベクトルプロセッサ４の構成を図４に示す。図４において、ベクトルプロセッサ４は、ベクトルレジスタ部３１と、演算部４２とを備える。 FIG. 4 shows the configuration of the vector processor 4 as the fourth embodiment of the present invention. In FIG. 4, the vector processor 4 includes a vector register unit 31 and a calculation unit 42.

演算部４２は、ｎ×Ｎ個の演算器３３（３３＿１〜３３＿ｎＮ）と、インタリーブ部４４とを有する。ここで、ｎは、１以上の整数であり、１つの演算器３３による演算がパイプラインで処理されると仮定した場合の段階数ｎに等しい。また、Ｎは、１以上の整数である。なお、図４には、ｎ＝４、Ｎ＝２の場合として８つの演算器３３を示したが、本発明におけるｎ、Ｎの値および演算部が有する演算器の数を限定するものではない。 The computing unit 42 includes n × N arithmetic units 33 (33_1 to 33_nN) and an interleave unit 44. Here, n is an integer of 1 or more and is equal to the number of stages n that the operation of one arithmetic unit 33 would have if it were processed in a pipeline. N is also an integer of 1 or more. Although FIG. 4 shows eight arithmetic units 33 for the case of n = 4 and N = 2, this does not limit the values of n and N or the number of arithmetic units included in the computing unit in the present invention.

インタリーブ部４４は、本発明の第３の実施の形態におけるインタリーブ部３４と同様に構成される。ただし、ベクトルレジスタ部３１から読み出される命令要素およびデータ要素を逐次供給する先となる演算器３３の数が異なる。 The interleaving unit 44 is configured in the same manner as the interleaving unit 34 in the third embodiment of the present invention. However, the number of arithmetic units 33 to which instruction elements and data elements read from the vector register unit 31 are sequentially supplied is different.

具体的には、インタリーブ部４４は、ｎ×Ｎ個の演算器３３に対して、ベクトルレジスタ部３１から読み出されるデータ要素および命令要素を逐次供給する。すなわち、インタリーブ部４４は、１つ目〜ｎ×Ｎ個目までの各命令要素および各データ要素を、各クロックサイクルで次々と、演算器３３＿１〜３３＿ｎＮに入力する。そして、インタリーブ部４４は、ｎ×Ｎ＋１個目以降の命令要素およびデータ要素を、前のデータ要素および命令要素に基づく一連の処理を終了した演算器３３＿１から順に逐次入力していく。 Specifically, the interleave unit 44 sequentially supplies the data elements and instruction elements read from the vector register unit 31 to the n × N arithmetic units 33. That is, the interleave unit 44 inputs the first to (n × N)-th instruction elements and data elements, one per clock cycle, to the arithmetic units 33_1 to 33_nN. The interleave unit 44 then inputs the (n × N + 1)-th and subsequent instruction elements and data elements in order, starting from the arithmetic unit 33_1, which has by then finished the series of processes based on its previous data element and instruction element.

以上のように構成されたベクトルプロセッサ４の動作を以下に説明する。以下の説明では、各演算器３３は、浮動小数点乗算または浮動小数点加算を１ステージで実行するものとする。また、演算部４２は、各演算器３３が行う処理の段階数４の２倍となる８つの演算器３３＿１〜３３＿８を有するものとする。 The operation of the vector processor 4 configured as described above is described below. In the following description, each arithmetic unit 33 is assumed to execute floating point multiplication or floating point addition in one stage. The computing unit 42 is assumed to have eight arithmetic units 33_1 to 33_8, twice the four stages of the processing performed by each arithmetic unit 33.

最初のクロックサイクルで、インタリーブ部４４は、ベクトルレジスタ部３１から読みだされた１つ目の命令要素および１つ目のデータ要素を、演算器３３＿１に入力する。演算器３３＿１は、入力された命令要素をデコードし、デコード結果にしたがって、データ要素に対する浮動小数点加算または浮動小数点乗算を実行する。そして、演算器３３＿１は、８クロックサイクル後の９クロックサイクル目に、結果をインタリーブ部４４に入力する。インタリーブ部４４は、入力されたデータをベクトルレジスタ部３１へ書き戻す。 In the first clock cycle, the interleave unit 44 inputs the first instruction element and the first data element read from the vector register unit 31 to the arithmetic unit 33_1. The arithmetic unit 33_1 decodes the input instruction element and, according to the decoding result, performs floating point addition or floating point multiplication on the data element. Eight clock cycles later, in the ninth clock cycle, the arithmetic unit 33_1 inputs the result to the interleave unit 44. The interleave unit 44 writes the input data back to the vector register unit 31.

以降、２クロックサイクル目〜８クロックサイクル目においても同様に、インタリーブ部４４は、ベクトルレジスタ部３１から読みだされた２つ目〜８つ目の命令要素およびデータ要素を、演算器３３＿２〜演算器３３＿８に逐次入力する。演算器３３＿２〜３３＿８は、入力された命令要素をデコードし、デコード結果にしたがって、データ要素に対する浮動小数点加算または浮動小数点乗算をそれぞれ実行する。そして、演算器３３＿２〜３３＿８は、８クロックサイクル後の１０〜１６クロックサイクル目に、結果をインタリーブ部４４にそれぞれ入力する。インタリーブ部４４は、入力されたデータをベクトルレジスタ部３１へ書き戻す。 Likewise, in the second to eighth clock cycles, the interleave unit 44 sequentially inputs the second to eighth instruction elements and data elements read from the vector register unit 31 to the arithmetic units 33_2 to 33_8. Each of the arithmetic units 33_2 to 33_8 decodes its input instruction element and, according to the decoding result, performs floating point addition or floating point multiplication on its data element. Eight clock cycles after its input, that is, in the tenth to sixteenth clock cycles, each of the arithmetic units 33_2 to 33_8 inputs its result to the interleave unit 44. The interleave unit 44 writes the input data back to the vector register unit 31.

９クロックサイクル目で、インタリーブ部４４は、ベクトルレジスタ部３１から読みだされた９つ目の命令要素およびデータ要素を、演算器３３＿１に入力する。この時、演算器３３＿１は、１クロックサイクル目で入力されたデータ要素に対する演算を完了している。そこで、演算器３３＿１は、入力された命令要素をデコードし、デコード結果にしたがって、データ要素に対する浮動小数点加算または浮動小数点乗算を実行する。そして、演算器３３＿１は、８クロックサイクル後の１７クロックサイクル目に、結果をインタリーブ部４４に入力する。インタリーブ部４４は、入力されたデータをベクトルレジスタ部３１へ書き戻す。 In the ninth clock cycle, the interleave unit 44 inputs the ninth instruction element and data element read from the vector register unit 31 to the computing unit 33_1. At this time, the calculator 33_1 has completed the calculation for the data element input in the first clock cycle. Therefore, the arithmetic unit 33_1 decodes the input instruction element, and performs floating point addition or floating point multiplication on the data element according to the decoding result. The computing unit 33_1 inputs the result to the interleaving unit 44 at the 17th clock cycle after 8 clock cycles. The interleave unit 44 writes the input data back to the vector register unit 31.

以降、ベクトルプロセッサ４は、同様の動作を繰り返す。 Thereafter, the vector processor 4 repeats the same operation.
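The fourth embodiment applies the same round-robin rule over n × N arithmetic units, with a latency of n × N of its (shorter) clock cycles. A minimal sketch follows; the function name `wide_interleave_schedule` and the parameter name `big_n` (standing for N) are illustrative assumptions rather than anything defined in the text.

```python
def wide_interleave_schedule(num_elements: int, n: int, big_n: int) -> list[tuple[int, int, int, int]]:
    """Return (element, unit, issue_cycle, result_cycle) tuples, all 1-indexed,
    for round-robin interleaving over n * big_n arithmetic units."""
    units = n * big_n
    schedule = []
    for k in range(1, num_elements + 1):
        unit = (k - 1) % units + 1  # 33_1 ... 33_nN, then back to 33_1
        # One element is issued per clock cycle; the result is available
        # n * N clock cycles after issue.
        schedule.append((k, unit, k, k + units))
    return schedule


for row in wide_interleave_schedule(9, n=4, big_n=2):
    print(row)
```

For n = 4 and N = 2 the ninth element is issued to unit 1 in cycle 9 and its result appears in cycle 17, matching the description above.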

次に、本発明の第４の実施の形態の効果について述べる。 Next, effects of the fourth exemplary embodiment of the present invention will be described.

本発明の第４の実施の形態としてのベクトルプロセッサは、パイプラインにおける課題を回避しながら、ｎ段階のパイプラインよりさらなる高性能を実現する。 The vector processor according to the fourth embodiment of the present invention achieves higher performance than an n-stage pipeline while avoiding problems in the pipeline.

その理由は、演算部が、本発明の第３の実施の形態における演算器をｎ×Ｎ個有し、インタリーブ部が、ｎ×Ｎ個の演算器に対してベクトル命令およびベクトルデータの各要素を逐次供給するからである。これにより、本実施の形態は、本発明の第３の実施の形態に対してクロックサイクルを１／Ｎにして動作周波数をＮ倍にすることができ、その結果、Ｎ倍のスループットを得られることになる。 The reason is that the computing unit has n × N of the arithmetic units of the third embodiment of the present invention, and the interleave unit sequentially supplies each element of the vector instructions and vector data to the n × N arithmetic units. This allows the present embodiment to shorten the clock cycle to 1/N of that of the third embodiment, that is, to raise the operating frequency N times, and as a result to obtain N times the throughput.

（第５の実施の形態） (Fifth embodiment)
次に、本発明の第５の実施の形態について図面を参照して詳細に説明する。本実施の形態は、本発明の第３の実施の形態に対して、ベクトルレジスタ部を分割して各分割部分を各演算器に対応させた点が異なる。なお、本実施の形態の説明および各図面において、本発明の第３の実施の形態と同一の構成には同一の符号を付して本実施の形態における詳細な説明を省略する。 Next, a fifth embodiment of the present invention will be described in detail with reference to the drawings. This embodiment is different from the third embodiment of the present invention in that the vector register section is divided and each divided portion is associated with each arithmetic unit. In the description of the present embodiment and each drawing, the same reference numerals are given to the same components as those of the third embodiment of the present invention, and the detailed description in the present embodiment will be omitted.

本発明の第５の実施の形態としてのベクトルプロセッサ５の構成を図５に示す。図５において、ベクトルプロセッサ５は、ベクトルレジスタ部５１と、演算部５２とを備える。 The configuration of the vector processor 5 as the fifth embodiment of the present invention is shown in FIG. In FIG. 5, the vector processor 5 includes a vector register unit 51 and a calculation unit 52.

ベクトルレジスタ部５１は、本発明におけるデータ保持部の一実施形態を構成する。ベクトルレジスタ部５１は、後述の演算部５２が有する演算器５３と同数であるｎ個の分割部５５（５５＿１〜５５＿ｎ）に分割されている。各分割部５５には、主記憶装置８００から読み込まれるベクトル命令およびベクトルデータがｎ個に分割されて保持される。 The vector register unit 51 constitutes an embodiment of the data holding unit in the present invention. The vector register unit 51 is divided into n division units 55 (55_1 to 55_n), which is the same number as the arithmetic units 53 included in the arithmetic unit 52 described later. In each division unit 55, vector instructions and vector data read from the main storage device 800 are divided into n pieces and held.

つまり、ベクトル命令およびベクトルデータの各要素数をｍとすると、分割部５５は、ベクトル命令およびベクトルデータのそれぞれｍ／ｎ個の要素を保持する。例えば、ｎ＝４とすると、分割部５５＿１には、ベクトル命令およびベクトルデータの要素４ｉ（ｉ＝０，１，・・・）番が保持される。また、分割部５５＿２には、ベクトル命令およびベクトルデータの要素４ｉ＋１番が保持される。分割部５５＿３には、ベクトル命令およびベクトルデータの要素４ｉ＋２番が保持される。分割部５５＿４には、ベクトル命令およびベクトルデータの要素４ｉ＋３番が保持される。 That is, if the vector instruction and the vector data each have m elements, each division unit 55 holds m/n elements of the vector instruction and m/n elements of the vector data. For example, when n = 4, the division unit 55_1 holds elements number 4i (i = 0, 1, ...) of the vector instruction and vector data. The division unit 55_2 holds elements number 4i + 1, the division unit 55_3 holds elements number 4i + 2, and the division unit 55_4 holds elements number 4i + 3 of the vector instruction and vector data.
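The element-to-division assignment can be illustrated with a short sketch. It assumes the regular mapping in which the element with index e (counted from 0) is held in the division unit 55_((e mod n) + 1); the function name `split_into_divisions` is hypothetical.

```python
def split_into_divisions(elements: list, n: int) -> list[list]:
    """Distribute vector elements over n division units so that the division
    at list index d (i.e. division unit 55_(d+1)) holds elements n*i + d."""
    return [elements[d::n] for d in range(n)]


# Eight element indices spread over four division units.
print(split_into_divisions(list(range(8)), 4))
```

Each division then holds m/n elements when m is a multiple of n, as stated in the text.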

また、各分割部５５は、各演算器５３に対応づけられる。 Each division unit 55 is associated with each computing unit 53.

演算部５２は、ｎ個の演算器５３（５３＿１〜５３＿ｎ）を有する。ここで、ｎは、１以上の整数であり、１つの演算器５３による演算がパイプラインで処理されると仮定した場合の段階数ｎに等しい。なお、図５には、ｎ＝４の場合として４つの分割部５５および４つの演算器５３を示したが、本発明におけるｎの値、ベクトルレジスタ部５１の分割数、および、演算部が有する演算器の数を限定するものではない。 The computing unit 52 includes n computing units 53 (53_1 to 53_n). Here, n is an integer greater than or equal to 1, and is equal to the number of stages n when it is assumed that the calculation by one calculator 53 is processed in the pipeline. In FIG. 5, four division units 55 and four arithmetic units 53 are shown as n = 4, but the value of n, the number of divisions of the vector register unit 51, and the arithmetic unit in the present invention are included. The number of arithmetic units is not limited.

各演算器５３は、本発明の第３の実施の形態における演算器３３とほぼ同様に構成されるが、対応する分割部５５からデータ要素および命令要素を取得する点が異なる。 Each computing unit 53 is configured in substantially the same manner as the computing unit 33 in the third embodiment of the present invention, except that the data element and the instruction element are acquired from the corresponding dividing unit 55.

なお、浮動小数点加算または浮動小数点乗算の場合、各演算器５３は２つのオペランドを必要とする。この場合、各演算器５３は、対応する分割部５５から、命令要素、および、２つのオペランドを含むデータ要素を読み出すものとする。 In the case of floating-point addition or floating-point multiplication, each arithmetic unit 53 requires two operands. In this case, each computing unit 53 reads an instruction element and a data element including two operands from the corresponding dividing unit 55.

また、各演算器５３は、演算の結果を、対応する分割部５５に書き戻す。なお、各演算器５３は、各分割部５５において、データ要素および命令要素とは異なる場所に結果を書き戻すものとする。 Each computing unit 53 writes the result of the computation back to the corresponding dividing unit 55. Note that each computing unit 53 writes the result back to a location different from the data element and the instruction element in each division unit 55.

以上のように構成されたベクトルプロセッサ５の動作を以下に説明する。以下の説明では、各演算器５３は、浮動小数点乗算または浮動小数点加算を１ステージで実行するものとする。また、演算部５２は、各演算器５３が行う処理の段階数４に等しい４つの演算器５３＿１〜５３＿４を有するものとする。また、ベクトルレジスタ部５１は、演算器５３と同数である４つの分割部５５に分割されているものとする。 The operation of the vector processor 5 configured as described above will be described below. In the following description, each computing unit 53 performs floating point multiplication or floating point addition in one stage. The computing unit 52 includes four computing units 53_1 to 53_4 that are equal to the number of stages 4 of processing performed by each computing unit 53. Further, it is assumed that the vector register unit 51 is divided into four division units 55 that are the same number as the arithmetic units 53.

最初のクロックサイクルで、演算器５３＿１は、分割部５５＿１の１つ目の命令要素および１つ目のデータ要素を読み出す。演算器５３＿１は、入力された命令要素をデコードし、デコード結果にしたがって、データ要素に対する浮動小数点加算または浮動小数点乗算を実行する。そして、演算器５３＿１は、１クロックサイクル後の２クロックサイクル目に、結果を分割部５５＿１へ書き戻す。 In the first clock cycle, the arithmetic unit 53_1 reads the first instruction element and the first data element of the dividing unit 55_1. The arithmetic unit 53_1 decodes the input instruction element, and performs floating point addition or floating point multiplication on the data element according to the decoding result. Then, the arithmetic unit 53_1 writes the result back to the dividing unit 55_1 in the second clock cycle after the first clock cycle.

同様に、最初のクロックサイクルで、演算器５３＿２は、分割部５５＿２の１つ目の命令要素および１つ目のデータ要素を読み出す。演算器５３＿２は、入力された命令要素をデコードし、デコード結果にしたがって、データ要素に対する浮動小数点加算または浮動小数点乗算を実行する。そして、演算器５３＿２は、１クロックサイクル後の２クロックサイクル目に、結果を分割部５５＿２へ書き戻す。 Similarly, in the first clock cycle, the arithmetic unit 53_2 reads the first instruction element and the first data element of the dividing unit 55_2. The arithmetic unit 53_2 decodes the input instruction element, and performs floating point addition or floating point multiplication on the data element according to the decoding result. The computing unit 53_2 then writes the result back to the dividing unit 55_2 in the second clock cycle after the first clock cycle.

同様に、最初のクロックサイクルで、演算器５３＿３は、分割部５５＿３の１つ目の命令要素および１つ目のデータ要素を読み出す。演算器５３＿３は、入力された命令要素をデコードし、デコード結果にしたがって、データ要素に対する浮動小数点加算または浮動小数点乗算を実行する。そして、演算器５３＿３は、１クロックサイクル後の２クロックサイクル目に、結果を分割部５５＿３へ書き戻す。 Similarly, in the first clock cycle, the arithmetic unit 53_3 reads the first instruction element and the first data element of the dividing unit 55_3. The arithmetic unit 53_3 decodes the input instruction element, and performs floating point addition or floating point multiplication on the data element according to the decoding result. The computing unit 53_3 writes the result back to the dividing unit 55_3 in the second clock cycle after the first clock cycle.

同様に、最初のクロックサイクルで、演算器５３＿４は、分割部５５＿４の１つ目の命令要素および１つ目のデータ要素を読み出す。演算器５３＿４は、入力された命令要素をデコードし、デコード結果にしたがって、データ要素に対する浮動小数点加算または浮動小数点乗算を実行する。そして、演算器５３＿４は、１クロックサイクル後の２クロックサイクル目に、結果を分割部５５＿４へ書き戻す。 Similarly, in the first clock cycle, the arithmetic unit 53_4 reads the first instruction element and the first data element of the dividing unit 55_4. The arithmetic unit 53_4 decodes the input instruction element, and performs floating point addition or floating point multiplication on the data element according to the decoding result. Then, the arithmetic unit 53_4 writes the result back to the dividing unit 55_4 in the second clock cycle after one clock cycle.

２クロックサイクル目で、演算器５３＿１〜５３＿４は、それぞれ、１クロックサイクル目で入力されたデータ要素に対する演算を完了している。そこで、演算器５３＿１〜５３＿４は、分割部５５＿１〜５５＿４のそれぞれ２つ目の命令要素およびデータ要素を読み出す。演算器５３＿１〜５３＿４は、入力された命令要素をデコードし、デコード結果にしたがって、データ要素に対する浮動小数点加算または浮動小数点乗算を実行する。そして、演算器５３＿１〜５３＿４は、１クロックサイクル後の３クロックサイクル目に、結果を分割部５５＿１〜５５＿４へそれぞれ書き戻す。 By the second clock cycle, each of the arithmetic units 53_1 to 53_4 has completed the operation on the data element input in the first clock cycle. The arithmetic units 53_1 to 53_4 therefore read the second instruction element and data element of the division units 55_1 to 55_4, respectively. Each of the arithmetic units 53_1 to 53_4 decodes its input instruction element and, according to the decoding result, performs floating point addition or floating point multiplication on its data element. One clock cycle later, in the third clock cycle, the arithmetic units 53_1 to 53_4 write their results back to the division units 55_1 to 55_4, respectively.

以降、ベクトルプロセッサ５は、同様の動作を繰り返す。 Thereafter, the vector processor 5 repeats the same operation.
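The lockstep operation of the fifth embodiment, in which every arithmetic unit reads from its own division unit in the same clock cycle, can be sketched as follows. The opcode strings, the `(opcode, a, b)` entry layout, and the function name `run_divided` are assumptions made for illustration only.

```python
import operator

# Hypothetical opcode names; the source does not define an encoding.
OPS = {"fadd": operator.add, "fmul": operator.mul}


def run_divided(divisions: list[list[tuple[str, float, float]]]) -> list[list[tuple[int, float]]]:
    """Each lane (arithmetic unit) processes its own division unit.
    Returns, per lane, a list of (write_back_cycle, result) pairs."""
    results = [[] for _ in divisions]
    steps = max((len(d) for d in divisions), default=0)
    for step in range(steps):  # step 0 is the first clock cycle
        for lane, division in enumerate(divisions):
            if step < len(division):
                opcode, a, b = division[step]
                # Read in cycle step + 1, written back one cycle later.
                results[lane].append((step + 2, OPS[opcode](a, b)))
    return results


lanes = [[("fadd", 1.0, 2.0), ("fmul", 2.0, 3.0)], [("fmul", 2.0, 4.0)]]
print(run_divided(lanes))
```

No interleaving logic appears in the sketch, which reflects the point made in the text that this arrangement does not need an interleave unit.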

次に、本発明の第５の実施の形態の効果について述べる。 Next, effects of the fifth exemplary embodiment of the present invention will be described.

本発明の第５の実施の形態としてのベクトルプロセッサは、パイプラインにおける課題を回避しながら、ｎ段のパイプラインと同等以上の性能を実現する際に、回路規模を削減するとともに消費電力をより削減することができる。 The vector processor according to the fifth embodiment of the present invention can reduce the circuit scale and further reduce power consumption while achieving performance equal to or better than that of an n-stage pipeline and avoiding the problems of pipelining.

その理由は、ベクトルレジスタ部が、演算器と同数であるｎ個の分割部に分割されて各演算器に対応付けられ、各分割部が、ベクトルデータを分割して保持し、各演算器が、対応する分割部から読み出される命令要素およびデータ要素に基づく一連の処理を１ステージで実行するからである。 The reason is that the vector register unit is divided into n division units, the same number as the arithmetic units, each associated with one arithmetic unit; each division unit holds its share of the divided vector data; and each arithmetic unit executes, in one stage, the series of processes based on the instruction element and data element read from its corresponding division unit.

これにより、本実施の形態は、本発明の第３の実施の形態においてｎ個の演算器を用いてｎ段のパイプラインと同等の性能を実現する際に必要としていたインタリーブ部を不要とする。そのため、本実施の形態は、本発明の第３の実施の形態より回路規模を削減することになる。また、本実施の形態は、ベクトルレジスタ部および演算器間のインタフェースのクロックサイクルをｎ倍にしている。その結果、本実施の形態は、本発明の第３の実施の形態より動作周波数を削減することができ、消費電力をさらに削減することになる。 As a result, the present embodiment does not need the interleave unit that the third embodiment of the present invention required in order to achieve performance equivalent to an n-stage pipeline with n arithmetic units. The present embodiment therefore has a smaller circuit scale than the third embodiment. In addition, the present embodiment makes the clock cycle of the interface between the vector register unit and the arithmetic units n times longer. As a result, the present embodiment can operate at a lower frequency than the third embodiment and thereby further reduces power consumption.

（第６の実施の形態） (Sixth embodiment)
次に、本発明の第６の実施の形態について図面を参照して詳細に説明する。本実施の形態は、本発明の第５の実施の形態に対して、ベクトルレジスタの要素数と同数の演算器を有する点が異なる。なお、本実施の形態の説明および各図面において、本発明の第５の実施の形態と同一の構成には同一の符号を付して本実施の形態における詳細な説明を省略する。 Next, a sixth embodiment of the present invention will be described in detail with reference to the drawings. This embodiment is different from the fifth embodiment of the present invention in that it has the same number of arithmetic units as the number of elements of the vector register. Note that, in the description of the present embodiment and each drawing, the same components as those in the fifth embodiment of the present invention are denoted by the same reference numerals, and detailed description in the present embodiment is omitted.

本発明の第６の実施の形態としてのベクトルプロセッサ６の構成を図６に示す。図６において、ベクトルプロセッサ６は、ベクトルレジスタ部６１と、演算部６２とを備える。 A configuration of a vector processor 6 as a sixth embodiment of the present invention is shown in FIG. In FIG. 6, the vector processor 6 includes a vector register unit 61 and a calculation unit 62.

ベクトルレジスタ部６１は、ｍ個の要素部６６（６６＿１〜６６＿ｍ）からなる。各要素部６６は、演算器６３による演算に必要な数のオペランドを含むデータ要素と、対応する命令要素とを保持するものとする。 The vector register unit 61 consists of m element units 66 (66_1 to 66_m). Each element unit 66 holds a data element containing the number of operands required for the operation by the arithmetic unit 63, together with the corresponding instruction element.

演算部６２は、要素部６６と同数のｍ個の演算器６３（６３＿１〜６３＿ｍ）を有する。 The computing unit 62 includes the same number m of computing units 63 (63_1 to 63_m) as the element unit 66.

各演算器６３は、本発明の第５の実施の形態における演算器５３とほぼ同様に構成されるが、対応する要素部６６から、データ要素および命令要素を取得する点が異なる。 Each computing unit 63 is configured in substantially the same manner as the computing unit 53 in the fifth embodiment of the present invention, except that a data element and an instruction element are obtained from the corresponding element unit 66.

また、各演算器６３は、演算の結果を、対応する要素部６６に書き戻す。なお、各演算器６３は、各要素部６６において、データ要素および命令要素とは異なる場所に結果を書き戻すものとする。 Each computing unit 63 writes back the result of the computation in the corresponding element unit 66. Note that each computing unit 63 writes the result back to a location different from the data element and the instruction element in each element unit 66.

以上のように構成されたベクトルプロセッサ６の動作を以下に説明する。以下の説明では、各演算器６３は、浮動小数点乗算または浮動小数点加算を１ステージで実行するものとする。 The operation of the vector processor 6 configured as described above will be described below. In the following description, each computing unit 63 performs floating point multiplication or floating point addition in one stage.

最初のクロックサイクルで、演算器６３＿１は、要素部６６＿１から命令要素およびデータ要素を読み出す。演算器６３＿１は、入力された命令要素をデコードし、デコード結果にしたがって、データ要素に対する浮動小数点加算または浮動小数点乗算を実行する。そして、演算器６３＿１は、１クロックサイクル後の２クロックサイクル目に、結果を要素部６６＿１へ書き戻す。 In the first clock cycle, the arithmetic unit 63_1 reads the instruction element and the data element from the element unit 66_1. The arithmetic unit 63_1 decodes the input instruction element, and performs floating point addition or floating point multiplication on the data element according to the decoding result. The computing unit 63_1 then writes the result back to the element unit 66_1 in the second clock cycle after the first clock cycle.

同様に、最初のクロックサイクルで、演算器６３＿２〜６３＿ｍは、要素部６６＿２〜６６＿ｍから命令要素およびデータ要素をそれぞれ読み出す。演算器６３＿２〜６３＿ｍは、入力された命令要素をデコードし、デコード結果にしたがって、データ要素に対する浮動小数点加算または浮動小数点乗算をそれぞれ実行する。そして、演算器６３＿２〜６３＿ｍは、１クロックサイクル後の２クロックサイクル目に、結果を要素部６６＿２〜６６＿ｍへそれぞれ書き戻す。 Similarly, in the first clock cycle, the arithmetic units 63_2 to 63_m read the instruction element and the data element from the element units 66_2 to 66_m, respectively. The arithmetic units 63_2 to 63_m decode the input instruction element, and execute floating point addition or floating point multiplication on the data element according to the decoding result, respectively. Then, the computing units 63_2 to 63_m write back the results to the element units 66_2 to 66_m in the second clock cycle after the first clock cycle.

２クロックサイクル目で、演算器６３＿１〜６３＿ｍは、１クロックサイクル目で入力されたデータ要素に対する演算をそれぞれ完了している。そこで、演算器６３＿１〜６３＿ｍは、要素部６６＿１〜６６＿ｍから命令要素およびデータ要素をそれぞれ読み出す。演算器６３＿１〜６３＿ｍは、入力された命令をデコードし、デコード結果にしたがって、データ要素に対する浮動小数点加算または浮動小数点乗算を実行する。そして、演算器６３＿１〜６３＿ｍは、１クロックサイクル後の３クロックサイクル目に、結果を要素部６６＿１〜６６＿ｍへそれぞれ書き戻す。 By the second clock cycle, each of the arithmetic units 63_1 to 63_m has completed the operation on the data element input in the first clock cycle. The arithmetic units 63_1 to 63_m therefore read an instruction element and a data element from the element units 66_1 to 66_m, respectively. Each of the arithmetic units 63_1 to 63_m decodes its input instruction and, according to the decoding result, performs floating point addition or floating point multiplication on its data element. One clock cycle later, in the third clock cycle, the arithmetic units 63_1 to 63_m write their results back to the element units 66_1 to 66_m, respectively.

以降、ベクトルプロセッサ６は、同様の動作を繰り返す。 Thereafter, the vector processor 6 repeats the same operation.

次に、本発明の第６の実施の形態の効果について述べる。 Next, effects of the sixth exemplary embodiment of the present invention will be described.

本発明の第６の実施の形態としてのベクトルプロセッサは、パイプラインにおける課題を回避しながら、パイプラインと同等以上の高性能を実現することができる。 The vector processor according to the sixth embodiment of the present invention can achieve high performance equal to or higher than that of a pipeline while avoiding problems in the pipeline.

その理由は、演算部が、ベクトルレジスタ部の要素数ｍと同数の演算器を有し、各演算器が、対応する要素部から命令要素およびデータ要素を読み出して一連の処理を１ステージで実行するからである。 The reason is that the arithmetic unit has the same number of arithmetic units as the number m of elements in the vector register unit, and each arithmetic unit reads the instruction element and the data element from the corresponding element unit and executes a series of processes in one stage. Because it does.

これにより、本実施の形態は、ｎ段階の処理を１ステージで行う演算器をｍ個備えるので、ｎ段のパイプラインに対してｍ／ｎ倍の性能を実現することができる。 Thus, since this embodiment includes m computing units that perform n stages of processing in one stage, m / n times performance can be realized with respect to an n stage pipeline.
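The m/n figure can be checked with a short calculation. The symbol T below is introduced only for illustration and is not used in the source: it denotes the time one arithmetic unit needs to complete the whole n-step operation in a single stage, and the compared pipeline is assumed to split the same operation into n stages of duration T/n, ignoring register overhead.

```latex
\begin{align*}
  \text{throughput of an $n$-stage pipeline} &= \frac{1}{T/n} = \frac{n}{T},\\
  \text{throughput of $m$ one-stage arithmetic units} &= \frac{m}{T},\\
  \text{ratio} &= \frac{m/T}{n/T} = \frac{m}{n}.
\end{align*}
```

Under these assumptions the sixth embodiment outperforms the pipeline whenever m > n, and matches it when m = n.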

（第７の実施の形態） (Seventh embodiment)
次に、本発明の第７の実施の形態について図面を参照して詳細に説明する。本実施の形態は、本発明の第６の実施の形態に対して、ベクトルレジスタ部に保持されるデータの一部が各演算器にブロードキャストされる点が異なる。なお、本実施の形態の説明および各図面において、本発明の第６の実施の形態と同一の構成には同一の符号を付して本実施の形態における詳細な説明を省略する。 Next, a seventh embodiment of the present invention will be described in detail with reference to the drawings. This embodiment is different from the sixth embodiment of the present invention in that a part of data held in the vector register unit is broadcast to each arithmetic unit. Note that, in the description of the present embodiment and the respective drawings, the same components as those in the sixth embodiment of the present invention are denoted by the same reference numerals, and detailed description thereof will be omitted.

本発明の第７の実施の形態としてのベクトルプロセッサ７の構成を図７に示す。図７において、ベクトルプロセッサ７は、ベクトルレジスタ部７１と、演算部７２とを備える。 FIG. 7 shows the configuration of a vector processor 7 as a seventh embodiment of the present invention. In FIG. 7, the vector processor 7 includes a vector register unit 71 and a calculation unit 72.

ベクトルレジスタ部７１は、ｍ個の要素部６６（６６＿１〜６６＿ｍ）と、ブロードキャスト部７７とを含む。主記憶装置８００から読み込まれるベクトルデータおよびベクトル命令は、ｍ個の要素部６６およびブロードキャスト部７７に保持される。ブロードキャスト部７７に保持されるベクトル命令およびベクトルデータは、後述の各演算器７３にブロードキャストされる。 The vector register unit 71 includes m element units 66 (66_1 to 66_m) and a broadcast unit 77. Vector data and vector instructions read from the main storage device 800 are held in the m element units 66 and the broadcast unit 77. The vector instructions and vector data held in the broadcast unit 77 are broadcast to each computing unit 73 described later.

演算部７２は、要素部６６と同数のｍ個の演算器７３（７３＿１〜７３＿ｍ）を有する。 The calculation unit 72 includes m computing units 73 (73_1 to 73_m), the same number as the element units 66.

各演算器７３は、データ要素および命令要素を、ブロードキャスト部７７および対応する要素部６６から取得して、命令要素の示す演算をデータ要素に対して実行する。このとき、各演算器７３は、オペランドとなるデータ要素を、要素部６６およびブロードキャスト部７７のいずれかまたは両方から読み出してもよい。また、各演算器７３は、命令要素を、要素部６６およびブロードキャスト部７７のいずれかから読み出せばよい。これにより、演算部７２は、全ての演算器７３で同一の命令を異なるオペランドに対して実行可能となる。あるいは、演算部７２は、全ての演算器７３で少なくとも１つの同一のオペランドを用いて異なる演算を実行可能となる。 Each computing unit 73 acquires the data element and the instruction element from the broadcast unit 77 and the corresponding element unit 66, and executes the operation indicated by the instruction element on the data element. At this time, each computing unit 73 may read a data element serving as an operand from either or both of the element unit 66 and the broadcast unit 77. Each computing unit 73 may read the instruction element from either the element unit 66 or the broadcast unit 77. As a result, the calculation unit 72 can execute the same instruction on different operands in all the computing units 73. Alternatively, the calculation unit 72 can execute different operations using at least one identical operand in all the computing units 73.

また、各演算器７３は、演算の結果を、対応する要素部６６に書き戻す。 Each computing unit 73 writes the result of the computation back to the corresponding element unit 66.

以上のように構成されたベクトルプロセッサ７の動作を以下に説明する。以下の説明では、各演算器７３は、浮動小数点乗算または浮動小数点加算を１ステージで実行するものとする。また、以下の説明では、ブロードキャスト部７７からは、命令要素およびオペランドの１つを表すデータ要素がブロードキャストされ、要素部６６からは、オペランドの他方を表すデータ要素が読み出されるものとする。 The operation of the vector processor 7 configured as described above will be described below. In the following description, each computing unit 73 performs floating point multiplication or floating point addition in one stage. It is also assumed that the broadcast unit 77 broadcasts an instruction element and a data element representing one of the operands, and that a data element representing the other operand is read from the element unit 66.

最初のクロックサイクルで、演算器７３＿１は、ブロードキャスト部７７からブロードキャストされた命令要素およびデータ要素と、要素部６６＿１から読みだされたデータ要素を入力として得る。演算器７３＿１は、入力された命令要素をデコードし、デコード結果にしたがって、データ要素に対する浮動小数点加算または浮動小数点乗算を実行する。そして、演算器７３＿１は、１クロックサイクル後の２クロックサイクル目に、結果を要素部６６＿１へ書き戻す。 In the first clock cycle, the arithmetic unit 73_1 receives as inputs the instruction element and data element broadcast from the broadcast unit 77 and the data element read from the element unit 66_1. The arithmetic unit 73_1 decodes the input instruction element and, according to the decoding result, performs floating point addition or floating point multiplication on the data elements. Then, the arithmetic unit 73_1 writes the result back to the element unit 66_1 in the second clock cycle, one clock cycle later.

同様に最初のクロックサイクルで、演算器７３＿２〜７３＿ｍは、ブロードキャスト部７７からブロードキャストされた命令要素およびデータ要素と、要素部６６＿２〜６６＿ｍから読みだされたデータ要素をそれぞれ入力として得る。演算器７３＿２〜７３＿ｍは、入力された命令要素をデコードし、デコード結果にしたがって、データ要素に対する浮動小数点加算または浮動小数点乗算をそれぞれ実行する。そして、演算器７３＿２〜７３＿ｍは、１クロックサイクル後の２クロックサイクル目に、結果を要素部６６＿２〜６６＿ｍへそれぞれ書き戻す。 Similarly, in the first clock cycle, the arithmetic units 73_2 to 73_m respectively receive as inputs the instruction element and data element broadcast from the broadcast unit 77 and the data elements read from the element units 66_2 to 66_m. The arithmetic units 73_2 to 73_m decode the input instruction element and, according to the decoding result, each perform floating point addition or floating point multiplication on the data elements. Then, the arithmetic units 73_2 to 73_m write the results back to the element units 66_2 to 66_m, respectively, in the second clock cycle, one clock cycle later.

２クロックサイクル目で、演算器７３＿１〜７３＿ｍは、１クロックサイクル目で入力されたデータ要素に対する一連の処理をそれぞれ完了している。そこで、演算器７３＿１〜７３＿ｍは、ブロードキャスト部７７からブロードキャストされた命令要素およびデータ要素と、要素部６６＿１〜６６＿ｍから読みだされたデータ要素をそれぞれ入力として得る。演算器７３＿１〜７３＿ｍは、入力された命令要素をデコードし、デコード結果にしたがって、データ要素に対する浮動小数点加算または浮動小数点乗算をそれぞれ実行する。そして、演算器７３＿１〜７３＿ｍは、１クロックサイクル後の３クロックサイクル目に、結果を要素部６６＿１〜６６＿ｍへそれぞれ書き戻す。 In the second clock cycle, the arithmetic units 73_1 to 73_m have each completed the series of processes for the data elements input in the first clock cycle. The arithmetic units 73_1 to 73_m therefore receive as inputs the instruction element and data element broadcast from the broadcast unit 77 and the data elements read from the element units 66_1 to 66_m, respectively. The arithmetic units 73_1 to 73_m decode the input instruction element and, according to the decoding result, each perform floating point addition or floating point multiplication on the data elements. Then, the arithmetic units 73_1 to 73_m write the results back to the element units 66_1 to 66_m, respectively, in the third clock cycle, one clock cycle later.

以降、ベクトルプロセッサ７は、同様の動作を繰り返す。 Thereafter, the vector processor 7 repeats the same operation.
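The cycle-by-cycle behavior described above can be illustrated with a small software model. This is only a sketch under stated assumptions: the opcode names `"fadd"` and `"fmul"`, the function `broadcast_step`, and the sample values are hypothetical and are not taken from the original; the model ignores timing and shows only the data flow in which every computing unit receives the same broadcast instruction and operand and combines them with its own element.

```python
import operator

# Hypothetical opcode table; the original only states that each unit
# performs floating point addition or floating point multiplication.
OPS = {"fadd": operator.add, "fmul": operator.mul}


def broadcast_step(instruction, broadcast_operand, elements):
    """Model one clock cycle of the broadcast configuration.

    Every computing unit decodes the same broadcast instruction, applies
    it to the broadcast operand and its own element, and the result is
    written back to the corresponding element position.
    """
    op = OPS[instruction]  # decode, identical in all units
    return [op(broadcast_operand, x) for x in elements]


# Four element units (m = 4), two consecutive cycles.
elements = [1.0, 2.0, 3.0, 4.0]
elements = broadcast_step("fmul", 2.0, elements)  # cycle 1: scale by 2
elements = broadcast_step("fadd", 0.5, elements)  # cycle 2: add 0.5
```

Only the per-element operands differ between units, so a single instruction and a single shared operand need to be supplied per cycle, which corresponds to the memory-throughput benefit attributed to the broadcast unit.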

次に、本発明の第７の実施の形態の効果について述べる。 Next, effects of the seventh exemplary embodiment of the present invention will be described.

本発明の第７の実施の形態としてのベクトルプロセッサは、パイプラインにおける課題を回避しながらパイプラインと同等の性能を実現する際に、全ての演算において命令またはオペランドを同一とする演算を効率よく実行可能とする。 When realizing performance equivalent to that of a pipeline while avoiding the problems of pipelining, the vector processor according to the seventh embodiment of the present invention can efficiently execute operations in which the instruction or an operand is the same across all operations.

その理由は、ベクトルレジスタ部が、要素部およびブロードキャスト部を有し、各演算器が、ブロードキャスト部からブロードキャストされる命令要素またはデータ要素と、要素部から読み出される命令要素またはデータ要素とを用いて一連の処理を１ステージで実行するからである。これにより、本実施の形態は、全ての演算器において同一命令を実行したい場合や、全ての演算器において同一オペランドを用いたい場合に、メモリとのスループットが性能ネックになることを回避できることになる。 The reason is that the vector register unit has element units and a broadcast unit, and each computing unit executes a series of processes in one stage using an instruction element or data element broadcast from the broadcast unit and an instruction element or data element read from the element unit. As a result, when the same instruction is to be executed in all the computing units, or the same operand is to be used in all the computing units, this embodiment can prevent memory throughput from becoming a performance bottleneck.

（第８の実施の形態） (Eighth embodiment)
次に、本発明の第８の実施の形態について図面を参照して詳細に説明する。本実施の形態では、スカラプロセッサに対して本発明のプロセッサを適用した一例について説明する。
Next, an eighth embodiment of the present invention will be described in detail with reference to the drawings. In this embodiment, an example in which the processor of the present invention is applied to a scalar processor will be described.

まず、本発明の第８の実施の形態としてのスカラプロセッサ８の構成を図８に示す。 First, FIG. 8 shows a configuration of a scalar processor 8 as an eighth embodiment of the present invention.

図８において、スカラプロセッサ８は、命令キャッシュ部８１と、演算部８２とを備える。 In FIG. 8, the scalar processor 8 includes an instruction cache unit 81 and a calculation unit 82.

命令キャッシュ部８１は、本発明におけるデータ保持部の一実施形態を構成する。命令キャッシュ部８１は、主記憶装置８００から読み込まれる命令を保持する。例えば命令キャッシュ部８１は、複数の命令を保持可能なキャッシュとして構成されている。 The instruction cache unit 81 constitutes an embodiment of the data holding unit in the present invention. The instruction cache unit 81 holds an instruction read from the main storage device 800. For example, the instruction cache unit 81 is configured as a cache that can hold a plurality of instructions.

ここで、命令キャッシュ部８１には、演算部８２に対して連続的に命令を供給できるよう主記憶装置８００から逐次命令が読み込まれて保持される。このような命令キャッシュ部８１の構成には、公知の各種技術を適用可能である。 Here, the instruction cache unit 81 sequentially reads and holds instructions from the main storage device 800 so that instructions can be continuously supplied to the arithmetic unit 82. Various known techniques can be applied to the configuration of the instruction cache unit 81.

演算部８２は、ｎ個の演算器８３（８３＿１〜８３＿ｎ）と、インタリーブ部８４とを有する。ここで、ｎは、１以上の整数であり、上述の命令をパイプラインで処理する場合の段階数ｎに等しい。例えば、１つの命令は、フェッチ、デコード、実行、メモリアクセス、および、ライトバックの５ステージのパイプラインで処理可能である。この場合、演算部８２は、５個の演算器８３を有する。なお、図８には、ｎ＝５として５つの演算器８３を示したが、本発明における命令の段階数および演算部が有する演算器の数を限定するものではない。 The computing unit 82 includes n computing units 83 (83_1 to 83_n) and an interleaving unit 84. Here, n is an integer of 1 or more, and is equal to the number n of stages when the above-described instruction is processed in the pipeline. For example, one instruction can be processed in a 5-stage pipeline of fetch, decode, execute, memory access, and write back. In this case, the calculation unit 82 includes five calculation units 83. Although FIG. 8 shows five arithmetic units 83 with n = 5, the number of instruction stages and the number of arithmetic units included in the arithmetic unit in the present invention are not limited.

各演算器８３は、命令キャッシュ部８１から供給される１つの命令の各段階を１ステージにまとめて実行する。なお、各演算器８３には、後述のインタリーブ部８４により命令が供給される。また、各演算器８３は、命令の結果をインタリーブ部８４に入力する。 Each computing unit 83 executes each stage of one instruction supplied from the instruction cache unit 81 in one stage. An instruction is supplied to each computing unit 83 from an interleaving unit 84 described later. Each computing unit 83 inputs the result of the instruction to the interleave unit 84.

また、演算部８２は、各演算器８３間にフォワーディングパスを有する。各演算器８３によって生成されるデータは、フォワーディングパスを介して次の演算器８３に入力される。例えば、図８の例では、演算器８３＿１の実行段階の結果およびメモリアクセス段階の結果は、次の演算器８３＿２の実行段階へ入力される。 In addition, the arithmetic unit 82 has a forwarding path between the arithmetic units 83. Data generated by each computing unit 83 is input to the next computing unit 83 via the forwarding path. For example, in the example of FIG. 8, the result of the execution stage of the computing unit 83_1 and the result of the memory access stage are input to the execution stage of the next computing unit 83_2.

インタリーブ部８４は、命令キャッシュ部８１から各演算器８３に対して命令を逐次供給する。すなわち、インタリーブ部８４は、あるクロックサイクルで１つ目の命令を演算器８３＿１に入力し、次のクロックサイクルで２つ目の命令を演算器８３＿２に入力する。以降、インタリーブ部８４は、３〜ｎ個目の各命令を、続く各クロックサイクルで次々と、演算器８３＿３〜８３＿ｎに入力する。そして、インタリーブ部８４は、ｎ＋１個目以降の命令を、前の命令に基づく一連の処理を終了した演算器８３＿１から順に逐次入力していく。 The interleaving unit 84 sequentially supplies instructions from the instruction cache unit 81 to each computing unit 83. That is, the interleave unit 84 inputs the first instruction to the computing unit 83_1 in a certain clock cycle, and inputs the second instruction to the computing unit 83_2 in the next clock cycle. Thereafter, the interleave unit 84 inputs the third to n-th instructions to the arithmetic units 83_3 to 83_n one after another in each subsequent clock cycle. Then, the interleave unit 84 sequentially inputs the (n + 1) th and subsequent instructions in order from the arithmetic unit 83_1 that has completed a series of processes based on the previous instruction.

また、インタリーブ部８４は、各演算器８３から入力される演算の結果をキャッシュ（不図示）に書き戻す。 Further, the interleave unit 84 writes back the result of the calculation input from each calculator 83 to a cache (not shown).

以上のように構成されたスカラプロセッサ８の動作を以下に説明する。以下の説明では、各演算器８３は、フェッチ、デコード、実行、メモリアクセス、および、ライトバックの５段階を１ステージとして命令を実行するものとする。また、演算部８２は、命令パイプラインの段階数５に等しい５つの演算器８３＿１〜８３＿５を有するものとする。 The operation of the scalar processor 8 configured as described above will be described below. In the following description, it is assumed that each arithmetic unit 83 executes an instruction with the five steps of fetch, decode, execute, memory access, and write back as one stage. It is also assumed that the arithmetic unit 82 includes five arithmetic units 83_1 to 83_5, equal in number to the five stages of the instruction pipeline.

最初のクロックサイクルで、インタリーブ部８４は、命令キャッシュ部８１から読みだされた１つ目の命令を、演算器８３＿１に入力する。演算器８３＿１は、入力された命令をフェッチし、デコード、実行、メモリアクセス、およびライトバックを実行する。実行段階の結果およびメモリアクセス段階の結果は、演算器８３＿２にフォワーディングされる。そして、演算器８３＿１は、５クロックサイクル後の６クロックサイクル目に、結果をインタリーブ部８４に入力する。インタリーブ部８４は、入力された結果データをキャッシュへ書き戻す。 In the first clock cycle, the interleave unit 84 inputs the first instruction read from the instruction cache unit 81 to the arithmetic unit 83_1. The arithmetic unit 83_1 fetches the input instruction and performs decoding, execution, memory access, and write back. The result of the execution stage and the result of the memory access stage are forwarded to the arithmetic unit 83_2. The arithmetic unit 83_1 inputs the result to the interleave unit 84 in the sixth clock cycle, five clock cycles later. The interleave unit 84 writes the input result data back to the cache.

２クロックサイクル目で、インタリーブ部８４は、命令キャッシュ部８１から読みだされた２つ目の命令を、演算器８３＿２に入力する。演算器８３＿２は、入力された命令をフェッチし、デコード、実行、メモリアクセス、およびライトバックを実行する。実行段階の結果およびメモリアクセス段階の結果は、演算器８３＿３にフォワーディングされる。そして、演算器８３＿２は、５クロックサイクル後の７クロックサイクル目に、結果をインタリーブ部８４に入力する。インタリーブ部８４は、入力された結果データをキャッシュへ書き戻す。 In the second clock cycle, the interleave unit 84 inputs the second instruction read from the instruction cache unit 81 to the arithmetic unit 83_2. The arithmetic unit 83_2 fetches the input instruction and performs decoding, execution, memory access, and write back. The result of the execution stage and the result of the memory access stage are forwarded to the arithmetic unit 83_3. The arithmetic unit 83_2 inputs the result to the interleave unit 84 in the seventh clock cycle, five clock cycles later. The interleave unit 84 writes the input result data back to the cache.

３クロックサイクル目で、インタリーブ部８４は、命令キャッシュ部８１から読みだされた３つ目の命令を、演算器８３＿３に入力する。演算器８３＿３は、入力された命令をフェッチし、デコード、実行、メモリアクセス、およびライトバックを実行する。実行段階の結果およびメモリアクセス段階の結果は、演算器８３＿４にフォワーディングされる。そして、演算器８３＿３は、５クロックサイクル後の８クロックサイクル目に、結果をインタリーブ部８４に入力する。インタリーブ部８４は、入力された結果データをキャッシュへ書き戻す。 In the third clock cycle, the interleave unit 84 inputs the third instruction read from the instruction cache unit 81 to the arithmetic unit 83_3. The arithmetic unit 83_3 fetches the input instruction and performs decoding, execution, memory access, and write back. The result of the execution stage and the result of the memory access stage are forwarded to the arithmetic unit 83_4. The arithmetic unit 83_3 inputs the result to the interleave unit 84 in the eighth clock cycle, five clock cycles later. The interleave unit 84 writes the input result data back to the cache.

４クロックサイクル目で、インタリーブ部８４は、命令キャッシュ部８１から読みだされた４つ目の命令を、演算器８３＿４に入力する。演算器８３＿４は、入力された命令をフェッチし、デコード、実行、メモリアクセス、およびライトバックを実行する。実行段階の結果およびメモリアクセス段階の結果は、演算器８３＿５にフォワーディングされる。そして、演算器８３＿４は、５クロックサイクル後の９クロックサイクル目に、結果をインタリーブ部８４に入力する。インタリーブ部８４は、入力された結果データをキャッシュへ書き戻す。 In the fourth clock cycle, the interleave unit 84 inputs the fourth instruction read from the instruction cache unit 81 to the arithmetic unit 83_4. The arithmetic unit 83_4 fetches the input instruction and performs decoding, execution, memory access, and write back. The result of the execution stage and the result of the memory access stage are forwarded to the arithmetic unit 83_5. The arithmetic unit 83_4 inputs the result to the interleave unit 84 in the ninth clock cycle, five clock cycles later. The interleave unit 84 writes the input result data back to the cache.

５クロックサイクル目で、インタリーブ部８４は、命令キャッシュ部８１から読みだされた５つ目の命令を、演算器８３＿５に入力する。演算器８３＿５は、入力された命令をフェッチし、デコード、実行、メモリアクセス、およびライトバックを実行する。実行段階の結果およびメモリアクセス段階の結果は、演算器８３＿１にフォワーディングされる。そして、演算器８３＿５は、５クロックサイクル後の１０クロックサイクル目に、結果をインタリーブ部８４に入力する。インタリーブ部８４は、入力された結果データをキャッシュへ書き戻す。 In the fifth clock cycle, the interleave unit 84 inputs the fifth instruction read from the instruction cache unit 81 to the arithmetic unit 83_5. The arithmetic unit 83_5 fetches the input instruction and performs decoding, execution, memory access, and write back. The result of the execution stage and the result of the memory access stage are forwarded to the arithmetic unit 83_1. The arithmetic unit 83_5 inputs the result to the interleave unit 84 in the tenth clock cycle, five clock cycles later. The interleave unit 84 writes the input result data back to the cache.

６クロックサイクル目で、インタリーブ部８４は、命令キャッシュ部８１から読みだされた６つ目の命令を、演算器８３＿１に入力する。この時、演算器８３＿１は、１クロックサイクル目で入力された命令に基づく一連の処理を完了している。そこで、演算器８３＿１は、入力された命令をフェッチし、デコード、実行、メモリアクセス、およびライトバックを実行する。実行段階の結果およびメモリアクセス段階の結果は、演算器８３＿２にフォワーディングされる。そして、演算器８３＿１は、５クロックサイクル後の１１クロックサイクル目に、結果をインタリーブ部８４に入力する。インタリーブ部８４は、入力された結果データをキャッシュへ書き戻す。 In the sixth clock cycle, the interleave unit 84 inputs the sixth instruction read from the instruction cache unit 81 to the arithmetic unit 83_1. At this time, the arithmetic unit 83_1 has completed a series of processes based on the instruction input in the first clock cycle. Therefore, the arithmetic unit 83_1 fetches the input instruction and executes decoding, execution, memory access, and write back. The result of the execution stage and the result of the memory access stage are forwarded to the arithmetic unit 83_2. The computing unit 83_1 inputs the result to the interleaving unit 84 in the 11th clock cycle after 5 clock cycles. The interleaving unit 84 writes the input result data back to the cache.

以降、スカラプロセッサ８は、同様の動作を繰り返す。 Thereafter, the scalar processor 8 repeats the same operation.
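The issue pattern walked through above is a round-robin schedule, which can be sketched as follows. This is a minimal model, assuming one instruction is issued per clock cycle and that each computing unit returns its result n clock cycles after issue, as in the n = 5 example; the function name `interleave_schedule` and the tuple layout are illustrative only.

```python
def interleave_schedule(num_instructions: int, n_units: int):
    """Return (instruction, unit, issue_cycle, result_cycle) tuples.

    Instruction k (1-based) is issued in clock cycle k to unit
    ((k - 1) mod n) + 1, and its result reaches the interleave unit
    n clock cycles later. All numbers are 1-based to match the text.
    """
    schedule = []
    for k in range(num_instructions):
        instruction = k + 1
        unit = k % n_units + 1             # round-robin over the units
        issue_cycle = k + 1                # one issue per clock cycle
        result_cycle = issue_cycle + n_units
        schedule.append((instruction, unit, issue_cycle, result_cycle))
    return schedule


# n = 5 as in the walk-through: instruction 1 goes to unit 1 in cycle 1
# and returns in cycle 6, which is exactly when instruction 6 is issued
# to the same unit.
for row in interleave_schedule(6, 5):
    print(row)
```

Because a unit's result cycle coincides with the cycle in which it receives its next instruction, each unit is free again just in time and one instruction completes per clock cycle in steady state.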

次に、本発明の第８の実施の形態の効果について述べる。 Next, effects of the eighth exemplary embodiment of the present invention will be described.

本発明の第８の実施の形態としてのスカラプロセッサは、パイプラインにおけるステージ間のオーバヘッド、回路設計の複雑化および消費電力の増大を抑えながら、ｎ段階のパイプラインと同等の高速化を実現することができる。 The scalar processor according to the eighth embodiment of the present invention achieves the same speed as an n-stage pipeline while suppressing the overhead between stages in the pipeline, the complexity of circuit design, and the increase in power consumption. be able to.

その理由は、演算部が、各演算器により命令がパイプライン処理されると仮定した場合の段階数と同数の演算器を有し、インタリーブ部が、各クロックサイクルでそれぞれの演算器に命令を逐次供給するからである。そして、各演算器は、ｎ段のパイプラインで実行していた命令を１ステージとして実行するからである。これにより、本実施の形態のスカラプロセッサは、命令をパイプライン処理した場合と同様に各命令を連続的に開始して並行して実行することになる。しかも、本実施の形態のスカラプロセッサは、命令をパイプライン処理する場合のようにステージ間のレジスタを必要としない。したがって、本実施の形態は、ステージ間におけるセットアップ、クロックスキュー等のオーバーヘッドがない。また、本実施の形態のスカラプロセッサでは、その回路設計において、ステージ間のバランスやパイプラインハザードを考慮する必要がない。 The reason is that the arithmetic unit has as many computing units as the number of stages the instruction would have if it were pipelined, and the interleave unit supplies instructions to the respective computing units one after another, one per clock cycle. In addition, each computing unit executes, as a single stage, an instruction that would otherwise be executed in an n-stage pipeline. As a result, the scalar processor of the present embodiment starts the instructions one after another and executes them in parallel, just as when instructions are pipelined. Moreover, the scalar processor of this embodiment does not require registers between stages, unlike the case of pipelined instruction processing. Therefore, this embodiment has no inter-stage overhead such as setup time and clock skew. Furthermore, the circuit design of the scalar processor of this embodiment does not need to consider the balance between stages or pipeline hazards.

また、本実施の形態のスカラプロセッサでは、１つの演算器を用いて命令をｎ段のパイプラインで処理する場合と比較して、回路規模がｎ倍になるが、動作周波数が略１／ｎとなる。前述のように、プロセッサの消費電力が、動作周波数×回路規模×電圧×電圧に比例するとして、本実施の形態の消費電力は、パイプライン処理の場合とほぼ変わらない。 Further, in the scalar processor of the present embodiment, the circuit scale is n times as compared with the case where an instruction is processed by an n-stage pipeline using one arithmetic unit, but the operating frequency is approximately 1 / n. It becomes. As described above, assuming that the power consumption of the processor is proportional to the operating frequency × circuit scale × voltage × voltage, the power consumption of the present embodiment is almost the same as in the case of pipeline processing.
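The power argument above follows directly from the stated proportionality. The sketch below assumes the same simple model as the text (power proportional to operating frequency × circuit scale × voltage × voltage) with normalized, dimensionless values; the function name `relative_power` and the choice n = 5 are illustrative.

```python
from fractions import Fraction


def relative_power(frequency, circuit_scale, voltage):
    """Relative power under the model: frequency x circuit scale x voltage^2."""
    return frequency * circuit_scale * voltage ** 2


n = 5  # number of pipeline stages, as in the n = 5 example

# Baseline: one pipelined computing unit at normalized frequency 1.
pipelined = relative_power(Fraction(1), Fraction(1), Fraction(1))

# This embodiment: n units (circuit scale x n) at roughly 1/n the frequency.
interleaved = relative_power(Fraction(1, n), Fraction(n), Fraction(1))
```

Under this model the factor n in circuit scale cancels the factor 1/n in frequency, so the two values are equal; real designs would differ by whatever the simple proportionality leaves out.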

このように、本実施の形態は、パイプラインのステージ段階数を増やす代わりに、並列に実行可能な演算器数を増やすことにより、パイプラインにおける問題を回避しながら、パイプラインとほぼ変わらない消費電力で同等の高スループットを実現している。 In this way, instead of increasing the number of pipeline stages, this embodiment increases the number of computing units that can operate in parallel, and thereby achieves equally high throughput at almost the same power consumption as a pipeline while avoiding the problems of pipelining.

（第９の実施の形態） (Ninth embodiment)
次に、本発明の第９の実施の形態について図面を参照して詳細に説明する。本実施の形態は、本発明の第８の実施の形態に対して、演算部が有する演算器数をＮ（Ｎは１以上の整数）倍とする点が異なる。なお、本実施の形態の説明および各図面において、本発明の第８の実施の形態と同一の構成には同一の符号を付して本実施の形態における詳細な説明を省略する。
Next, a ninth embodiment of the present invention will be described in detail with reference to the drawings. This embodiment is different from the eighth embodiment of the present invention in that the number of arithmetic units included in the arithmetic unit is N (N is an integer of 1 or more) times. Note that, in the description of the present embodiment and the drawings, the same components as those in the eighth embodiment of the present invention are denoted by the same reference numerals, and detailed description in the present embodiment is omitted.

本発明の第９の実施の形態としてのスカラプロセッサ９の構成を図９に示す。図９において、スカラプロセッサ９は、命令キャッシュ部８１と、演算部９２とを備える。 FIG. 9 shows the configuration of the scalar processor 9 as the ninth embodiment of the present invention. In FIG. 9, the scalar processor 9 includes an instruction cache unit 81 and a calculation unit 92.

演算部９２は、ｎ×Ｎ個の演算器８３（８３＿１〜８３＿ｎＮ）と、インタリーブ部９４とを有する。ここで、ｎは、１以上の整数であり、命令をパイプライン処理する場合の段階数ｎに等しい。また、Ｎは、１以上の整数である。 The computing unit 92 includes n × N computing units 83 (83_1 to 83_nN) and an interleaving unit 94. Here, n is an integer of 1 or more, and is equal to the number n of stages when the instruction is pipelined. N is an integer of 1 or more.

インタリーブ部９４は、本発明の第８の実施の形態におけるインタリーブ部８４と同様に構成される。ただし、命令キャッシュ部８１から読み出される命令を逐次供給する先となる演算器８３の数が異なる。 Interleaving unit 94 is configured in the same manner as interleaving unit 84 in the eighth embodiment of the present invention. However, the number of arithmetic units 83 to which instructions read from the instruction cache unit 81 are sequentially supplied is different.

具体的にはインタリーブ部９４は、命令キャッシュ部８１から読み出される命令を、各クロックサイクルで次々と、ｎ×Ｎ個の演算器８３＿１〜８３＿ｎＮに入力する。そして、インタリーブ部９４は、ｎ×Ｎ＋１個目以降の命令を、前の命令に基づく一連の処理を終了した演算器８３＿１から順に逐次入力していく。 Specifically, the interleave unit 94 inputs instructions read from the instruction cache unit 81 to n × N arithmetic units 83_1 to 83_nN one after another in each clock cycle. Then, the interleaving unit 94 sequentially inputs n × N + 1 and subsequent instructions from the arithmetic unit 83_1 that has completed a series of processes based on the previous instruction.

以上のように構成されたスカラプロセッサ９の動作を以下に説明する。以下の説明では、各演算器８３は、フェッチ、デコード、実行、メモリアクセス、および、ライトバックの５段階を１ステージとして命令を処理するものとする。また、演算部９２は、命令パイプラインの段階数５の２倍に等しい１０個の演算器８３＿１〜８３＿１０を有するものとする。 The operation of the scalar processor 9 configured as described above will be described below. In the following description, it is assumed that each arithmetic unit 83 processes an instruction with five stages of fetch, decode, execution, memory access, and write back as one stage. Further, it is assumed that the arithmetic unit 92 includes ten arithmetic units 83_1 to 83_10 that are equal to twice the number of stages of the instruction pipeline.

最初のクロックサイクルで、インタリーブ部９４は、命令キャッシュ部８１から読みだされた１つ目の命令を、演算器８３＿１に入力する。演算器８３＿１は、入力された命令をフェッチし、デコード、実行、メモリアクセス、およびライトバックを行う。実行段階の結果およびメモリアクセス段階の結果は、次の演算器８３＿２にフォワーディングされる。そして、演算器８３＿１は、１０クロックサイクル後の１１クロックサイクル目に、結果をインタリーブ部９４に入力する。インタリーブ部９４は、入力された結果データをキャッシュへ書き戻す。 In the first clock cycle, the interleave unit 94 inputs the first instruction read from the instruction cache unit 81 to the arithmetic unit 83_1. The arithmetic unit 83_1 fetches the input instruction, and performs decoding, execution, memory access, and write back. The result of the execution stage and the result of the memory access stage are forwarded to the next computing unit 83_2. Then, the arithmetic unit 83_1 inputs the result to the interleaving unit 94 in the 11th clock cycle after 10 clock cycles. The interleaving unit 94 writes the input result data back to the cache.

以降、２クロックサイクル目〜１０クロックサイクル目においても同様に、インタリーブ部９４は、命令キャッシュ部８１から読みだされた命令を、演算器８３＿２〜８３＿１０に逐次入力する。演算器８３＿２〜８３＿１０は、入力された命令をフェッチし、デコード、実行、メモリアクセス、およびライトバックをそれぞれ行う。実行段階の結果およびメモリアクセス段階の結果は、それぞれ次の演算器８３にフォワーディングされる。そして、演算器８３＿２〜８３＿１０は、それぞれ１０クロックサイクル後の１２〜２０クロックサイクル目に、結果をインタリーブ部９４に入力する。インタリーブ部９４は、入力された結果データをキャッシュへそれぞれ書き戻す。 Thereafter, similarly in the second clock cycle to the tenth clock cycle, the interleaving unit 94 sequentially inputs the instructions read from the instruction cache unit 81 to the arithmetic units 83_2 to 83_10. The arithmetic units 83_2 to 83_10 fetch input instructions and perform decoding, execution, memory access, and write back, respectively. The result of the execution stage and the result of the memory access stage are forwarded to the next computing unit 83, respectively. The arithmetic units 83_2 to 83_10 input the results to the interleaving unit 94 at 12th to 20th clock cycles after 10 clock cycles, respectively. The interleaving unit 94 writes the input result data back to the cache.

１１クロックサイクル目で、インタリーブ部９４は、命令キャッシュ部８１から読みだされた１１個目の命令を、演算器８３＿１に入力する。この時、演算器８３＿１は、１クロックサイクル目で入力された命令に基づく一連の処理を完了している。そこで、演算器８３＿１は、入力された命令をフェッチし、デコード、実行、メモリアクセス、およびライトバックを行う。実行段階の結果およびメモリアクセス段階の結果は、次の演算器８３＿２にフォワーディングされる。そして、演算器８３＿１は、１０クロックサイクル後の２１クロックサイクル目に、結果をインタリーブ部９４に入力する。インタリーブ部９４は、入力された結果データをキャッシュへ書き戻す。 In the eleventh clock cycle, the interleave unit 94 inputs the eleventh instruction read from the instruction cache unit 81 to the arithmetic unit 83_1. At this time, the arithmetic unit 83_1 has completed a series of processes based on the instruction input in the first clock cycle. Therefore, the arithmetic unit 83_1 fetches the input instruction and performs decoding, execution, memory access, and write back. The result of the execution stage and the result of the memory access stage are forwarded to the next computing unit 83_2. Then, the arithmetic unit 83_1 inputs the result to the interleaving unit 94 at the 21st clock cycle after 10 clock cycles. The interleaving unit 94 writes the input result data back to the cache.

以降、スカラプロセッサ９は、同様の動作を繰り返す。 Thereafter, the scalar processor 9 repeats the same operation.

次に、本発明の第９の実施の形態の効果について述べる。 Next, effects of the ninth exemplary embodiment of the present invention will be described.

本発明の第９の実施の形態としてのスカラプロセッサは、パイプラインにおける課題を回避しながら、ｎ段階のパイプラインよりさらなる高性能を実現する。 The scalar processor according to the ninth embodiment of the present invention achieves higher performance than the n-stage pipeline while avoiding problems in the pipeline.

その理由は、演算部が、本発明の第８の実施の形態における演算器をｎのＮ倍個有し、インタリーブ部が、ｎ×Ｎ個の演算器に対して命令を逐次入力するからである。これにより、本実施の形態は、本発明の第８の実施の形態に対してクロックサイクルを１／Ｎにして動作周波数をＮ倍にすることができ、その結果、本発明の第８の実施の形態に対してＮ倍のスループットを得られることになる。 The reason is that the arithmetic unit has n × N of the computing units of the eighth embodiment of the present invention, and the interleave unit sequentially inputs instructions to the n × N computing units. As a result, compared with the eighth embodiment, the present embodiment can shorten the clock cycle to 1/N, that is, raise the operating frequency N times, and consequently obtains N times the throughput of the eighth embodiment.
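The N-fold throughput claim can be reproduced with a short calculation. This is a sketch under an explicit assumption not spelled out in the original: the time one computing unit needs to process a whole instruction (called `unit_latency` here, a hypothetical name) stays the same when more units are added, so the issue clock period is that latency divided by the number of units.

```python
from fractions import Fraction


def throughput(num_units: int, unit_latency: int) -> Fraction:
    """Instructions completed per unit time with round-robin issue.

    Assumption: one instruction is issued per clock period, and the clock
    period is unit_latency / num_units so that a unit is free again
    exactly when its turn comes around.
    """
    clock_period = Fraction(unit_latency) / num_units
    return 1 / clock_period


n, N = 5, 2          # values used in the walk-through (10 units in total)
latency = 10         # arbitrary time units for one whole instruction

eighth = throughput(n, latency)        # n units
ninth = throughput(n * N, latency)     # n x N units
gain = ninth / eighth                  # equals N under this assumption
```

With n = 5 and N = 2 the clock period halves and the throughput doubles, matching the factor N stated in the text.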

なお、本発明の第８および第９の実施の形態において、命令キャッシュ部８１に保持される命令が、依存関係がないよう制約されている場合について考える。例えば、コンパイル時に、主記憶装置８００に、あらかじめ依存関係がないように制約された命令が格納されているようなケースが想定される。この場合、各実施の形態は、図１０に示すように、演算器８３間のフォワーディングパスを省略可能となる。これにより、本実施の形態は、回路を簡略化することができ、回路規模の削減および消費電力の削減という効果をさらに奏することができる。 In the eighth and ninth embodiments of the present invention, a case is considered in which the instructions held in the instruction cache unit 81 are restricted so as not to have a dependency relationship. For example, a case may be assumed in which instructions that are previously constrained so as to have no dependency are stored in the main storage device 800 during compilation. In this case, in each embodiment, as shown in FIG. 10, the forwarding path between the computing units 83 can be omitted. As a result, this embodiment can simplify the circuit, and can further achieve the effects of reducing the circuit scale and power consumption.

（第１０の実施の形態） (Tenth embodiment)
次に、本発明の第１０の実施の形態について図面を参照して詳細に説明する。本実施の形態は、本発明の第９の実施の形態に対して、演算器の個数に合わせて命令キャッシュ部を分割した点が異なる。なお、本実施の形態の説明および各図面において、本発明の第９の実施の形態と同一の構成には同一の符号を付して本実施の形態における詳細な説明を省略する。
Next, a tenth embodiment of the present invention will be described in detail with reference to the drawings. This embodiment is different from the ninth embodiment of the present invention in that the instruction cache unit is divided according to the number of arithmetic units. Note that in the description of the present embodiment and the drawings, the same components as those in the ninth embodiment of the present invention are denoted by the same reference numerals, and detailed description thereof will be omitted.

本発明の第１０の実施の形態としてのスカラプロセッサ１００の構成を図１１に示す。図１１において、スカラプロセッサ１００は、命令キャッシュ部１０１と、演算部１０２とを備える。 FIG. 11 shows the configuration of a scalar processor 100 as a tenth embodiment of the present invention. In FIG. 11, the scalar processor 100 includes an instruction cache unit 101 and a calculation unit 102.

命令キャッシュ部１０１は、本発明におけるデータ保持部の一実施形態を構成する。命令キャッシュ部１０１は、後述の演算部１０２が有する演算器１０３と同数であるｎ×Ｎ個の分割部１０５（１０５＿１〜１０５＿ｎＮ）に分割される。分割部１０５は、主記憶装置８００から読み込まれる命令を保持する。また、各分割部１０５は、ｎ×Ｎ個の演算器１０３のいずれかに対応づけられる。 The instruction cache unit 101 constitutes an embodiment of the data holding unit in the present invention. The instruction cache unit 101 is divided into n × N division units 105 (105_1 to 105_nN), the same number as the arithmetic units 103 included in the calculation unit 102 described later. Each division unit 105 holds instructions read from the main storage device 800. Each division unit 105 is associated with one of the n × N arithmetic units 103.

演算部１０２は、ｎ×Ｎ個の演算器１０３（１０３＿１〜１０３＿ｎＮ）を有する。ここで、ｎは、１以上の整数であり、命令をパイプライン処理する場合の段階数ｎに等しい。また、Ｎは、１以上の整数である。なお、図１１において、ｎ＝５、Ｎ＝２の場合として１０個の演算器１０３および１０個の分割部１０５を示したが、本発明におけるｎ、Ｎの値、演算部が有する演算器の数、および、命令キャッシュ部の分割数を限定するものではない。 The computing unit 102 includes n × N computing units 103 (103_1 to 103_nN). Here, n is an integer of 1 or more, and is equal to the number n of stages when the instruction is pipelined. N is an integer of 1 or more. In FIG. 11, 10 arithmetic units 103 and 10 division units 105 are shown as n = 5 and N = 2, but the values of n and N in the present invention, the arithmetic units included in the arithmetic units, The number and the division number of the instruction cache unit are not limited.

各演算器１０３は、本発明の第９の実施の形態における演算器８３とほぼ同様に構成されるが、対応する分割部１０５から命令を取得する点が異なる。また、各演算器１０３は、命令の結果をキャッシュ（不図示）に書き戻す。 Each computing unit 103 is configured in substantially the same manner as the computing unit 83 in the ninth embodiment of the present invention, except that an instruction is acquired from the corresponding dividing unit 105. Each computing unit 103 writes the result of the instruction back to a cache (not shown).

以上のように構成されたスカラプロセッサ１００の動作を以下に説明する。以下の説明では、各演算器１０３は、フェッチ、デコード、実行、メモリアクセス、および、ライトバックの５段階を１ステージとして命令を処理するものとする。また、演算部１０２は、命令パイプラインの段階数５の２倍に等しい１０個の演算器１０３＿１〜１０３＿１０を有するものとする。また、命令キャッシュ部１０１は、演算器１０３と同数である１０個の分割部１０５（１０５＿１〜１０５＿１０）に分割されるものとする。 The operation of the scalar processor 100 configured as described above will be described below. In the following description, it is assumed that each arithmetic unit 103 processes an instruction with five stages of fetch, decode, execution, memory access, and write back as one stage. Further, it is assumed that the arithmetic unit 102 includes ten arithmetic units 103_1 to 103_10 that are equal to twice the number of stages of the instruction pipeline. The instruction cache unit 101 is divided into ten division units 105 (105_1 to 105_10), which is the same number as the arithmetic unit 103.

最初のクロックサイクルで、演算器１０３＿１〜１０３＿１０は、分割部１０５＿１〜１０５＿１０から読みだされた命令をそれぞれ入力として得る。演算器１０３＿１〜１０３＿１０は、入力された命令をフェッチし、デコード、実行、メモリアクセス、およびライトバックをそれぞれ行う。そして、演算器１０３＿１〜１０３＿１０は、１クロックサイクル後の２クロックサイクル目に、結果をキャッシュへそれぞれ書き戻す。 In the first clock cycle, the arithmetic units 103_1 to 103_10 respectively receive as inputs the instructions read from the division units 105_1 to 105_10. The arithmetic units 103_1 to 103_10 each fetch the input instruction and perform decoding, execution, memory access, and write back. Then, the arithmetic units 103_1 to 103_10 each write the result back to the cache in the second clock cycle, one clock cycle later.

２クロックサイクル目で、演算器１０３＿１〜１０３＿１０は、分割部１０５＿１〜１０５＿１０から読みだされた命令をそれぞれ入力として得る。この時、演算器１０３＿１〜１０３＿１０は、１クロックサイクル目で入力された命令に基づく一連の処理をそれぞれ完了している。そこで、演算器１０３＿１〜１０３＿１０は、入力された命令をフェッチし、デコード、実行、メモリアクセス、およびライトバックをそれぞれ行う。そして、演算器１０３＿１〜１０３＿１０は、１クロックサイクル後の３クロックサイクル目に、結果をキャッシュへそれぞれ書き戻す。 In the second clock cycle, the arithmetic units 103_1 to 103_10 respectively receive as inputs the instructions read from the division units 105_1 to 105_10. At this time, the arithmetic units 103_1 to 103_10 have each completed the series of processes based on the instruction input in the first clock cycle. The arithmetic units 103_1 to 103_10 therefore each fetch the input instruction and perform decoding, execution, memory access, and write back. Then, the arithmetic units 103_1 to 103_10 each write the result back to the cache in the third clock cycle, one clock cycle later.

以降、スカラプロセッサ１００は、同様の動作を繰り返す。 Thereafter, the scalar processor 100 repeats the same operation.
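The cycle-by-cycle behaviour described above can be sketched in Python. This is only an illustrative model, not the circuit of the embodiment: the function name `run_partitioned`, the list-of-lists representation of the division units, and the string "instructions" are all assumptions made for the example. Each arithmetic unit is modelled as performing all five phases as one stage, so an instruction received in cycle c has its result written back in cycle c + 1.

```python
from typing import List, Tuple

# The five phases that each arithmetic unit performs as a single stage.
PHASES = ("fetch", "decode", "execute", "memory access", "write-back")


def run_partitioned(partitions: List[List[str]]) -> List[Tuple[int, int, str]]:
    """Hypothetical model of the partitioned instruction cache.

    partitions[k] holds the instructions of division unit k+1, which feeds
    only arithmetic unit k+1. Returns (write_back_cycle, unit, instruction)
    tuples in the order the instructions are issued.
    """
    results = []
    depth = max((len(p) for p in partitions), default=0)
    for cycle in range(1, depth + 1):
        for unit, part in enumerate(partitions, start=1):
            if cycle <= len(part):
                instr = part[cycle - 1]
                # All PHASES complete before the next instruction arrives,
                # so the result is written back one clock cycle later.
                results.append((cycle + 1, unit, instr))
    return results


if __name__ == "__main__":
    # Ten division units, three instructions each (labels are made up).
    parts = [[f"i{u}_{c}" for c in range(1, 4)] for u in range(1, 11)]
    for row in run_partitioned(parts)[:3]:
        print(row)
```

Because each unit reads only from its own division unit, the model needs no step that distributes instructions across units, which mirrors the absence of an interleave unit in this embodiment.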

Next, the effects of the tenth embodiment of the present invention are described.

The scalar processor according to the tenth embodiment of the present invention can reduce circuit scale and further reduce power consumption while achieving performance equal to or better than that of an n-stage pipeline and avoiding the problems of pipelining.

This is because the instruction cache unit is divided into n × N division units, the same number as the arithmetic units, and each arithmetic unit executes, in one stage, the series of processes based on the instruction read from its division unit.

As a result, the present embodiment eliminates the interleave unit that the ninth embodiment of the present invention required in order to achieve performance equivalent to an n-stage pipeline with n × N arithmetic units. The circuit scale is therefore smaller than in the ninth embodiment. In addition, compared with the ninth embodiment, the clock cycle of the interface between the instruction cache unit and the arithmetic units is n times as long. Consequently, the present embodiment obtains the same effects as the ninth embodiment while further reducing circuit scale and power consumption.

(Eleventh embodiment)
Next, an eleventh embodiment of the present invention is described in detail with reference to the drawings. This embodiment differs from the tenth embodiment of the present invention in that it has the same number of arithmetic units as instruction registers. In the description of this embodiment and in the drawings, components identical to those of the tenth embodiment are given the same reference numerals, and their detailed description is omitted.

FIG. 12 shows the configuration of a scalar processor 110 according to the eleventh embodiment of the present invention. In FIG. 12, the scalar processor 110 includes M instruction registers 111 (111_1 to 111_M) and an operation unit 112.

The M instruction registers 111 hold instructions read from the main storage device 800.

The operation unit 112 has M arithmetic units 113 (113_1 to 113_M), the same number as the instruction registers 111.

Each arithmetic unit 113 is configured in substantially the same way as the arithmetic unit 103 of the tenth embodiment, except that it obtains its instruction from the corresponding instruction register 111. Each arithmetic unit 113 writes its processing result back to a register (not shown).

The operation of the scalar processor 110 configured as described above is explained below. In this explanation, each arithmetic unit 113 processes an instruction by treating the five phases of fetch, decode, execute, memory access, and write-back as a single stage.

In the first clock cycle, the arithmetic unit 113_1 receives as input the instruction read from the instruction register 111_1. The arithmetic unit 113_1 fetches the input instruction and performs decode, execute, memory access, and write-back. One clock cycle later, in the second clock cycle, the arithmetic unit 113_1 writes the result back to the register.

Likewise, in the first clock cycle, the arithmetic units 113_2 to 113_M each receive as input the instruction read from the corresponding instruction register 111_2 to 111_M. Each of the arithmetic units 113_2 to 113_M fetches its input instruction and performs decode, execute, memory access, and write-back. One clock cycle later, in the second clock cycle, the arithmetic units 113_2 to 113_M each write their results back to the registers.

In the second clock cycle, the arithmetic units 113_1 to 113_M each receive as input the instruction read from the corresponding instruction register 111_1 to 111_M. Each arithmetic unit fetches its input instruction and performs decode, execute, memory access, and write-back. One clock cycle later, in the third clock cycle, the arithmetic units 113_1 to 113_M each write their results back to the registers.

Thereafter, the scalar processor 110 repeats the same operation.

Next, the effects of the eleventh embodiment of the present invention are described.

The scalar processor according to the eleventh embodiment of the present invention can achieve performance equal to or higher than that of a pipeline while avoiding the problems of pipelining.

This is because the operation unit has the same number of arithmetic units as the number M of instruction registers, and each arithmetic unit reads an instruction from its corresponding instruction register and executes the series of processes in one stage.

Thus, because the present embodiment includes M arithmetic units, each of which performs an n-phase instruction in one stage, it can achieve M/n times the performance of an n-stage pipeline.
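The M/n figure can be checked with a short calculation. The sketch below rests on idealized assumptions that the text does not spell out: the n-stage pipeline never stalls and completes one instruction per clock cycle of length T/n, while each single-stage arithmetic unit completes one instruction per clock cycle of length T, where T is the time to perform all n phases. The function names are hypothetical.

```python
from fractions import Fraction


def pipeline_rate(n: int, t_instr: Fraction) -> Fraction:
    """Instructions per unit time of an ideal n-stage pipeline.

    Assumes a clock cycle of t_instr / n and one completion per cycle.
    """
    return Fraction(n) / Fraction(t_instr)


def parallel_rate(m: int, t_instr: Fraction) -> Fraction:
    """Instructions per unit time of M single-stage arithmetic units.

    Assumes each unit completes one instruction per cycle of length t_instr.
    """
    return Fraction(m) / Fraction(t_instr)


def relative_performance(m: int, n: int, t_instr: Fraction = Fraction(1)) -> Fraction:
    """Ratio of the two rates; t_instr cancels, leaving M / n."""
    return parallel_rate(m, t_instr) / pipeline_rate(n, t_instr)


if __name__ == "__main__":
    print(relative_performance(10, 5))  # M = 10 units versus a 5-stage pipeline
```

Under these assumptions M = n gives parity with the pipeline, and any M larger than n exceeds it; real pipelines with stalls would fall below the idealized rate used here.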

(Twelfth embodiment)
Next, a twelfth embodiment of the present invention is described in detail with reference to the drawings. This embodiment is an example in which the scalar processor of the eleventh embodiment of the present invention is applied to each arithmetic unit of the seventh embodiment. In the description of this embodiment and in the drawings, components identical to those of the seventh and eleventh embodiments, and steps that operate in the same way, are given the same reference numerals, and their detailed description is omitted.

FIG. 13 shows the configuration of a processor 120 according to the twelfth embodiment of the present invention. In FIG. 13, the processor 120 includes a vector register unit 71 and an operation unit 122.

The operation unit 122 has m scalar processors 110 (110_1 to 110_m), the same number as the elements of the vector register. Each scalar processor 110 obtains vector data and a vector instruction from the broadcast unit 77 and the corresponding element unit 66, and executes the operation indicated by each instruction element on each data element. In doing so, each scalar processor 110 can use its M arithmetic units 113 to execute M processes based on M elements of the broadcast vector data and vector instruction. For example, each scalar processor 110 may read the data elements that serve as operands from either or both of the element unit 66 and the broadcast unit 77. Each arithmetic unit 113 may read its instruction element from either the element unit 66 or the broadcast unit 77. This allows the operation unit 122 to apply the same instruction to different operands in all of the scalar processors 110. Alternatively, it allows the operation unit 122 to execute different operations using at least one common operand in all of the scalar processors 110.
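The two usage modes named at the end of this paragraph can be illustrated with a small Python sketch. It is a functional model under assumed names (`same_instruction`, `same_operand`) and assumes binary operations on floating-point values; it says nothing about how the hardware routes the data.

```python
import operator
from typing import Callable, List, Sequence

BinaryOp = Callable[[float, float], float]


def same_instruction(op: BinaryOp, broadcast_operand: float,
                     elements: Sequence[float]) -> List[float]:
    """Mode 1: one broadcast instruction applied to a different operand
    (one per element unit) in every scalar processor."""
    return [op(broadcast_operand, x) for x in elements]


def same_operand(ops: Sequence[BinaryOp], broadcast_operand: float,
                 elements: Sequence[float]) -> List[float]:
    """Mode 2: a different instruction element per scalar processor, all
    sharing one broadcast operand."""
    if len(ops) != len(elements):
        raise ValueError("one instruction element is needed per data element")
    return [op(broadcast_operand, x) for op, x in zip(ops, elements)]


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0]
    print(same_instruction(operator.add, 10.0, data))
    print(same_operand([operator.add, operator.mul, operator.sub], 10.0, data))
```

In both modes the shared item is supplied once and reused by every scalar processor, which is the property the broadcast unit 77 is meant to provide.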

The operation of the processor 120 configured as described above is explained below. In this explanation, the broadcast unit 77 broadcasts a vector instruction and vector data representing one of the operands, and the data element representing the other operand is read from the element unit 66.

In the first clock cycle, the scalar processor 110_1 receives as input the vector instruction and vector data broadcast from the broadcast unit 77 and the first data element read from the element unit 66_1. The scalar processor 110_1 carries out the processing based on the input instruction and data in the second clock cycle, one clock cycle later.

Likewise, in the first clock cycle, the scalar processors 110_2 to 110_m each receive as input the vector instruction and vector data broadcast from the broadcast unit 77 and the data element read from the corresponding element unit 66_2 to 66_m. The scalar processors 110_2 to 110_m carry out the processing based on the input instruction and data in the second clock cycle, one clock cycle later.

In the second clock cycle, the scalar processors 110_1 to 110_m each receive as input the vector instruction and vector data broadcast from the broadcast unit 77 and the data element read from the corresponding element unit 66_1 to 66_m. The scalar processors 110_1 to 110_m carry out the processing based on the input instruction and data in the third clock cycle, one clock cycle later.

Thereafter, the processor 120 repeats the same operation.
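One clock cycle of this operation can be modelled as follows. The sketch is an assumption-laden illustration: the names `scalar_processor_step` and `processor_cycle` are invented, and pairing the k-th instruction element with the k-th broadcast data element inside each scalar processor is one plausible reading that the description does not fix.

```python
import operator
from typing import Callable, List, Sequence

BinaryOp = Callable[[float, float], float]


def scalar_processor_step(instr_elems: Sequence[BinaryOp],
                          bcast_elems: Sequence[float],
                          own_element: float) -> List[float]:
    """One scalar processor with M arithmetic units.

    Assumption: arithmetic unit k applies broadcast instruction element k
    to (broadcast data element k, this processor's own data element).
    """
    return [f(b, own_element) for f, b in zip(instr_elems, bcast_elems)]


def processor_cycle(instr_elems: Sequence[BinaryOp],
                    bcast_elems: Sequence[float],
                    element_units: Sequence[float]) -> List[List[float]]:
    """One clock cycle of the whole processor: every scalar processor j
    receives the same broadcast items plus the data element of its own
    element unit j."""
    return [scalar_processor_step(instr_elems, bcast_elems, e)
            for e in element_units]


if __name__ == "__main__":
    # M = 2 arithmetic units per scalar processor, m = 3 scalar processors.
    print(processor_cycle([operator.add, operator.mul], [10.0, 10.0],
                          [1.0, 2.0, 3.0]))
```

The broadcast instruction and data appear once in the call and are reused by all m scalar processors, while only the per-element operand differs between them.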

Next, the effects of the twelfth embodiment of the present invention are described.

The processor according to the twelfth embodiment of the present invention can efficiently execute multiple operations that share the same instruction or operand, while achieving performance equal to or better than that of a pipeline and avoiding the problems of pipelining.

This is because the vector register unit has element units and a broadcast unit, the operation unit has as many scalar processors of the eleventh embodiment as there are element units, and each scalar processor executes a series of processes in one stage using the vector instruction or vector data broadcast from the broadcast unit and the instruction element or data element read from the element unit. As a result, when the same instruction is to be executed in all arithmetic units, or the same operand is to be used in all arithmetic units, the present embodiment can prevent memory throughput from becoming a performance bottleneck.

In the second to seventh embodiments of the present invention described above, the description centered on examples in which the floating-point operation executed by each arithmetic unit of the vector processor consists of four phases. However, neither the type of operation executed by each arithmetic unit nor its number of phases is limited to this.

In the second embodiment of the present invention described above, the description centered on an example in which the series of processes executed by each arithmetic unit of the vector processor is floating-point addition, but the type of operation is not limited to this.

In the third to seventh embodiments of the present invention described above, the description centered on examples in which each arithmetic unit can select and execute either floating-point addition or floating-point multiplication, but neither the types nor the number of selectable operations is limited to this.

In the eighth to eleventh embodiments of the present invention described above, the description centered on examples in which the instruction executed by each arithmetic unit of the scalar processor consists of five phases, but neither the processing performed in each phase of the instruction nor the number of phases is limited to this.

The embodiments described above may be implemented in any appropriate combination.

The present invention is not limited to the embodiments described above and can be implemented in various forms.

Some or all of the embodiments described above may also be described as in the following appendices, but are not limited to them.
(Appendix 1)
A processor comprising:
a data holding unit that holds each piece of data read from a main storage device; and
an operation unit that executes, in parallel, a series of processes based on each piece of data, using a plurality of arithmetic units each of which finishes the series of processes based on one piece of data before starting the series of processes based on the next piece of data.
(Appendix 2)
The processor according to appendix 1, wherein the operation unit further includes an interleave unit that interleaves the data held in the data holding unit across the plurality of arithmetic units.
(Appendix 3)
The processor according to appendix 1, wherein the data holding unit is divided into as many divided portions as there are arithmetic units, each divided portion corresponding to one arithmetic unit, and
each arithmetic unit executes the series of processes based on the data held in its corresponding divided portion.
(Appendix 4)
The processor according to any one of appendices 1 to 3, wherein, when the series of processes consists of n phases (n is an integer of 1 or more), the operation unit has n × N of the arithmetic units (N is an integer of 1 or more).
(Appendix 5)
The processor according to any one of appendices 1 to 4, wherein the data holding unit further holds data to be broadcast to each of the arithmetic units, and
each arithmetic unit executes the series of processes further based on the broadcast data.
(Appendix 6)
The processor according to any one of appendices 1 to 5, wherein the operation unit has, between the arithmetic units, a forwarding path that inputs data generated during the series of processes in each arithmetic unit to another arithmetic unit.
(Appendix 7)
The processor according to any one of appendices 1 to 5, wherein the data holding unit is configured as a vector register that holds vector data composed of data elements, and
each arithmetic unit executes an operation on a data element as the series of processes.
(Appendix 8)
The processor according to appendix 7, wherein the data holding unit further holds a vector instruction composed of instruction elements each representing the content of an operation on a data element, and
each arithmetic unit executes, as the series of processes, the operation indicated by the instruction element corresponding to each data element.
(Appendix 9)
The processor according to appendix 7 or appendix 8, wherein the operation unit has the same number of arithmetic units as the number of elements of the vector register, and
each arithmetic unit executes the series of processes on its corresponding data element.
(Appendix 10)
The processor according to any one of appendices 1 to 5, wherein the data holding unit holds information representing an instruction for each arithmetic unit, and
each arithmetic unit finishes the series of processes based on one instruction before starting the series of processes based on the next instruction.
(Appendix 11)
A processing method of a processor, wherein a series of processes based on each piece of data read from a main storage device and held in a data holding unit is executed in parallel, using a plurality of arithmetic units each of which finishes the series of processes based on one piece of data before starting the series of processes based on the next piece of data.
(Appendix 12)
The processing method of a processor according to appendix 11, wherein the series of processes based on each piece of data is executed in parallel by interleaving the data held in the data holding unit across the plurality of arithmetic units.

1, 120 Processor
2, 3, 4, 5, 6, 7 Vector processor
8, 9, 100, 110 Scalar processor
11 Data holding unit
12, 22, 32, 42, 52, 62, 72, 82, 92, 102, 112, 122 Operation unit
13, 23, 33, 53, 63, 73, 83, 103, 113 Arithmetic unit
21, 31, 51, 61, 71 Vector register unit
24, 34, 44, 84, 94 Interleave unit
55, 105 Division unit
66 Element unit
77 Broadcast unit
81, 101 Instruction cache unit
111 Instruction register
800 Main storage device
801 Memory
802 Memory interleave unit
900 Processor
901 Logic circuit

Claims

1. A processor comprising:
a data holding unit that holds one or more pieces of vector data to be operated on, read from a main storage device, and a vector instruction composed of instruction elements each representing the content of an operation on the i-th data element (i is an integer from 1 to the number of elements of the vector data) of each piece of the vector data; and
an operation unit that executes a series of processes in parallel, using a plurality of arithmetic units each of which completes the series of processes for the operation indicated by the instruction element corresponding to a data element before starting the series of processes for the next data element.

2. The processor according to claim 1, wherein the operation unit further includes an interleave unit that interleaves the one or more pieces of vector data to be operated on, held in the data holding unit, across the plurality of arithmetic units.

3. The processor according to claim 1, wherein the data holding unit is divided into as many divided portions as there are arithmetic units, each divided portion corresponding to one arithmetic unit, and
each arithmetic unit executes the series of processes based on the data elements held in its corresponding divided portion.

4. The processor according to any one of claims 1 to 3, wherein, when the series of processes consists of n phases (n is an integer of 1 or more), the operation unit has n × N of the arithmetic units (N is an integer of 1 or more).

5. The processor according to any one of claims 1 to 4, wherein the data holding unit further holds vector data to be broadcast to each of the arithmetic units, and
each arithmetic unit executes the series of processes further based on the broadcast vector data.

6. A processing method of a processor, wherein, for one or more pieces of vector data to be operated on, read from a main storage device and held in a data holding unit, and a vector instruction composed of instruction elements each representing the content of an operation on the i-th data element (i is an integer from 1 to the number of elements of the vector data) of each piece of the vector data,
a series of processes is executed in parallel, using a plurality of arithmetic units each of which completes the series of processes for the operation indicated by the instruction element corresponding to a data element before starting the series of processes for the next data element.

7. The processing method of a processor according to claim 6, wherein the series of processes is executed in parallel by interleaving the one or more pieces of vector data to be operated on and the vector instruction, held in the data holding unit, across the plurality of arithmetic units.