JPH0434191B2

JPH0434191B2 -

Info

Publication number: JPH0434191B2
Application number: JP61068387A
Authority: JP
Inventors: Masami Takahata; Juji Aoki
Original assignee: Hitachi Ltd; Hitachi Computer Engineering Co Ltd
Current assignee: Hitachi Ltd; Hitachi Computer Engineering Co Ltd
Priority date: 1986-03-28
Filing date: 1986-03-28
Publication date: 1992-06-05
Also published as: JPS62226275A

Description

DETAILED DESCRIPTION OF THE INVENTION

[Field of Industrial Application]

The present invention relates to a vector processing device, and in particular to a vector processing device suited to high-speed access to sparse matrix data when processing matrix-format data.

[Conventional technology]

In the large-scale matrix calculations that arise in scientific and engineering computation on a vector processing device, the data handled can be so large that it does not fit in main memory in its original form. In matrix calculations on sparse matrices, moreover, only a small number of nonzero elements contribute to the result. For this reason, matrix data is stored in main memory in compressed form and expanded into a vector register when it is transferred for calculation. This relieves the memory capacity constraint while speeding up the matrix calculation.

In a vector processing device of this kind, a mask register is provided to indicate the validity of the expanded data in a vector register. Each bit of the mask register indicates the validity of the corresponding element of the vector register: when the mask bit VMR(n) corresponding to the nth element of the vector register is "1" the element is valid, and when it is "0" the element is invalid.

Compressed data in main memory is addressed by an address register VAR, which holds the start address, and an increment register VIR, which holds the interval value, as follows. The address of the nth element is

$$a_n = \mathrm{VAR} + \left(\sum_{i=0}^{n-1} \mathrm{VMR}(i)\right) \times \mathrm{VIR} \qquad (1)$$

Because only the valid elements are packed into main memory, the address of the nth element is the number of "1" mask bits preceding that element, multiplied by the increment, plus the start address.
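Equation (1) can be checked with a short Python sketch. It models the mask register as a list of 0/1 integers; the function name `compressed_addresses` and the sample values are illustrative assumptions, not part of the patent.

```python
def compressed_addresses(var, vir, vmr):
    """Return a_n = VAR + (count of 1 bits in VMR(0..n-1)) * VIR for every n."""
    addresses = []
    valid_before = 0  # Number of valid elements seen before position n.
    for bit in vmr:
        addresses.append(var + valid_before * vir)
        valid_before += bit
    return addresses


# Hypothetical start address 1000 and interval 8.
print(compressed_addresses(1000, 8, [1, 0, 1, 1, 0, 1, 0, 0]))
```

In this model an invalid element receives the same address as the next valid element, which is harmless because no memory request is issued for it.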

Transfers between compressed data in main memory and expanded data in a vector register are controlled by the contents of the mask register. For the nth element, a load from main memory to the vector register, or a store from the vector register to main memory, is performed when the mask bit VMR(n) is "1". When it is "0", neither a load nor a store is performed. Memory requests are thus controlled by the mask bits, and the transfer proceeds sequentially from the first element.

In a vector processing device of this configuration, address calculation has conventionally been performed sequentially as follows:

$$a_n = a_{n-1} + \mathrm{VMR}(n-1) \times \mathrm{VIR} \qquad (2)$$

Here $a_n$ is the main memory address of the nth element and $a_{n-1}$ is that of the immediately preceding (n-1)th element. Data transfer was therefore performed sequentially through a single load/store pipeline. A related device of this type is described, for example, in Japanese Patent Application Laid-Open No. 58-214963.
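The serial dependency in the conventional scheme can be made concrete with a small Python sketch. It assumes the mask is a list of 0/1 integers; the name `sequential_addresses` is hypothetical. Each address is derived from the one before it, so only one address is produced per step.

```python
def sequential_addresses(var, vir, vmr):
    """Generate addresses with a_n = a_(n-1) + VMR(n-1) * VIR, starting from a_0 = VAR."""
    addresses = []
    address = var
    for n in range(len(vmr)):
        if n > 0:
            # The new address depends on the previous address and the previous mask bit.
            address += vmr[n - 1] * vir
        addresses.append(address)
    return addresses
```

Because every iteration reads the result of the previous one, the loop cannot be split across several pipelines as written.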

[Problems to be Solved by the Invention]

The conventional technology described above gives no consideration to compatibility with an element-parallel multiple pipeline scheme, so the multiplexed pipelines are not used effectively. Recent vector processing devices are equipped with a plurality of load/store pipelines, which transfer several elements of the data in parallel at once, to increase the transfer capacity from main memory to the vector registers. In compression/expansion type data transfer, however, the addresses to be accessed in main memory are generated only sequentially, under the control of the mask bits that indicate the validity of the data. As a result, only one of the installed load/store pipelines could be used for the data transfer.

An object of the present invention is to provide a vector processing device that, with a plurality of load/store pipelines installed, calculates the main memory addresses of a plurality of elements in parallel and performs compression/expansion type data transfer in element-parallel fashion.

[Means to solve the problem]

This object is achieved by configuring the address calculation circuit of each pipeline so that the main memory addresses of the elements to be processed by the pipelines can be calculated in parallel.

The address calculation circuit of each pipeline consists of: a counter that recognizes the element number to be processed by the pipeline, selects mask bits accordingly, and counts the valid mask bits; a multiple generation circuit that generates a multiple of the element interval value (the increment) based on the counter contents; and a multi-input adder that obtains the main memory address of the element to be processed by the pipeline in the current stage by adding the generated multiple to the main memory address of the element the pipeline processed in the preceding stage.

[Operation]

For each pipeline of an element-parallel multiple pipeline to calculate addresses in parallel, the following calculation must be performed for j = 0, 1, ..., M-1:

$$a_{n+j} = a_{n+j-M} + \left(\sum_{i=n+j-M}^{n+j-1} \mathrm{VMR}(i)\right) \times \mathrm{VIR} \qquad (3)$$

Here M is the number of pipelines, $a_n$ to $a_{n+M-1}$ are the main memory addresses of the elements to be processed in the current stage, and $a_{n-M}$ to $a_{n-1}$ are those of the elements processed in the preceding stage. VMR(i) is the mask bit corresponding to the ith element, and VIR is the element interval value, or increment. For a single pipeline, the difference between the main memory address $a_{n+j-M}$ of the (n+j-M)th element, processed in the preceding stage, and the main memory address of the (n+j)th element, to be processed in the current stage, is the multiple of the increment $\left(\sum_{i=n+j-M}^{n+j-1} \mathrm{VMR}(i)\right) \times \mathrm{VIR}$. This multiple is formed as the sum of two counts. The first is the number of valid mask bits $\sum_{i=n+j-M}^{n-1} \mathrm{VMR}(i)$ among the M mask bits of the preceding stage, taken over the element processed by this pipeline and the elements after it. The second is the number of valid mask bits $\sum_{i=n}^{n+j-1} \mathrm{VMR}(i)$ among the M mask bits of the current stage, taken over the elements before the one this pipeline processes.

Accordingly, in the address calculation circuit of each pipeline, the counter counts the valid bits described above, the multiple generation circuit generates a multiple of the increment VIR based on the count, and the adder adds it to the main memory address $a_{n+j-M}$ of the (n+j-M)th element, which the pipeline processed in the preceding stage. This yields the main memory address $a_{n+j}$ of the (n+j)th element, which the pipeline is to process in the next stage. In this way the main memory addresses of a plurality of elements are obtained in parallel across the pipelines.
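The stage-wise update described above, in which each pipeline adds a mask count times VIR to its own previous address, can be sketched in Python as follows. This is a software model under stated assumptions: the mask is a list of 0/1 integers, and the function name `parallel_addresses` and the list `held` are hypothetical. The inner loop stands in for M circuits working at the same time, since no iteration reads a value written by another iteration of the same stage.

```python
def parallel_addresses(var, vir, vmr, m=8):
    """Compute all element addresses stage by stage with m independent pipelines."""
    total = len(vmr)
    addresses = [None] * total
    held = [var] * m  # Address held in each pipeline's address calculation circuit.
    for base in range(0, total, m):      # One iteration per pipeline stage.
        for j in range(m):               # Independent work, one item per pipeline.
            k = base + j
            if k >= total:
                break
            first = max(k - m, 0)
            count = sum(vmr[first:k])    # Valid mask bits for elements k-m .. k-1.
            held[j] += count * vir
            addresses[k] = held[j]
    return addresses
```

The result agrees with the prefix-count form of equation (1) for any mask, which is what allows the M addresses of one stage to be produced together.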

When the mask bit corresponding to the element to be processed by a pipeline is "1", a memory request is issued to main memory using the main memory address obtained by the address calculation above, and the load or store is performed. When the mask bit is "0", the address calculation is still performed but the memory request is suppressed; in this case only the main memory address held in the address calculation circuit is updated.

The number of elements remaining to be processed is obtained by subtracting the number of elements processed in one stage, which equals the number of pipelines, from the current number of remaining elements. A detector detects when the number of remaining elements has become non-positive, and the load/store processing ends.

[Embodiment]

The invention is described below with reference to the drawings.

FIG. 1 shows the overall configuration of a vector processing device to which the present invention is applied. In FIG. 1, 1 is the main memory (MS), 2 the vector registers (VR), 3 the mask register (VMR), 4 the load/store pipelines (LS), 5 the arithmetic units (ALU), 6 a data distribution circuit, and 7 a data selection circuit. Each vector register 2 has a capacity of L elements, and eight are provided, denoted VR₀ to VR₇. One mask register 3, of L bits, is provided. One bit of mask register 3 indicates the validity of one element of a vector register 2. A single mask register is used in this embodiment for simplicity, but there may be more than one. There are eight load/store pipelines 4, denoted LS₀ to LS₇, and eight arithmetic units 5, denoted ALU₀ to ALU₇.

Each load/store pipeline sequentially calculates the main memory addresses of the elements of matrix-format data in main memory and issues memory requests to main memory. A main memory address is obtained every stage, the basic unit of pipeline operation, and memory requests are likewise issued at a pitch of one stage. When a load or store instruction is started, the eight load/store pipelines 4, LS₀, LS₁, ..., LS₇, begin load/store processing simultaneously. In the first stage, the main memory addresses $a_0, a_1, \ldots, a_7$ of elements 0, 1, ..., 7 are calculated simultaneously in the eight pipelines 4, and eight memory requests are issued to main memory 1. In the second stage, the main memory addresses $a_8, a_9, \ldots, a_{15}$ of elements 8, 9, ..., 15 are calculated simultaneously in the eight pipelines 4, and eight memory requests are issued to main memory 1. Thereafter, in every stage, the main memory addresses of eight elements are calculated simultaneously in the eight load/store pipelines 4 and eight memory requests are issued to main memory 1. Based on these memory requests, eight elements are transferred between main memory 1 and the vector registers 2.

For a load instruction, the address of the data in main memory is sent from load/store pipeline 4 to main memory 1 via path 8, and the data is read from main memory 1 into load/store pipeline 4 via path 9. The data then enters data distribution circuit 6 via path 10 and is written via path 11 to the vector register 2 specified by the instruction.

For a store instruction, the data in the vector register 2 specified by the instruction is read out to data selection circuit 7 via path 12 and enters load/store pipeline 4 via path 15. Load/store pipeline 4 supplies the main memory address; the address is sent to main memory 1 on path 8 and the data on path 16.

Between being loaded into a vector register 2 and being stored, the data is operated on. An operation is performed among the three registers specified by the arithmetic instruction. Two of the three vector registers hold the operands, which are read out through data selection circuit 7 and input to arithmetic unit 5 via path 13. The result is output from arithmetic unit 5 and written to a vector register 2 via path 14 and data distribution circuit 6. When the arithmetic instruction is a mask generation instruction, which generates a mask bit for each element, the resulting mask bits are written to mask register 3.

Load/store processing in load/store pipelines 4 is controlled, element by element, by the mask bits of mask register 3. The contents of mask register 3 are therefore read out each stage in groups equal to the number of pipelines, that is, 8 bits at a time, and distributed to the pipelines via path 17.

FIGS. 2 and 3 show the transfer operations between compressed data in main memory and expanded data in a vector register.

FIG. 2 shows the processing that expands l compressed data items $a_0, a_1, \ldots, a_{l-1}$ in main memory and loads them into a vector register. Before the load, L mask bits for the L elements of vector register 2 are set in mask register 3. The number of "1" mask bits, which mark valid elements, equals the number of elements of compressed data in main memory 1. Starting from the first element, the data in main memory 1 is loaded in order into the element positions of vector register 2 whose mask bit is "1". Nothing is loaded into element positions whose mask bit is "0".
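The expanding load of FIG. 2 can be modelled in a few lines of Python. The sketch assumes main memory is a dictionary from address to value and the vector register is a list; the name `expand_load` and the `old_register` parameter are hypothetical and only serve to show that masked-off positions keep their previous contents.

```python
def expand_load(memory, var, vir, vmr, old_register=None):
    """Load compressed data into the register positions whose mask bit is 1."""
    if old_register is None:
        register = [None] * len(vmr)
    else:
        register = list(old_register)
    address = var
    for n, bit in enumerate(vmr):
        if bit:
            register[n] = memory[address]  # Memory request issued only for mask bit 1.
            address += vir                 # Advance to the next compressed element.
    return register


memory = {100: "a0", 104: "a1", 108: "a2"}
print(expand_load(memory, 100, 4, [0, 1, 1, 0, 1]))
```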

FIG. 3 shows the processing that compresses L expanded data items $a_0, a_1, \ldots, a_{L-1}$ in vector register 2 and stores them in main memory 1. Before the store, L mask bits for the L elements of vector register 2 are set in mask register 3. Conversely to a load, the data at the element positions of vector register 2 whose mask bit is "1" is stored to main memory 1 in order from the beginning.
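The compressing store of FIG. 3 is the mirror image of the load. The following Python sketch again assumes a dictionary as main memory and a list as the vector register; the name `compress_store` is hypothetical.

```python
def compress_store(memory, var, vir, vmr, register):
    """Store the register elements whose mask bit is 1 to consecutive compressed slots."""
    address = var
    for bit, value in zip(vmr, register):
        if bit:
            memory[address] = value  # Memory request issued only for mask bit 1.
            address += vir           # Next compressed slot in main memory.
    return memory


print(compress_store({}, 100, 4, [0, 1, 1, 0, 1], [0.0, 2.5, 1.5, 0.0, 4.0]))
```

Elements whose mask bit is 0 leave no trace in memory, which is how the zero entries of a sparse matrix are dropped.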

FIG. 4 shows the configuration of the address calculation circuit of one load/store pipeline; address calculation in compression/expansion type load/store processing is explained with reference to it. The figure shows the configuration for multiplicity M = 8 in equation (3). In this case, eight elements of the data are loaded or stored in every stage, the unit of pipeline operation. Each pipeline is given a value from 0 to 7, called the requester number, on signal line 33. From its requester number, each pipeline recognizes the sequence of elements it is to process and operates accordingly. The pipeline with requester number 0 loads or stores elements $a_0, a_8, a_{16}, \ldots$ in order.
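The assignment of elements to pipelines by requester number amounts to a strided sequence. A one-function Python sketch, with the hypothetical name `elements_for_pipeline` and an assumed data length, makes this explicit.

```python
def elements_for_pipeline(requester, length, m=8):
    """Return the element numbers handled by the pipeline with this requester number."""
    # The pipeline handles element `requester` in the first stage, then every m-th element.
    return list(range(requester, length, m))


# Hypothetical data length of 20 elements.
for requester in (0, 3, 7):
    print(requester, elements_for_pipeline(requester, 20))
```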

When instruction decoding circuit 20 decodes a load/store instruction directing data transfer between compressed data in main memory 1 and expanded data in vector register 2, it sets the information that addresses the compressed data in main memory 1 in registers 21, 22, and 24: the start address of the data in address register VAR 21, the interval value of the data in increment register VIR 22, and the length of the data in length register VLR 24. Along with this, a signal that starts the load/store processing is sent on signal line 37, and latch 39 is set to "1". The output of latch 39 opens AND circuit 40, which controls memory requests, and memory requests for the load/store processing begin to be sent.

In synchronization with the load/store processing, 8 mask bits, one per pipeline, are read out in parallel from mask register VMR 3 and set in register 18 via path 17. Mask bits are read every stage, the same unit as the pipeline operation of the load/store processing. The mask bits held in register 18 are transferred to register 19 in the next stage. The outputs of registers 18 and 19 are input to bit selection circuits 35 and 36, which extract only the range of bits determined by the requester number and pass them to counter 23. Counter 23 counts the "1" bits among the selected mask bits and, from the count, controls multiple generation circuits 26 and 27 to generate a multiple, from 0 to 7 times, of increment register VIR 22. Multiple generation circuit 26 generates a multiple of 8, 4, or 0 times, and multiple generation circuit 27 a multiple of 2, 1, 0, or -1 times. Combining the two gives multiples from 0 to 8 times.
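How a mask count maps onto the two multiple generation circuits can be illustrated with a Python sketch. The text does not give the selection logic, so the brute-force search below is an assumption made purely for illustration; the names `COARSE`, `FINE`, and `split_multiple` are hypothetical.

```python
COARSE = (8, 4, 0)      # Factors available from multiple generation circuit 26.
FINE = (2, 1, 0, -1)    # Factors available from multiple generation circuit 27.


def split_multiple(count):
    """Return (coarse, fine) with coarse + fine == count, for counts 0 through 8."""
    for coarse in COARSE:
        for fine in FINE:
            if coarse + fine == count:
                return coarse, fine
    raise ValueError("count must be between 0 and 8")


for count in range(9):
    print(count, split_multiple(count))
```

Allowing the factor -1 in the second circuit is what lets counts such as 3 and 7 be formed from only two terms (4 - 1 and 8 - 1).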

In the first stage of load/store processing, the start address of the data is input from address register VAR 21 through selector 25 to carry save adder CSA 29. At the same time, the multiples of increment register VIR based on the mask count are input from multiple generation circuits 26 and 27 to carry save adder CSA 29. These are added by CSA 29 and the parallel adder PA 30 immediately following it, giving the main memory address of the element processed in the first stage. From the second stage onward, the main memory address of the element processed in the preceding stage is fed back through selector 25 to CSA 29 and used to calculate the main memory address for the current stage. Address calculation from the second stage onward differs only in that the main memory address obtained in the preceding stage is used in place of the contents of address register VAR 21.
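The three-input addition performed by the carry save adder and the parallel adder can be modelled with integer bit operations. The Python sketch below is a generic 3:2 carry-save reduction, offered as an assumption about how such an adder pair behaves rather than a description of the actual circuit; the function names are hypothetical. Python integers act as unbounded two's complement values, so a negative multiple such as (-1) × VIR is handled correctly.

```python
def carry_save_add(x, y, z):
    """Reduce three addends to a sum word and a carry word without carry propagation."""
    sum_word = x ^ y ^ z                               # Bitwise sum, carries ignored.
    carry_word = ((x & y) | (y & z) | (x & z)) << 1    # Carries, shifted to their weight.
    return sum_word, carry_word


def next_address(previous_address, coarse_multiple, fine_multiple):
    """Add the previous address and the two generated multiples of VIR."""
    sum_word, carry_word = carry_save_add(previous_address, coarse_multiple, fine_multiple)
    return sum_word + carry_word  # The final carry-propagating (parallel) addition.


# Hypothetical values: previous address 1000, VIR = 8, multiples 4 * VIR and (-1) * VIR.
print(next_address(1000, 4 * 8, -1 * 8))
```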

FIG. 5 shows an example of address calculation in the load/store pipeline with requester number 3. Suppose the mask bits for the 8 elements starting at $a_{n-8}$ are "10110100" and the mask bits for the 8 elements starting at $a_n$, processed in the next stage, are "01011001". Over the two stages, the pipeline with requester number 3 performs load/store processing for elements $a_{n+3-8}$ and $a_{n+3}$. The main memory address of element $a_{n+3}$ is obtained from the main memory address of element $a_{n+3-8}$ and the mask bits VMR(i) (i = n+3-8 to n+2) as follows. First, bit selection circuit 35 selects the mask bits "10100" corresponding to the five elements from $a_{n+3-8}$ onward, namely $a_{n+3-8}, a_{n+4-8}, a_{n+5-8}, a_{n+6-8}, a_{n+7-8}$, and the number of valid bits among them, $\sum_{i=n+3-8}^{n-1} \mathrm{VMR}(i) = 2$, is obtained. Next, bit selection circuit 36 selects the mask bits "010" corresponding to the three elements before $a_{n+3}$, namely $a_n, a_{n+1}, a_{n+2}$, and the number of valid bits among them, $\sum_{i=n}^{n+2} \mathrm{VMR}(i) = 1$, is obtained. Counter 23 counts both, and from their sum, $\sum_{i=n+3-8}^{n+2} \mathrm{VMR}(i) = 3$, the multiple 3 × VIR is generated. Multiple generation circuit 26 generates 4 × VIR and multiple generation circuit 27 generates (-1) × VIR; these are added to $a_{n+3-8}$ in carry save adder CSA 29 and parallel adder PA 30, giving $a_{n+3}$.

$$\begin{aligned}
a_{n+3} &= a_{n+3-8} + \left(\sum_{i=n+3-8}^{n-1} \mathrm{VMR}(i) + \sum_{i=n}^{n+2} \mathrm{VMR}(i)\right) \times \mathrm{VIR} \\
&= a_{n+3-8} + (2 + 1) \times \mathrm{VIR} \\
&= a_{n+3-8} + 3 \times \mathrm{VIR} \\
&= a_{n+3-8} + 4 \times \mathrm{VIR} + (-1) \times \mathrm{VIR} \qquad (4)
\end{aligned}$$

In FIG. 4, bit selection circuit 35 selects, in addition to the bits used to obtain the mask count, the mask bit corresponding to the element processed by the pipeline. The mask bit position corresponding to the element is determined from the requester number, and that mask bit is extracted and sent to AND circuit 40. AND circuit 40 takes the AND of the memory request issue signal from latch 39 and the mask bit from bit selection circuit 35, and the result is sent to main memory 1 on path 13 as the memory request. When the mask bit for the element to be processed is "1", a memory request is issued and the element is read from or written to main memory 1. When the mask bit is "0", the memory request is suppressed. When a memory request is issued, the main memory address calculated by carry save adder CSA 29 and parallel adder PA 30 is sent to main memory 1 on path 13 together with the memory request.
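The FIG. 5 example for requester number 3, together with the request gating just described, can be reproduced numerically. The Python sketch below represents each group of 8 mask bits as a string; the function name `pipeline_step`, the previous address 1000, and VIR = 8 are illustrative assumptions.

```python
def pipeline_step(previous_address, vir, previous_bits, current_bits, requester,
                  request_enable=True):
    """Model one stage of one pipeline: bit selection, counting, addition, request gating."""
    tail = previous_bits[requester:]   # Preceding stage: this pipeline's element and later ones.
    head = current_bits[:requester]    # Current stage: elements before this pipeline's element.
    count = tail.count("1") + head.count("1")
    address = previous_address + count * vir
    # The request is issued only if the enable latch is set and the element's own mask bit is 1.
    issue = request_enable and current_bits[requester] == "1"
    return tail, head, address, issue


print(pipeline_step(1000, 8, "10110100", "01011001", 3))
```

With the mask strings from the example, the selected bit groups are "10100" and "010", the count is 3, and the address advances by 3 × VIR.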

Before load/store processing begins, the length of the data is stored in length register VLR 24. Each time the load/store processing advances one stage, subtraction circuit 28 subtracts 8 from the contents of length register VLR. The value 8 is used because one stage processes as many elements as there are pipelines, namely 8. Once load/store processing has been performed for all elements, the result of the subtraction is 0 or less. Sign detection circuit 31 then detects the end of processing, end signal 38 resets latch 39 to "0", and AND circuit 40 closes. With AND circuit 40 closed, no further memory requests are sent.
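The termination control can be sketched as a simple loop in Python. The sketch assumes a data length of at least 1 and models the request-enable latch as a boolean; the name `run_stages` is hypothetical.

```python
def run_stages(vlr, m=8):
    """Return the length register value after each stage until it is 0 or less."""
    history = []
    remaining = vlr
    request_enable = True           # Models the latch that gates memory requests.
    while request_enable:
        remaining -= m              # One stage processes m elements.
        history.append(remaining)
        if remaining <= 0:          # Sign detection: every element has been processed.
            request_enable = False
    return history


# Hypothetical data length of 19 elements.
print(run_stages(19))
```

The number of entries in the returned list is the number of stages executed, so a length that is not a multiple of 8 simply ends with a negative value.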

According to this embodiment, data transfer between compressed data in main memory and expanded data in a vector register can be performed by a plurality of load/store pipelines installed in parallel. The data transfer rate can therefore be increased by a factor equal to the number of pipelines.

[Effect of the Invention]

According to the present invention, in a vector processing device having element-parallel multiple load/store pipelines, data transfer between data compressed in main memory and data expanded in a vector register can also be performed in element-parallel form. A data transfer rate proportional to the number of pipelines installed in parallel is thus obtained, just as for an ordinary simple load/store. Matrix calculations on sparse matrices, which require compression/expansion type data access, are consequently processed at high speed.

[Brief explanation of drawings]

FIG. 1 is an overall configuration diagram of a vector processing device according to one embodiment of the present invention; FIG. 2 shows the processing that expands data compressed in main memory and loads it into a vector register; FIG. 3 shows the processing that compresses expanded data in a vector register and stores it in main memory; FIG. 4 shows the configuration of the address calculation circuit of a load/store pipeline; and FIG. 5 shows an example of address calculation.

2: vector register; 3: mask register; 4: load/store pipeline; 21: address register; 22: increment register; 35, 36: bit selection circuits; 23: counter; 26, 27: multiple generation circuits; 29: carry save adder; 30: parallel adder.

Claims

[Claims]

1. A vector processing device comprising a plurality of arithmetic units, a plurality of vector registers, a plurality of vector mask registers indicating the validity of data, a plurality of load/store pipelines, and an interleaved main memory, wherein, in order to perform compression/expansion type data transfer between compressed data in the main memory and expanded data in the vector registers, each pipeline comprises: a bit selection circuit that selects, from the contents of a mask register and based on a requester number given to the pipeline, the bits that identify the data to be processed by the pipeline; a counter that counts the number of valid bits in the selected bit string; a multiple generation circuit that generates a plurality of multiples of the data interval value, that is, the increment; and a multi-input adder that obtains the main memory address of the data to be processed in the next stage by simultaneously adding the main memory address of the data processed by the pipeline in the preceding stage and the plurality of multiples of the increment from the multiple generation circuit; and wherein main memory addresses are generated in parallel in each pipeline and the compression/expansion type data transfer is performed by element-parallel multiple pipeline operation.