JPS62226275A

JPS62226275A - Vector processor

Info

Publication number: JPS62226275A
Application number: JP6838786A
Authority: JP
Inventors: Masami Takahata; 高畑　正美; Yuji Aoki; 雄二青木
Original assignee: Hitachi Ltd; Hitachi Computer Engineering Co Ltd
Current assignee: Hitachi Ltd; Hitachi Computer Engineering Co Ltd
Priority date: 1986-03-28
Filing date: 1986-03-28
Publication date: 1987-10-05
Also published as: JPH0434191B2

Abstract

PURPOSE: To speed up matrix calculations by keeping transfers element-parallel even when they move data between its compressed form in main memory and its expanded form in a vector register. CONSTITUTION: When a load or store instruction is started, the load/store pipelines 4 begin load/store processing simultaneously. For a load instruction, the address of the data in main memory 1 is sent from the pipelines 4 to the memory 1, and the data is read out of the memory 1 into the pipelines 4. The data is then written to a vector register 2 via a data distributing circuit 6. For a store instruction, the data is read out of the register 2 into a data selecting circuit 7 and sent to the pipelines 4. Arithmetic operations are performed on the data between the time it is loaded into the register 2 and the time it is stored.

Description

DETAILED DESCRIPTION OF THE INVENTION

[Field of Industrial Application]

The present invention relates to a vector processing device, and in particular to one suited to high-speed access to sparse matrix data when processing matrix-format data.

[Conventional technology]

In vector processing devices, the large-scale matrix calculations that arise in scientific and engineering computation handle so much data that it may not fit in main memory in its original form. Moreover, in matrix calculations on sparse matrices, only a small number of nonzero elements contribute to the result. For this reason, matrix data is kept in main memory in a compressed format, and a function is provided that expands it onto a vector register when it is transferred for calculation. This removes the storage capacity constraint while speeding up matrix calculations.

In this kind of vector processing device, a mask register is provided to indicate the validity of the expanded data on a vector register. Each bit of the mask register indicates the validity of the corresponding element of the vector register. When the mask bit VMR(n) corresponding to the n-th element of the vector register is '1', the element is valid; when it is '0', the element is invalid.

Compressed data in main memory is addressed by an address register VAR, which holds the start address, and an increment register VIR, which holds the interval value, as follows. The address of the n-th element is

$$\alpha_n = \mathrm{VAR} + \Bigl(\sum_{i=0}^{n-1} \mathrm{VMR}(i)\Bigr) \times \mathrm{VIR} \qquad (1)$$

Because only the elements whose mask bit is '1' are placed, in compressed form, in main memory, the address of the n-th element is the number of '1' mask bits preceding that element multiplied by the increment, plus the start address.
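As a rough illustration of this addressing rule, the following Python sketch computes the address of each element directly from the mask. The function name `compressed_address`, the list representation of the mask, and the example values for VAR and VIR are assumptions made for illustration; none of them appears in the patent.

```python
def compressed_address(mask, var, vir, n):
    """Address of element n: start address plus
    (number of '1' mask bits before element n) times the increment."""
    return var + sum(mask[:n]) * vir


# Hypothetical example: 8 elements, start address 1000, increment 8.
mask = [1, 0, 1, 1, 0, 1, 0, 0]
addrs = [compressed_address(mask, var=1000, vir=8, n=n) for n in range(len(mask))]

# Only elements whose mask bit is 1 actually occupy main memory.
valid_addrs = [a for a, bit in zip(addrs, mask) if bit]
print(valid_addrs)
```

The valid elements land at consecutive multiples of the increment, which is what "compressed" means here; invalid elements share an address with a neighbour but never generate a memory request.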

Transfers between compressed data in main memory and expanded data on a vector register are controlled by the contents of the mask register. For the n-th element, when the mask bit VMR(n) is '1', a load from main memory to the vector register, or a store from the vector register to main memory, is performed. When it is '0', no load or store is performed. Memory requests are thus controlled by the mask bits, and the transfer proceeds sequentially from the first element.

In a vector processing device of this configuration, address calculation has conventionally been performed sequentially, as follows:

$$\alpha_n = \alpha_{n-1} + \mathrm{VMR}(n-1) \times \mathrm{VIR} \qquad (2)$$

Here $\alpha_n$ is the main memory address of the n-th element and $\alpha_{n-1}$ is that of the immediately preceding (n-1)-th element. Data transfer was therefore carried out sequentially through a single load/store pipeline. Related devices of this type are described in, for example, Japanese Unexamined Patent Publication No. 58-214963.
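The sequential scheme described above can be sketched in Python as a loop in which each address depends on the one before it. This is only a software model under assumed names (`sequential_addresses`, a list of 0/1 mask bits); the patent describes hardware, not code.

```python
def sequential_addresses(mask, var, vir):
    """Walk the mask one element at a time, as a single pipeline would."""
    addrs = []
    addr = var  # address of element 0 is the start address
    for bit in mask:
        addrs.append(addr)
        # The next address advances only if the current element is valid.
        addr += bit * vir
    return addrs


print(sequential_addresses([1, 0, 1, 1, 0, 1, 0, 0], var=1000, vir=8))
```

The loop-carried dependency on `addr` is the reason only one pipeline could be kept busy: element n cannot be addressed until element n-1 has been.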

[Problem that the invention seeks to solve]

The conventional technology described above gives no consideration to compatibility with an element-parallel multiple pipeline scheme, so the multiplexed pipelines are not used effectively. Recent vector processing devices, in order to increase their capacity to transfer data from main memory to the vector registers, are equipped with multiple load/store pipelines and a mechanism that transfers several elements of the data in parallel at once.

In the compression/expansion-type data transfer described above, however, the addresses to be accessed in main memory can only be generated sequentially, under the control of the mask bits that indicate the validity of the data. As a result, only one of the several load/store pipelines installed could be used for the data transfer.

An object of the present invention is to provide a vector processing device that, across its multiple load/store pipelines, calculates the main memory addresses of multiple elements in parallel and performs compression/expansion-type data transfer.

[Means for solving problems]

This object is achieved by configuring the address calculation circuit of each pipeline so that the main memory addresses of the elements to be processed by the pipelines can be calculated in parallel.

The address calculation circuit of each pipeline consists of: a counter that recognizes the element number the pipeline is to process, selects mask bits, and counts the valid mask bits; a multiple generation circuit that, based on the contents of the counter, generates a multiple of the element interval value (the increment); and a multi-input adder that adds the generated multiple to the main memory address of the element the pipeline processed in the previous stage, thereby obtaining the main memory address of the element the pipeline is to process in the current stage.

[Operation]

For each pipeline of an element-parallel multiple pipeline to calculate addresses in parallel, the following calculations must be executed in parallel:

$$\alpha_{n+i} = \alpha_{n+i-M} + \Bigl(\sum_{j=n+i-M}^{n+i-1} \mathrm{VMR}(j)\Bigr) \times \mathrm{VIR}, \qquad i = 0, 1, \ldots, M-1 \qquad (3)$$

Here M is the number of multiplexed pipelines, $\alpha_n$ to $\alpha_{n+M-1}$ are the main memory addresses of the elements to be processed in the current stage, and $\alpha_{n-M}$ to $\alpha_{n-1}$ are the main memory addresses of the elements processed in the previous stage. VMR(j) is the mask bit corresponding to the j-th element, and VIR is the element interval value, the increment. Looking at a single pipeline, the main memory address $\alpha_{n+i-M}$ of the (n+i-M)-th element processed in the previous stage and the main memory address $\alpha_{n+i}$ of the element to be processed in the current stage differ by a multiple of VIR. This multiple is formed from two groups of valid mask bits: among the M mask bits corresponding to the elements processed in the previous stage, those for the element this pipeline processed and for the elements with later element numbers; and among the M mask bits for the current stage, those for the elements that precede the element this pipeline is about to process.

Accordingly, in the address calculation circuit of each pipeline, the counter counts these valid bits, the multiple generation circuit generates the corresponding multiple of the increment VIR from the count, and the adder adds it to the main memory address $\alpha_{n+i-M}$ of the (n+i-M)-th element that the pipeline processed in the previous stage. The result is the main memory address $\alpha_{n+i}$ of the (n+i)-th element that the pipeline is to process in the next stage. In this way the main memory addresses of multiple elements are obtained in parallel in the multiple pipelines.
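To make the per-pipeline recurrence concrete, here is a minimal Python sketch that computes each pipeline's addresses independently and checks them against the sequential scheme. The function names, the dictionary used to merge results, and the example mask, start address and increment are all assumptions for illustration; the first-stage rule (start address plus the valid bits preceding the pipeline's own element) follows the description of the embodiment.

```python
def sequential_addresses(mask, var, vir):
    """Reference: one address per step, each depending on the previous one."""
    addrs, addr = [], var
    for bit in mask:
        addrs.append(addr)
        addr += bit * vir
    return addrs


def pipeline_addresses(mask, var, vir, requester, m=8):
    """Addresses produced by one pipeline, which handles elements
    requester, requester + m, requester + 2m, ..."""
    result = {}
    if requester >= len(mask):
        return result
    # First stage: start address plus the valid elements preceding this one.
    addr = var + sum(mask[:requester]) * vir
    result[requester] = addr
    k = requester + m
    while k < len(mask):
        # Later stages: previous address plus the valid bits among the
        # m elements lying between the previous element and this one.
        addr += sum(mask[k - m:k]) * vir
        result[k] = addr
        k += m
    return result


mask = [1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1]
merged = {}
for requester in range(8):  # no pipeline needs another pipeline's result
    merged.update(pipeline_addresses(mask, var=1000, vir=8, requester=requester))
parallel = [merged[k] for k in range(len(mask))]
print(parallel == sequential_addresses(mask, 1000, 8))
```

Each pipeline only ever reads mask bits and its own previous address, which is why the eight calculations can proceed side by side.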

In each pipeline, when the mask bit corresponding to the element to be processed is '1', a memory request is issued to main memory using the main memory address obtained by the address calculation above, and load/store processing is performed.

When the mask bit is '0', the address calculation is still performed but the memory request is suppressed. In this case only the main memory address held in the address calculation circuit is updated.

The number of elements remaining to be processed is obtained by subtracting the number of elements processed in one stage (which equals the number of pipelines) from the current number of remaining elements. A detector detects when the remaining count becomes non-positive, and load/store processing ends.
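The termination rule can be sketched as a simple countdown. The function name `run_stages` and the assumption that the data length is at least 1 are illustrative choices, not taken from the patent.

```python
def run_stages(length, pipelines=8):
    """Count the stages executed before the remaining-element
    counter becomes non-positive. Assumes length >= 1."""
    remaining = length
    stages = 0
    while True:
        stages += 1              # one stage handles `pipelines` elements
        remaining -= pipelines   # subtract the per-stage element count
        if remaining <= 0:       # end condition seen by the detector
            return stages


print(run_stages(16), run_stages(17))
```

Under these assumptions the stage count equals the length divided by the number of pipelines, rounded up.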

[Example]

The present invention is described below with reference to the drawings.

FIG. 1 shows the overall configuration of a vector processing device to which the present invention is applied. In FIG. 1, 1 is the main memory (MS), 2 is a vector register (VR), 3 is a mask register (VMR), 4 is a load/store pipeline (LS), 5 is an arithmetic unit (ALU), 6 is a data distribution circuit, and 7 is a data selection circuit. Each vector register 2 has the capacity to hold L elements, and there are eight of them in total, denoted VR0 to VR7. There is one mask register 3, L bits wide. One bit of the mask register 3 indicates the validity of one element of a vector register 2. For simplicity this embodiment has a single mask register, but there may be more than one. There are eight load/store pipelines 4, denoted LS0 to LS7, and eight arithmetic units 5, denoted ALU0 to ALU7.

Each load/store pipeline successively calculates the main memory address of each element of the matrix-format data in main memory and issues memory requests to main memory. An address calculation result is obtained every stage, the basic unit of pipeline operation, and memory requests are likewise issued at a pitch of one stage. When a load or store instruction is started, the eight load/store pipelines 4, LS0, LS1, ..., LS7, begin load/store processing simultaneously. In the first stage, the main memory addresses $\alpha_0, \alpha_1, \ldots, \alpha_7$ of elements 0, 1, ..., 7 are calculated simultaneously in the eight pipelines 4, and eight memory requests are issued to main memory 1. In the second stage, the main memory addresses $\alpha_8, \alpha_9, \ldots, \alpha_{15}$ of elements 8, 9, ..., 15 are calculated simultaneously in the eight pipelines 4, and eight memory requests are issued to main memory 1. Thereafter, in every stage, the main memory addresses of eight elements are calculated simultaneously in the eight load/store pipelines 4 and eight memory requests are issued to main memory 1. Based on these memory requests, eight elements of data are transferred between main memory 1 and the vector register 2.

For a load instruction, the address of the data in main memory is sent from the load/store pipeline 4 to main memory 1 via path 8, and the data is read out of main memory 1 into the load/store pipeline 4 via path 9. The data then enters the data distribution circuit 6 via path 10 and is written, via path 11, to the vector register 2 specified by the instruction.

For a store instruction, the data in the vector register 2 specified by the instruction is read out to the data selection circuit 7 via path 12 and enters the load/store pipeline 4 via path 15. The load/store pipeline 4 supplies the main memory address; the address is placed on path 8 and the data on path 16, and both are sent to main memory 1.

Operations are performed on the data between the time it is loaded into a vector register 2 and the time it is stored. An operation is carried out among the three registers specified by the operation instruction. Two of the three vector registers hold the operands, which are read out through the data selection circuit 7 and input to the arithmetic unit 5 via path 13.

The operation result is output from the arithmetic unit 5 and written to a vector register 2 via path 14 and the data distribution circuit 6. When the operation instruction is a mask generation instruction, which generates a mask bit for each element, the mask bits obtained as the result are written to the mask register 3.

Load/store processing in the load/store pipelines 4 is controlled, element by element, by the mask bits of the mask register 3. The mask bits are therefore distributed to each pipeline via path 17.

Next, FIGS. 2 and 3 show the transfer operations between compressed data in main memory and expanded data on a vector register.

FIG. 2 shows the process of loading $l$ compressed data items $\alpha_0, \alpha_1, \ldots, \alpha_{l-1}$ from main memory into a vector register in expanded form. Before the load, L mask bits, one for each of the L elements of the vector register 2, are set in the mask register 3. The number of '1' mask bits, which mark valid elements, equals $l$, the number of elements of compressed data in main memory 1. The data in main memory 1 is loaded, in order from the first element, into the element positions of the vector register 2 whose mask bits are '1'. Nothing is loaded into element positions whose mask bits are '0'.
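The expanding load can be modelled in a few lines of Python. The name `expand_load`, the use of lists for memory and the register, and the placeholder value `'-'` for untouched register positions are assumptions for illustration only.

```python
def expand_load(compressed, mask, register):
    """Copy compressed items, in order, into the register positions
    whose mask bit is 1; other positions keep their old contents."""
    out = list(register)
    src = 0  # index of the next compressed item to consume
    for pos, bit in enumerate(mask):
        if bit:
            out[pos] = compressed[src]
            src += 1
    return out


mask = [1, 0, 1, 1, 0, 1, 0, 0]
loaded = expand_load(["a0", "a1", "a2", "a3"], mask, ["-"] * 8)
print(loaded)
```

The sketch assumes, as the text states, that the number of 1 bits in the mask equals the number of compressed items.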

FIG. 3 shows the process of compressing the L expanded data items $\alpha_0, \alpha_1, \ldots, \alpha_{L-1}$ in the vector register 2 and storing them in main memory 1. Before the store, L mask bits, one for each of the L elements of the vector register 2, are set in the mask register 3. Conversely to a load, the data at the element positions of the vector register 2 whose mask bits are '1' is stored in main memory 1 in order from the first.
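The compressing store is the mirror image of the expanding load and can be sketched as a filter over the register. The name `compress_store` and the list representation are illustrative assumptions.

```python
def compress_store(register, mask):
    """Return, in order, the register elements whose mask bit is 1;
    these are the items written contiguously to main memory."""
    return [value for value, bit in zip(register, mask) if bit]


mask = [1, 0, 1, 1, 0, 1, 0, 0]
stored = compress_store(["a0", "-", "a1", "a2", "-", "a3", "-", "-"], mask)
print(stored)
```

With the same mask, storing a register and then loading it back reproduces the valid positions exactly, which is the round trip sparse matrix code relies on.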

FIG. 4 shows the configuration of the address calculation circuit of one load/store pipeline, and is used here to explain the address calculation in compression/expansion-type load/store processing. The figure shows the configuration for multiplicity M = 8 in equation (3). In this case eight elements of the data are loaded or stored in each stage, the unit of pipeline operation. Each pipeline is given a value from 0 to 7, called the requester number, over signal line 33. From its requester number each pipeline identifies the sequence of elements it is to process and operates accordingly. In the pipeline with requester number 0, elements $\alpha_0, \alpha_8, \alpha_{16}, \ldots$ are loaded or stored in turn.
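The assignment of elements to pipelines and stages can be written down directly. The helper names below are hypothetical, and the sketch assumes the interleaving described in the text: pipeline i takes elements i, i+8, i+16, and so on.

```python
def elements_of_pipeline(requester, length, m=8):
    """Element numbers handled by the pipeline with this requester number."""
    return list(range(requester, length, m))


def elements_of_stage(stage, length, m=8):
    """Element numbers handled, across all pipelines, in one stage."""
    start = stage * m
    return list(range(start, min(start + m, length)))


print(elements_of_pipeline(0, 24))
print(elements_of_stage(1, 24))
```

Because the mapping depends only on the requester number and the stage index, each pipeline can identify its own elements without communicating with the others.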

When the instruction decoding circuit 20 decodes a load/store instruction that directs data transfer between compressed data in main memory 1 and expanded data in a vector register 2, it sets the information that addresses the compressed data in main memory 1 into registers 21, 22 and 24. The start address of the data is set in the address register VAR 21, the data interval value in the increment register VIR 22, and the data length in the length register VLR 24. Together with this setup, a signal that starts load/store processing is sent over signal line 37, and latch 39 is set to '1'. The output of latch 39 opens the AND circuit 40, which controls memory requests, and memory requests for load/store processing begin to be sent.

In step with load/store processing, eight mask bits, as many as there are pipelines, are read out in parallel from the mask register VMR 3 and set in register 18 via bus 17. Mask bits are read out once per stage, the same unit as the pipeline operation of load/store processing. The mask bits that entered register 18 are transferred to register 19 in the next stage. The outputs of registers 18 and 19 are input to the bit selection circuits 35 and 36, where only the bits in a range determined by the requester number are extracted and passed to counter 23. Counter 23 counts the '1' bits among the selected mask bits and, from that count, controls the multiple generation circuits 26 and 27 to generate a multiple, from 0 to 7 times, of the increment register VIR 22. The multiple generation circuit 26 generates a multiple of 8, 4 or 0, and the multiple generation circuit 27 a multiple of 2, 1, 0 or -1. Combining the two yields the multiples from 0 to 7.
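One way to picture how the two multiple generation circuits cooperate is to split a count into one term from the set {8, 4, 0} and one from {2, 1, 0, -1}. The Python sketch below searches for such a pair; the function name `split_multiple` and the search order are assumptions, and the actual selection logic of the hardware is not described in the text. The sketch also accepts a count of 8, which the listed terms can represent, although the text states the range as 0 to 7.

```python
COARSE = (8, 4, 0)      # multiples attributed to generation circuit 26
FINE = (2, 1, 0, -1)    # multiples attributed to generation circuit 27


def split_multiple(count):
    """Return (coarse, fine) with coarse + fine == count."""
    for coarse in COARSE:
        for fine in FINE:
            if coarse + fine == count:
                return coarse, fine
    raise ValueError(f"count {count} cannot be formed")


for count in range(9):
    print(count, split_multiple(count))
```

Splitting the multiple this way means the adder receives two simple shifted or negated copies of VIR rather than the output of a general multiplier.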

In the first stage of load/store processing, the start address of the data is input from the address register VAR 21 through selector 25 to the carry-save adder CSA 29. At the same time, multiples of the increment register VIR based on the mask count are input from the multiple generation circuits 26 and 27 to the carry-save adder CSA 29. These are summed by the CSA 29 and the parallel adder PA 30 immediately following it, giving the main memory address of the element processed in the first stage. From the second stage on, the main memory address of the element processed in the previous stage is fed back through selector 25 into the CSA 29 and used to calculate the main memory address of the element processed in the current stage. Address calculation from the second stage on differs only in that it uses the main memory address obtained in the previous stage in place of the contents of the address register VAR 21.

FIG. 5 shows an example of address calculation in the load/store pipeline with requester number 3. Suppose that the mask bits for the eight elements starting at element $\alpha_{n-8}$ are '10110100', and that the mask bits for the eight elements starting at element $\alpha_n$, which are processed in the following stage, are '01011001'. Over these two stages the pipeline with requester number 3 performs load/store processing on elements $\alpha_{n+3-8}$ and $\alpha_{n+3}$. The main memory address of element $\alpha_{n+3}$ is obtained from the main memory address of element $\alpha_{n+3-8}$ and the mask bits VMR(i) (i = n+3-8 to n+2) as follows. First, the bit selection circuit 35 selects the mask bits '10100' corresponding to the five elements from $\alpha_{n+3-8}$ onward, namely $\alpha_{n+3-8}, \alpha_{n+4-8}, \alpha_{n+5-8}, \alpha_{n+6-8}, \alpha_{n+7-8}$. Next, the bit selection circuit 36 selects the mask bits '010' corresponding to the three elements preceding $\alpha_{n+3}$, namely $\alpha_n, \alpha_{n+1}, \alpha_{n+2}$. The selected bits contain three '1's in total, so the multiple generation circuit 26 generates 4 × VIR and the multiple generation circuit 27 generates (-1) × VIR. Both are added to $\alpha_{n+3-8}$ in the carry-save adder CSA 29 and the parallel adder PA 30, giving $\alpha_{n+3}$.

$$\begin{aligned}
\alpha_{n+3} &= \alpha_{n+3-8} + \Bigl(\sum_{i=n+3-8}^{n+2} \mathrm{VMR}(i)\Bigr) \times \mathrm{VIR} \\
&= \alpha_{n+3-8} + (2 + 1) \times \mathrm{VIR} \\
&= \alpha_{n+3-8} + 3 \times \mathrm{VIR} \\
&= \alpha_{n+3-8} + 4 \times \mathrm{VIR} + (-1) \times \mathrm{VIR}
\end{aligned} \qquad (4)$$

In FIG. 4, the bit selection circuit 35 selects not only the bits used to obtain the mask count but also the mask bit corresponding to the element processed in this pipeline. It determines, from the requester number, the mask bit position corresponding to the element being processed, extracts that mask bit, and sends it to the AND circuit 40.
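The worked example can be checked with a short Python sketch. It reads the two mask strings as 10110100 and 01011001 and takes the requester number as 3, which is the reading under which the selected bit groups '10100' and '010' and the total of three valid bits agree with the example; the function name `stage_inputs` and its return layout are hypothetical.

```python
def stage_inputs(prev_mask, curr_mask, requester):
    """For one pipeline and one stage, return the two selected bit groups,
    the count of valid bits, and the pipeline's own mask bit."""
    tail = prev_mask[requester:]   # previous stage: own element and later ones
    head = curr_mask[:requester]   # current stage: elements before own element
    count = sum(tail) + sum(head)  # multiple of VIR to add to the old address
    own_bit = curr_mask[requester] # decides whether a memory request is issued
    return tail, head, count, own_bit


prev_mask = [1, 0, 1, 1, 0, 1, 0, 0]   # elements n-8 .. n-1
curr_mask = [0, 1, 0, 1, 1, 0, 0, 1]   # elements n .. n+7
tail, head, count, own_bit = stage_inputs(prev_mask, curr_mask, requester=3)
print(tail, head, count, own_bit)
```

Here the count of 3 is what gets expressed as 4 × VIR plus (-1) × VIR, and the pipeline's own bit being 1 means the request for element n+3 would go out.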

In the AND circuit 40, the memory request issue signal from latch 39 is ANDed with the mask bit from the bit selection circuit 35, and the result is sent to main memory 1 on bus 16 as a memory request. When the mask bit corresponding to the element to be processed is '1', a memory request is issued and the element is read from or written to main memory 1. When the mask bit is '0', the memory request is suppressed.

When a memory request is issued, the main memory address calculated by the carry-save adder CSA 29 and the parallel adder PA 30 is placed on bus 13 and sent to main memory 1 together with the memory request.

Before load/store processing begins, the data length is stored in the length register VLR 24. Each time load/store processing advances by one stage, the subtraction circuit 28 subtracts 8 from the contents of the length register VLR. The amount is 8 because one stage of load/store processing handles as many elements at once as there are pipelines, namely 8.

Once load/store processing has been carried out for all elements, the subtraction result is 0 or less. The sign detection circuit 31 then detects the end of processing, and the end signal 38 resets latch 39 to '0', closing the AND circuit 40.

With the AND circuit 40 closed, no further memory requests are sent.

According to this embodiment, data transfer between compressed data in main memory and expanded data in a vector register can be carried out by multiple load/store pipelines installed in parallel.

The data transfer rate can therefore be increased in proportion to the number of pipelines.

[Effect of the invention]

According to the present invention, in a vector processing device equipped with element-parallel multiple load/store pipelines, data transfer between data compressed in main memory and data expanded on a vector register can also be carried out in element-parallel form. As with an ordinary simple load/store, a data transfer rate corresponding to the number of pipelines installed in parallel can therefore be obtained. Matrix calculations on sparse matrices, which require compression/expansion-type data access, are thereby processed at high speed.

[Brief explanation of drawings]

FIG. 1 is an overall configuration diagram of a vector processing device according to one embodiment of the present invention; FIG. 2 shows the process of loading data compressed in main memory into a vector register in expanded form; FIG. 3 shows the process of compressing expanded data in a vector register and storing it in main memory; FIG. 4 shows the configuration of the address calculation circuit of a load/store pipeline; and FIG. 5 shows an example of address calculation.

2: vector register; 3: mask register; 4: load/store pipeline; 21: address register; 22: increment register; 35, 36: bit selection circuits; 23: counter; 26, 27: multiple generation circuits; 29: carry-save adder; 30: parallel adder.

Claims

[Claims]

1. A vector processing device comprising a plurality of arithmetic units, a plurality of vector registers, a plurality of vector mask registers that indicate data validity, a plurality of load/store pipelines, and an interleaved main memory, wherein, in order to transfer data between compressed data in main memory and expanded data on a vector register, each pipeline is provided with: a bit selection circuit that selects a bit string from the contents of the mask register based on a requester number given to the pipeline to identify the data it is to process; a counter that counts the number of valid bits in the selected bit string; a multiple generation circuit that generates a plurality of sets of multiples of the increment; and a multi-input adder that simultaneously adds the main memory address of the data processed by the pipeline in the previous processing stage and the plurality of sets of increment multiples to calculate the main memory address of the data to be processed in the next stage; the compression/expansion-type data transfer being performed by element-parallel multiple pipeline operation.