JPH0332829B2

Info

Publication number: JPH0332829B2
Application number: JP58151327A
Authority: JP
Inventors: Hajime Matsumoto
Original assignee: Nippon Electric Co Ltd
Current assignee: NEC Corp
Priority date: 1983-08-19
Filing date: 1983-08-19
Publication date: 1991-05-14
Also published as: JPS6043775A

Description

[Detailed description of the invention]

Technical field

The present invention relates to a data processing device that transfers data in main memory to vector registers and performs vector operations using the transferred data.

Prior art

Referring to FIG. 1, in a conventional data processing device of this type, the data of each element is read sequentially from main memory 1 to memory control device 2 via two access pipelines 6 and 7, two elements per machine cycle on each pipeline, and is set in a vector register in vector register section 3. Arithmetic pipeline section 4 or 5 then operates on the data of each element of this vector register sequentially, two elements per machine cycle.
For example, suppose vector data A, B and C in main memory each consist of 32 elements: A(0), A(1), ..., A(30) and A(31); B(0), B(1), ..., B(30) and B(31); and C(0), C(1), ..., C(30) and C(31). The vector operation C = A + B is then executed with four vector instructions using vector registers VR0, VR1 and VR2:

Instruction (1): VR0 ← A
Instruction (2): VR1 ← B
Instruction (3): VR2 ← VR0 + VR1
Instruction (4): C ← VR2

Instruction (1) sets the 32 elements of vector data A in main memory 1 into vector register VR0 in vector register section 3 via access pipeline 6, and instruction (2) sets the 32 elements of vector data B in main memory 1 into vector register VR1 in vector register section 3 via access pipeline 7. Instruction (3) reads data from the two vector registers VR0 and VR1 in vector register section 3, performs the addition in arithmetic pipeline section 4, and sets the sum in vector register VR2 in vector register section 3. Instruction (4) reads the 32 elements of vector register VR2 in vector register section 3 and stores them in main memory 1 as vector data C via access pipeline 6.

In general, the cycle time of the memory devices used in main memory 1 is longer than the machine cycle time, and a ratio of several times is not unusual, so main memory is often divided into a number of banks.

For example, referring to FIG. 2, main memory 1 is divided into four modules 11a to 11d, each of which can read or write one vector element per machine cycle. Each module consists of four banks 12a to 12d, each of which can read or write one vector element every four machine cycles. The banks are numbered #0 to #15, and bank #i holds the data whose address leaves a remainder of i when divided by 16.

For vector data, the difference between the addresses at which adjacent elements are stored in main memory is called the inter-element distance. The inter-element distance is not necessarily 1. For example, if a matrix M(i,j) of 35 rows and 35 columns is stored in main memory in the row direction, in the order M(0,0), M(1,0), ..., M(34,0), M(0,1), M(1,1), ..., M(34,1), M(0,2), ..., then the column vector M(0,j), M(1,j), ..., M(34,j) has an inter-element distance of 1, whereas the row vector M(i,0), M(i,1), ..., M(i,34) has an inter-element distance of 35.

Referring to FIG. 3, suppose vector A has an inter-element distance of 1, vector B has an inter-element distance of 35, and A(0) and B(0) are both stored in MB#0. Then A(1) is stored in MB#1, A(2) in MB#2, ..., A(15) in MB#15, A(16) in MB#0, ..., and A(31) in MB#15, while B(1) is stored in MB#3, B(2) in MB#6, ..., B(6) in MB#2, ..., and B(31) in MB#13. FIG. 3 shows how the elements access main memory 1 in this case.

First, instruction (1) is issued and elements are read two per machine cycle: A(0) and A(1) from MB#0 and MB#1, A(2) and A(3) from MB#2 and MB#3, and so on. MB#0 and MB#1 are in use from time 0 to time 3, and MB#2 and MB#3 are in use from time 1 to time 4.
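
The bank numbers quoted for FIG. 3 follow from the rule that bank #i holds the addresses whose remainder modulo 16 is i. The Python sketch below is not part of the patent; it only checks that arithmetic. The function name `bank_of` and the choice of address 0 for A(0) and B(0) are assumptions made for illustration (the text states only that both elements lie in MB#0).

```python
NUM_BANKS = 16  # banks #0 to #15, as described for FIG. 2


def bank_of(index, distance, base=0, num_banks=NUM_BANKS):
    """Return the bank number that holds element `index` of a vector.

    `distance` is the inter-element distance. `base` is the address of
    element 0; it is assumed to be 0 here, which places element 0 in MB#0.
    """
    return (base + index * distance) % num_banks


# Vector A has inter-element distance 1, vector B has distance 35.
a_banks = [bank_of(i, 1) for i in range(32)]
b_banks = [bank_of(i, 35) for i in range(32)]

print(a_banks[15], a_banks[16], a_banks[31])           # 15 0 15
print(b_banks[1], b_banks[2], b_banks[6], b_banks[31])  # 3 6 2 13
```

Because 35 mod 16 is 3, consecutive elements of B advance three banks at a time, which is why B's accesses run into banks that A's accesses have left busy.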

Following instruction (1), instruction (2) is issued to read B(0) and B(1) from main memory 1. B(0) and B(1) must access MB#0 and MB#3 respectively. Because MB#3 is in use by instruction (1) from time 1 to time 4, the access for B(0) and B(1) takes place between time 5 and time 8. Because MB#9 is in use by instruction (1) until time 7, the access for B(2) and B(3) takes place between time 8 and time 11. Consequently, in the 10 machine cycles from time 5 to time 14, for example, accesses are started for only 24 elements, A(10) to A(27) and B(0) to B(5). Starting four element accesses in every machine cycle would start 40 accesses in the same period, so the efficiency is only 60%.

Thus conventional data processing devices of this type have the drawback that memory access efficiency drops markedly when accesses with different inter-element distances are performed simultaneously.

Object of the invention

The object of the present invention is to eliminate the above drawback and to provide a data processing device with high memory access efficiency by reducing the time that memory accesses spend waiting on busy memory banks.

Structure of the invention

The device of the present invention is a data processing device having a main memory, a vector register section having a plurality of vector registers, an access pipeline section that transfers data between the main memory and the vector register section, and an arithmetic pipeline section that performs operations on the elements of the vector registers. The main memory is connected to one set of memory access ports of the access pipeline section, through which data transfers corresponding to one vector register are executed serially at a rate of 2m elements per machine cycle. The vector register section is connected to two sets of vector access ports of the access pipeline section, through which data transfers, each corresponding to one vector register, are executed in parallel at a rate of m elements per machine cycle. The access pipeline section is made up of two sets of buffers; the data transferred through the memory access ports is assigned alternately, element by element, to one of the two sets of buffers; and the two sets of vector access ports are connected to the two sets of buffers by a 2×2 crossbar.

The present invention will now be described in detail with reference to the drawings.

Referring to FIG. 4, one embodiment of the present invention comprises main memory 1, memory control device 2, vector register section 3, addition pipeline section 4, multiplication pipeline section 5 and access pipeline section 8. Main memory 1 and memory control device 2 are connected by four read lines and four write lines. Memory control device 2 controls memory accesses from the central processing unit (CPU), the input/output processor (IOP) and access pipeline section 8, and is connected to access pipeline section 8 by four read lines and four write lines. Access pipeline section 8 and vector register section 3 are connected through two ports, each having two read lines and two write lines. Addition pipeline section 4 and multiplication pipeline section 5 each receive two sets of two operand lines from vector register section 3 and return two outputs to vector register section 3.

The transfer rate between main memory 1 and memory control device 2 is 4 words per machine cycle for both reads and writes. The transfer rate between memory control device 2 and access pipeline section 8 is likewise 4 words per machine cycle for both reads and writes. The transfer rate between access pipeline section 8 and vector register section 3 is 2 words per machine cycle per port for both reads and writes. The operation rate of the addition pipeline section and of the multiplication pipeline section is 2 words per machine cycle each.

As shown in FIG. 2, main memory 1 consists of four modules 11a to 11d, and each module consists of four banks 12a to 12d.

Referring to FIG. 5, access pipeline section 8 consists of two buffers, BF0 81 and BF1 82, and a 2×2 crossbar 83. Buffers BF0 81 and BF1 82 are each connected to half of the read lines and half of the write lines of access pipeline section 8, and are also connected to one group of ports of crossbar 83, namely A, B, W and X. The other group of ports of crossbar 83, namely C, D, Y and Z, is connected to the vector register section.

Referring to FIG. 6, the cycle of each bank of main memory is taken to be four machine cycles. The figure shows the state of the bank cycles of main memory 1 when elements A(0), A(1), ..., A(30) and A(31) of vector A are stored in main memory banks MB#0, MB#1, ..., MB#30 and MB#31 respectively, and elements B(0), B(1), ..., B(30) and B(31) of vector B are stored in every third main memory bank, MB#0, MB#3, MB#6, ..., MB#8, MB#11 and MB#14. Time advances in steps of one machine cycle. At time 0, banks MB#0, MB#1, MB#2 and MB#3, which hold A(0), A(1), A(2) and A(3), are accessed and become busy for the four machine cycles from time 0 to time 3. At time 1, banks MB#4, MB#5, MB#6 and MB#7, which hold A(4), A(5), A(6) and A(7), are accessed and become busy for the four machine cycles from time 1 to time 4. Likewise, MB#8 to MB#11 are busy from time 2 to time 5 and MB#12 to MB#15 are busy from time 3 to time 6. At time 4, A(16) to A(19) are accessed. Since MB#0 to MB#3 have by then completed the busy period caused by the preceding access, A(16) to A(19) are accessed without waiting on a busy bank. A(20) to A(31) are likewise accessed without any bank-busy wait.

At time 7 the accesses for all elements of vector A have been made, and at time 8 access to vector B begins. The first four elements of vector B, B(0), B(1), B(2) and B(3), are stored in memory banks MB#0, MB#3, MB#6 and MB#9, but MB#9 is busy from time 6 to time 9 because of the preceding access, so the access to B(0) to B(3) is delayed by two machine cycles and takes place at time 10. After that no bank-busy wait occurs, and at time 17 memory banks MB#5, MB#8, MB#11 and MB#14, which hold B(28) to B(31), are accessed, completing the accesses for all elements of vector B.

FIG. 7 illustrates the buffer operation in access pipeline section 8. Here, time is measured from the moment A(0) to A(3) arrive at access pipeline section 8, which is taken as time 0.

At time 0 the four words A(0) to A(3) arrive at access pipeline section 8; the two words A(0) and A(1) are stored in buffer BF0 and the two words A(2) and A(3) in buffer BF1. At time 1 the four words A(4) to A(7) arrive, so the two words A(4) and A(5) are stored in buffer BF0 and the two words A(6) and A(7) in buffer BF1. In the same way, A(8), A(9), A(12), A(13), ..., A(28) and A(29) are stored in buffer BF0, and A(10), A(11), ..., A(30) and A(31) are stored in buffer BF1. Vector B is stored in the buffers in the same way: at time 10, B(0) and B(1) are stored in buffer BF0 and B(2) and B(3) in buffer BF1; at time 11, B(4) and B(5) are stored in buffer BF0 and B(6) and B(7) in buffer BF1; and at time 17, B(28) and B(29) are stored in buffer BF0 and B(30) and B(31) in buffer BF1. A(0) and A(1) are read from buffer BF0 at time 1 and A(2) and A(3) from buffer BF1 at time 2. Thereafter the elements of vector A are read two words at a time, alternately from buffers BF0 and BF1, and crossbar 83 is controlled so that the elements of vector A are sent through port C of crossbar 83 to vector register section 3 at a rate of 2 words per machine cycle.

While the elements of vector A are being sent to vector register section 3, vector B arrives at access pipeline section 8 and becomes readable from time 11. At time 11 the read port of buffer BF0 is occupied by vector A, so B(0) and B(1) are read from BF0 at time 12. Then B(2) and B(3) are read from buffer BF1 at time 13. Thereafter the elements of vector B are read two words at a time, alternately from buffers BF0 and BF1, and crossbar 83 is controlled so that the elements of vector B are sent through port D of crossbar 83 to vector register section 3 at a rate of 2 words per machine cycle.

Effects of the invention

With the access pipeline section acting as intermediary, data transfers corresponding to one vector register are executed serially with the main memory, while data transfers corresponding to two vector registers are executed in parallel with the vector register section. This configuration has the effect of preventing a reduction in the memory access efficiency of the main memory.
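
The access times given for FIG. 6 can be reproduced with a small bank-busy model. The Python sketch below is an illustration under stated assumptions, not the patent's control logic: it assumes 16 banks with bank number equal to address modulo 16 (as in FIG. 2), vector B occupying every third bank starting from MB#0, a group of four elements being issued only when all four of its banks are free, at most one group per machine cycle, and vector B being started only after vector A. The function name `issue_times` and its parameters are hypothetical.

```python
NUM_BANKS = 16   # assumed: banks #0 to #15, bank = address mod 16
BANK_CYCLE = 4   # a bank stays busy for 4 machine cycles per access
GROUP = 4        # elements requested per machine cycle (2m with m = 2)


def issue_times(vectors, num_banks=NUM_BANKS, bank_cycle=BANK_CYCLE, group=GROUP):
    """Return, per vector, the time at which each group of elements is issued.

    `vectors` is a list of (start_address, distance, length) tuples that are
    processed one after another (serially), as on the memory access port.
    """
    free_at = [0] * num_banks  # first time at which each bank is free again
    next_slot = 0              # earliest time the next group may be issued
    result = []
    for start, distance, length in vectors:
        times = []
        for first in range(0, length, group):
            last = min(first + group, length)
            banks = [(start + distance * i) % num_banks for i in range(first, last)]
            # Wait until the issue slot and every bank of the group are free.
            t = max([next_slot] + [free_at[b] for b in banks])
            for b in banks:
                free_at[b] = t + bank_cycle
            times.append(t)
            next_slot = t + 1
        result.append(times)
    return result


a_times, b_times = issue_times([(0, 1, 32), (0, 3, 32)])
print(a_times)  # [0, 1, 2, 3, 4, 5, 6, 7]
print(b_times)  # [10, 11, 12, 13, 14, 15, 16, 17]
```

Under these assumptions the only wait is the two-cycle delay before B(0) to B(3), caused by MB#9 still being busy from the access to A(24) to A(27); all 64 element accesses are started within 18 machine cycles.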

[Brief explanation of the drawing]

FIG. 1 is a diagram showing the prior art. FIG. 2 is a diagram showing the main memory portion shown in FIGS. 1 and 4. FIG. 3 is a time chart showing the busy state of the memory banks in the prior art. FIG. 4 is a diagram showing an embodiment of the present invention. FIG. 5 is a diagram showing the access pipeline section shown in FIG. 4. FIG. 6 is a time chart of memory bank busy states for explaining the operation of FIG. 4. FIG. 7 is a time chart for explaining the operation of the buffers and crossbar in the access pipeline section.

In FIGS. 1 to 7: 1: main memory; 2: memory control unit; 3: vector register section; 4: addition pipeline section; 5: multiplication pipeline section; 6 to 8: access pipeline sections; 11a to 11d: memory modules; 12a to 12d: memory banks; 81 and 82: buffers; 83: 2×2 crossbar.

Claims

[Claims]

1. A data processing device comprising a main memory, a vector register section having a plurality of vector registers, an access pipeline section that transfers data between the main memory and the vector register section, and an arithmetic pipeline section that performs operations on elements of the vector registers, characterized in that: the access pipeline section has one set of memory access ports and two sets of vector access ports; data transfers corresponding to one vector register of the vector register section are executed serially with the main memory connected to the memory access ports; data transfers corresponding to two vector registers are executed simultaneously with the vector register section connected to the two sets of vector access ports; and one arithmetic pipeline outputs m words of results per machine cycle, one vector access port transfers m words of elements per machine cycle, and the memory access port transfers 2m words of elements per machine cycle.

2. The data processing device according to claim 1, wherein the access pipeline section comprises two sets of buffers, the data transferred through the memory access port is assigned alternately, element by element, to one of the two sets of buffers, and the two sets of vector access ports are connected to the two sets of buffers by a 2×2 crossbar.
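
The buffer assignment and read-out timing described for FIG. 7 can be sketched as follows. This Python example is an illustration only: it follows the embodiment's description, in which the four words arriving in one machine cycle are split two and two between BF0 and BF1 (m = 2), and it assumes that each vector is read out two words per machine cycle with strictly alternating buffers. The names `buffer_of` and `read_schedule` are hypothetical.

```python
WORDS_PER_PORT = 2  # m = 2 in the described embodiment


def buffer_of(index, m=WORDS_PER_PORT):
    """Return 0 (BF0) or 1 (BF1): the buffer that receives element `index`."""
    return (index // m) % 2


def read_schedule(first_time, length=32, m=WORDS_PER_PORT):
    """Map each read time to (buffer, element indices read in that cycle)."""
    schedule = {}
    for k in range(length // m):
        elements = tuple(range(k * m, (k + 1) * m))
        schedule[first_time + k] = (buffer_of(elements[0], m), elements)
    return schedule


def conflicts(first, second):
    """Times at which both schedules read from the same buffer."""
    return [t for t in first if t in second and first[t][0] == second[t][0]]


bf0 = [i for i in range(32) if buffer_of(i) == 0]  # 0, 1, 4, 5, 8, 9, ...
bf1 = [i for i in range(32) if buffer_of(i) == 1]  # 2, 3, 6, 7, 10, 11, ...

a_reads = read_schedule(1)    # vector A: BF0 at time 1, BF1 at time 2, ...
b_reads = read_schedule(12)   # vector B: BF0 at time 12, BF1 at time 13, ...

print(conflicts(a_reads, b_reads))                 # []
print(len(conflicts(a_reads, read_schedule(11))))  # 6
```

Starting vector B at time 12 rather than time 11 puts the two read streams on opposite buffers in every cycle in which they overlap, so the 2×2 crossbar can route one buffer to each vector access port without contention.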