JP2704121B2

JP2704121B2 - Vector processing equipment

Info

Publication number: JP2704121B2
Application number: JP6226484A
Authority: JP
Inventors: 康宏井川
Original assignee: 甲府日本電気株式会社
Priority date: 1994-09-21
Filing date: 1994-09-21
Publication date: 1998-01-26
Anticipated expiration: 2013-01-26
Also published as: JPH0895956A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of Industrial Application] The present invention relates to a vector processing device.

[0002]

[Prior Art] In general, to process large amounts of data at high speed between main memory and registers or arithmetic units, a vector processing device achieves its speed by continuously supplying a plurality of data items to a memory access processing unit simultaneously, at the same timing.

[0003] As shown in FIG. 6, a conventional vector processing device of this type has: a plurality of input ports (four in this example), one per element of a vector request; input registers 31a to 31d, one for each input port; buffers 32a to 32d and buffers 33a to 33d, which operate identically and provide buffering when a port conflict occurs; a port conflict detection circuit 37 that detects port conflicts and controls the buffers; an output element detection circuit 36 that derives the output port from the memory address of each input element and determines which elements are output at a given timing; selectors 34a to 34d that select the input element for each output port under control signals from the output element detection circuit 36; and output registers 35a to 35d corresponding to the output ports.

[0004] In this conventional vector processing device, each output port is 8 bytes wide, and 4-byte memory access instructions are also carried out in 8-byte units.

[0005] FIG. 2, which is used to explain both the conventional example and the present invention, shows for each vector element (hereinafter, "element") of a vector store instruction its address and the output port to which it is output. In FIG. 2, with a base address of "0" and a distance of "4" for the vector instruction, each pair of consecutive elements (elements 0 and 1, elements 2 and 3, and so on) maps to the same output port.
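The address-to-port relationship described for FIG. 2 can be sketched as follows. This is a minimal illustration, not the patent's circuit: the function names are hypothetical, and the rule that consecutive 8-byte words are interleaved across the four output ports (the modulo step) is an assumption that agrees with the element pairs named in the text.

```python
PORT_WIDTH_BYTES = 8   # output port width stated in the text
NUM_PORTS = 4          # four output ports in the example

def element_address(index, base=0, distance=4):
    """Byte address of a vector element for a given base address and distance."""
    return base + index * distance

def output_port(address, port_width=PORT_WIDTH_BYTES, num_ports=NUM_PORTS):
    """Output port for an address.

    Assumption: consecutive 8-byte words are interleaved across the ports.
    """
    return (address // port_width) % num_ports

for i in range(8):
    addr = element_address(i)
    print(f"element {i}: address {addr:2d} -> output port {output_port(addr)}")
```

With a distance of 4, two consecutive 4-byte elements fall inside the same 8-byte word, which is why they share an output port.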

[0006] Next, the operation of the conventional vector processing device for the vector instruction of FIG. 2 is described. First, elements are stored continuously in input registers 31a to 31d, four at a time, in order from element 0. Elements 0 to 3, stored in read registers 33a to 33d at a given timing, are checked for port conflicts by port conflict detection circuit 37. In this case elements 0 and 1 both head for output port 0 and elements 2 and 3 both head for output port 1, so port conflicts occur. When a port conflict occurs, output element detection circuit 36 detects the highest-priority element among the conflicting elements (the one with the smallest element number). Selector 34a selects element 0 from read register 33a and selector 34b selects element 2 from read register 33c, and these elements are stored in output registers 35a and 35b, respectively.

[0007] Because a conflict has occurred, port conflict detection circuit 37 issues a hold request at this time. Read registers 33a to 33d hold their contents and buffers 32a to 32d hold their read addresses, and this state is maintained until the hold request is released. In other words, the elements that continue to arrive at input registers 31a to 31d accumulate in buffers 32a to 32d. Elements 1 and 3, which lost the conflict, undergo conflict detection at the next timing; this time no port conflict occurs, so they are stored in output registers 35a and 35b, respectively. Because no conflict occurs, the hold request from port conflict detection circuit 37 is released, and at the next timing the four elements 4 to 7 are stored in read registers 33a to 33d. The same processing is repeated until all elements have been processed.

[0008] FIG. 7 is a timing chart of the above operation. Taking the timing at which elements 0 and 2 arrive at output registers 35a to 35d as timing 1, it shows the timing at which each element arrives at its output port. It also shows, for each timing, the elements subject to port conflict detection and the number of elements arriving at the output port registers. For the vector instruction of FIG. 2, port conflicts mean that only two elements are output per timing although four elements are input per timing. This is 1/2 of the maximum throughput.
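The halved throughput can be reproduced with a small behavioural model of the conventional device. This is a sketch under stated assumptions, not the circuit of FIG. 6: `port_of`, `arbitrate`, and `simulate_conventional` are hypothetical names, the modulo port interleave is assumed, and the hold mechanism is modelled simply as "do not load the next four elements until the read registers are empty".

```python
def port_of(element, base=0, distance=4, port_width=8, num_ports=4):
    """Output port of an element (modulo interleave is an assumption)."""
    return ((base + element * distance) // port_width) % num_ports

def arbitrate(candidates):
    """Per output port, the smallest element number wins; the rest wait."""
    winner_by_port = {}
    for element in sorted(candidates):
        winner_by_port.setdefault(port_of(element), element)
    winners = sorted(winner_by_port.values())
    losers = [e for e in sorted(candidates) if e not in winners]
    return winners, losers

def simulate_conventional(total_elements, group=4):
    """Return the list of elements output at each timing."""
    held = []           # contents of read registers 33a-33d
    next_element = 0
    outputs = []
    while held or next_element < total_elements:
        if not held:    # hold released: load the next group of four
            held = list(range(next_element,
                              min(next_element + group, total_elements)))
            next_element += group
        winners, held = arbitrate(held)
        outputs.append(winners)
    return outputs

print(simulate_conventional(8))   # [[0, 2], [1, 3], [4, 6], [5, 7]]
```

Every timing outputs two elements, matching the 1/2 throughput described for FIG. 7.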

[0009]

[Problems to Be Solved by the Invention] As described above, in the conventional vector processing device, for 4-byte vector load and store instructions (when the distance is an odd multiple of 4 bytes), throughput falls to 1/2 of its maximum value. This causes a serious degradation of the main memory access time of vector load and store instructions.

[0010]

[Means for Solving the Problems] The vector processing device of the present invention comprises one or more vector operation units that perform vector operations on each vector element; a main storage unit composed of a plurality of memory modules that can operate in parallel with multi-byte data as the minimum request unit; and a memory access control unit that controls data transfer, for each vector element, between the vector operation units and the main storage unit via output ports. The memory access control unit is characterized by comprising: n (n ≥ 1) input registers that hold vector elements input in pipeline fashion from the vector operation units; m sets (m ≥ 2) of buffers, each set storing the contents of the input registers in units of n vector elements and in element-number order; m × n read registers that hold vector elements read from the m sets of buffers in response to a clock; a port conflict detection circuit that detects, for the vector elements held in the read registers, output port conflicts that occur when the bit width of a vector element is equal to or smaller than the minimum access unit; an output element detection circuit that, when a conflict is detected, detects the conflicting vector element with the smallest element number; and selectors, equal in number to the output ports, that select that vector element from the read registers in response to detection of the output element.

[0011]

[Embodiments] Embodiments of the present invention are described in detail below with reference to the drawings.

[0012] FIG. 1 is a block diagram showing a first embodiment of the present invention. This vector processing device can transfer data in 8-byte units in parallel.

[0013] In FIG. 1, input registers 11a to 11d are a group of registers that receive the input of a vector request, one register per element (four elements in this example). Buffers 12a to 12h are a group of buffers that absorb the input requests that continue to be issued while elements are waiting because of a port conflict. Buffers 12a to 12d and buffers 12e to 12h each correspond to the elements of input registers 11a to 11d, and the elements in input registers 11a to 11d are written alternately to buffers 12a to 12d and to buffers 12e to 12h.
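The alternating write pattern can be sketched as a simple mapping from element number to buffer. The function name `buffer_for` is hypothetical, and the lane assignment (the k-th element of each group of four goes to the k-th buffer of the selected set) is an assumption based on the statement that each buffer set corresponds to the elements of input registers 11a to 11d.

```python
def buffer_for(element, group_size=4):
    """Name of the buffer that receives a given element.

    Even-numbered groups of four go to buffers 12a-12d and odd-numbered
    groups to buffers 12e-12h (alternating, as described in the text).
    """
    group, lane = divmod(element, group_size)
    letters = "abcd" if group % 2 == 0 else "efgh"
    return "12" + letters[lane]

for element in (0, 3, 4, 7, 8):
    print(element, "->", buffer_for(element))
```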

[0014] Read registers 13a to 13h are the registers into which buffers 12a to 12h are read, and they are the registers examined for port conflicts. Output element detection circuit 16 generates the selection conditions for selectors 14a to 14d, which correspond to the output ports. Port conflict detection circuit 17 detects port conflicts among the elements in read registers 13a to 13h and also controls buffers 12a to 12h and read registers 13a to 13h. Output registers 15a to 15d are a group of registers corresponding to the output ports (four ports in this example) and store the elements selected by selectors 14a to 14d.

[0015] FIG. 2 shows, as an example, the relationship between the address of each element and the output port to which it is output for a 4-byte vector store instruction with a base address of "0" and a distance of "4". For this 4-byte vector store instruction, two elements with consecutive element numbers are output to the same output port.

[0016] Next, the operation of the first embodiment when the request shown in FIG. 2 is issued is described. The four elements 0 to 3 enter through input ports 0 to 3 and are stored in input registers 11a to 11d, respectively. At the next timing the four elements 4 to 7 arrive at the input ports. Thereafter four elements are issued at every timing and stored in input registers 11a to 11d one group after another.

[0017] The first four elements, elements 0 to 3, pass through buffers 12a to 12d and are stored in read registers 13a to 13d in element-number order. Up to this point read registers 13e to 13h hold no valid elements. Port conflict detection circuit 17 then performs port conflict detection on elements 0 to 3. As FIG. 2 shows, elements 0 and 1 are both output to output port 0 and elements 2 and 3 are both output to output port 1, so port conflicts occur between elements 0 and 1 and between elements 2 and 3. When a port conflict occurs, only one element can be output to an output port, so among the conflicting elements the one with the smallest element number is given priority and output. The other elements cannot be output at this time and wait in read registers 13a to 13d. That is, elements 0 and 2 are output and stored in output registers 15a and 15b, while elements 1 and 3 are held in read registers 13b and 13d.

[0018] Meanwhile, the next four elements, elements 4 to 7, pass through buffers 12e to 12h and are stored in read registers 13e to 13h. Elements 1 and 3, which lost the conflict at the previous timing, together with elements 4 to 7 become the targets of the next port conflict detection. The output ports in FIG. 2 show that elements 4 and 5 conflict with each other, as do elements 6 and 7. The elements that proceed to the output ports are therefore the four elements 1, 3, 4, and 6. Port conflict detection circuit 17 detects the conflicts involving elements 5 and 7, which are held in read registers 13f and 13h and made to wait.
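The two arbitration steps just described (lowest element number wins per output port, losers are carried over) can be sketched as follows. The names `port_of` and `arbitrate` are hypothetical, and the modulo port interleave is an assumption consistent with the port assignments named in the text.

```python
def port_of(element, base=0, distance=4, port_width=8, num_ports=4):
    """Output port of an element (modulo interleave is an assumption)."""
    return ((base + element * distance) // port_width) % num_ports

def arbitrate(candidates):
    """Per output port, the smallest element number wins; the rest wait."""
    winner_by_port = {}
    for element in sorted(candidates):
        winner_by_port.setdefault(port_of(element), element)
    winners = sorted(winner_by_port.values())
    losers = [e for e in sorted(candidates) if e not in winners]
    return winners, losers

# Timing 1: only elements 0-3 are in the read registers.
first_out, first_held = arbitrate([0, 1, 2, 3])
# Timing 2: the held elements compete together with elements 4-7.
second_out, second_held = arbitrate(first_held + [4, 5, 6, 7])

print(first_out, first_held)     # [0, 2] [1, 3]
print(second_out, second_held)   # [1, 3, 4, 6] [5, 7]
```

Because the held elements 1 and 3 target ports 0 and 1 while elements 4 to 7 target ports 2 and 3, the wider detection window fills all four output ports at the second timing.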

[0019] At the next timing the remaining elements 5 and 7 are output. Because it took 3T to output the first eight elements, one T's worth of subsequent elements, that is, four elements, is left in the buffers. From the fourth T onward, therefore, eight elements are always subject to port conflict detection, and port conflicts among them leave four elements to be output at every timing.

[0020] FIG. 3 shows the operation of this embodiment for the vector store instruction of FIG. 2. Taking the timing at which elements 0 and 2 arrive at output registers 15a to 15d as timing 1, it shows, for each timing at which elements arrive at output registers 15a to 15d, the elements subject to port conflict detection and the number of elements output. FIG. 3 shows that at timings 1, 2, and 3 the number of output elements is 2, 4, and 2, so throughput is below the maximum. From timing 4 onward (element 8 onward), however, four elements are output per timing, which equals the number of input elements and is therefore the maximum throughput.
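The output counts described for FIG. 3 can be reproduced with a small behavioural model. This is a sketch, not the circuit of FIG. 1. All names are hypothetical, the modulo port interleave is assumed, and the refill rule is an assumption chosen to match the counts in the text: group g reaches the buffers at timing g + 1, registers 13a to 13d take even-numbered groups only when both register blocks are empty, and registers 13e to 13h take the following odd-numbered group as soon as it has arrived.

```python
def port_of(element, base=0, distance=4, port_width=8, num_ports=4):
    """Output port of an element (modulo interleave is an assumption)."""
    return ((base + element * distance) // port_width) % num_ports

def arbitrate(candidates):
    """Return the winners: per output port, the smallest element number."""
    winner_by_port = {}
    for element in sorted(candidates):
        winner_by_port.setdefault(port_of(element), element)
    return sorted(winner_by_port.values())

def simulate_first_embodiment(total_elements, group=4):
    """Return the list of elements output at each timing."""
    num_groups = -(-total_elements // group)   # ceiling division

    def load(g):
        return list(range(g * group, min((g + 1) * group, total_elements)))

    block_a, block_b = [], []   # read registers 13a-13d and 13e-13h
    next_group, timing, outputs = 0, 0, []
    while block_a or block_b or next_group < num_groups:
        timing += 1
        arrived = min(timing, num_groups)   # groups available in the buffers
        # Assumed rule: a new pair of groups starts only when both blocks are empty.
        if (next_group < arrived and next_group % 2 == 0
                and not block_a and not block_b):
            block_a = load(next_group)
            next_group += 1
        if next_group < arrived and next_group % 2 == 1 and not block_b:
            block_b = load(next_group)
            next_group += 1
        won = arbitrate(block_a + block_b)
        block_a = [e for e in block_a if e not in won]
        block_b = [e for e in block_b if e not in won]
        outputs.append(won)
    return outputs

print([len(step) for step in simulate_first_embodiment(24)])
# [2, 4, 2, 4, 4, 4, 4]
```

Under these assumptions the first three timings output 2, 4, and 2 elements, and every later timing outputs 4.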

[0021] FIG. 4 is a block diagram showing a second embodiment of the present invention. Like the first embodiment, this embodiment can transfer data in 8-byte units in parallel. Input registers 21a to 21d, buffers 22a to 22h, read registers 23a to 23h, selectors 24a to 24d, output registers 25a to 25d, output element detection circuit 26, and port conflict detection circuit 27 of this embodiment correspond to input registers 11a to 11d, buffers 12a to 12h, read registers 13a to 13h, selectors 14a to 14d, output registers 15a to 15d, output element detection circuit 16, and port conflict detection circuit 17 of the first embodiment, respectively.

[0022] In addition, this embodiment has an element align circuit 28 that rearranges the elements stored in read registers 23a to 23h when a port conflict occurs. For port conflict detection, four elements are treated as one block in this example. That is, the elements stored in read registers 23a to 23d form one block and the elements stored in read registers 23e to 23h form another. For each block, the device determines whether its elements have been output or are still pending because of port conflicts. When it detects that all elements in a block have been output, element align circuit 28 shifts that block out and stores the following block in the registers subject to port conflict detection.

[0023] Next, the operation of the second embodiment when the request shown in FIG. 2 is issued is described.

[0024] The feature of this embodiment is that at timing 2, elements 5 and 7, which lost the conflict, are stored by element align circuit 28 in read registers 23b and 23d rather than in read registers 23f and 23h. As a result, elements 8 to 11 of the following block are stored in registers 23e to 23h without waiting for elements 5 and 7 to be output, and become targets of the next port conflict detection. Consequently, elements 5 and 7 and elements 8 and 10 are output at the next timing. Block-wise shifts are then performed in the same way, and four elements are output at every timing.

[0025] The above operation becomes clearer with reference to FIG. 5, which shows the operation of this embodiment for the vector store instruction of FIG. 2. According to FIG. 5, two elements are output at timing 1, but at every later timing four elements are output, which equals the number of input elements and is the maximum throughput. The second embodiment is thus a further improvement on the first embodiment.
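The effect of the element align circuit can be illustrated with a behavioural model in the same style. This is a sketch, not the circuit of FIG. 4. All names are hypothetical, the modulo port interleave is assumed, group g is assumed to reach the buffers at timing g + 1, and the align step is modelled as "when registers 23a to 23d are empty, move whatever remains in 23e to 23h forward and refill 23e to 23h with the next group".

```python
def port_of(element, base=0, distance=4, port_width=8, num_ports=4):
    """Output port of an element (modulo interleave is an assumption)."""
    return ((base + element * distance) // port_width) % num_ports

def arbitrate(candidates):
    """Return the winners: per output port, the smallest element number."""
    winner_by_port = {}
    for element in sorted(candidates):
        winner_by_port.setdefault(port_of(element), element)
    return sorted(winner_by_port.values())

def simulate_second_embodiment(total_elements, group=4):
    """Return the list of elements output at each timing."""
    num_groups = -(-total_elements // group)   # ceiling division

    def load(g):
        return list(range(g * group, min((g + 1) * group, total_elements)))

    front, back = [], []   # read registers 23a-23d and 23e-23h
    next_group, timing, outputs = 0, 0, []
    while front or back or next_group < num_groups:
        timing += 1
        arrived = min(timing, num_groups)   # groups available in the buffers
        # Modelled align step: leftovers move forward once the front block is empty.
        if not front and back:
            front, back = back, []
        if not front and next_group < arrived:
            front = load(next_group)
            next_group += 1
        if not back and next_group < arrived:
            back = load(next_group)
            next_group += 1
        won = arbitrate(front + back)
        front = [e for e in front if e not in won]
        back = [e for e in back if e not in won]
        outputs.append(won)
    return outputs

steps = simulate_second_embodiment(24)
print(steps[:3])                      # [[0, 2], [1, 3, 4, 6], [5, 7, 8, 10]]
print([len(step) for step in steps])  # [2, 4, 4, 4, 4, 4, 2]
```

Only the first timing and the final drain of the vector fall short of four elements in this model; the third timing outputs elements 5, 7, 8, and 10, as the text describes.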

[0026] As described above, the present invention improves the throughput of vector memory access instructions. Although the embodiments use four input ports and four output ports, it is clear that any number of ports can be handled in the same way. Furthermore, although the expansion here is to eight ports, twice the four input ports, this can likewise be any multiple of the number of input ports. The embodiments describe a vector store instruction, but the same can be achieved for load instructions.

[0027]

[Effects of the Invention] As described above, the present invention provides, in a vector processing device, means for virtually expanding the input ports when a port conflict occurs in a vector load or store instruction. This increases the number of elements subject to port conflict detection, so that more elements can be output even when port conflicts occur, and throughput improves. For example, for a 4-byte vector memory access instruction, throughput improves roughly twofold.
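As a rough check on the "roughly twofold" figure, the timing counts given in the description can be compared for a vector of N elements. This is only a sketch: it assumes N is a multiple of 8, a base address of 0, and a distance of 4, and it uses the per-timing output counts described for FIG. 7 (2 per timing) and FIG. 3 (2, 4, 2, then 4 per timing). The symbols T_conv and T_1 are introduced here for illustration and do not appear in the specification.

```latex
T_{\mathrm{conv}}(N) = \frac{N}{2}, \qquad
T_{1}(N) = 3 + \frac{N - 8}{4} = \frac{N}{4} + 1, \qquad
\frac{T_{\mathrm{conv}}(N)}{T_{1}(N)} = \frac{2N}{N + 4}
\;\longrightarrow\; 2 \quad (N \to \infty)
```

For N = 24, for example, this gives 12 timings for the conventional device against 7 for the first embodiment, a ratio of about 1.7 that approaches 2 as the vector grows.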

[Brief description of the drawings]

FIG. 1 is a block diagram of a vector processing device according to a first embodiment of the present invention.

FIG. 2 is a diagram showing the address and output port of each element for a vector store instruction.

FIG. 3 is a timing chart for explaining the operation of the first embodiment of the present invention.

FIG. 4 is a block diagram of a vector processing device according to a second embodiment of the present invention.

FIG. 5 is a timing chart for explaining the operation of the second embodiment of the present invention.

FIG. 6 is a block diagram of a conventional vector processing device.

FIG. 7 is a timing chart for explaining the operation of the conventional example.

[Explanation of symbols]

11a to 11d: input registers
12a to 12h: buffers
13a to 13h: read registers
14a to 14d: selectors
15a to 15d: output registers
16: output element detection circuit
17: port conflict detection circuit
21a to 21d: input registers
22a to 22h: buffers
23a to 23h: read registers
24a to 24d: selectors
25a to 25d: output registers
26: output element detection circuit
27: port conflict detection circuit
28: element align circuit
31a to 31d: input registers
32a to 32d: buffers
33a to 33d: read registers
34a to 34d: selectors
35a to 35d: output registers
36: output element detection circuit
37: port conflict detection circuit

Claims

(57) [Claims]

1. A vector processing device comprising: one or more vector operation units that perform vector operations on each vector element; a main storage unit composed of a plurality of memory modules that can operate in parallel with multi-byte data as the minimum request unit; and a memory access control unit that controls data transfer, for each vector element, between the vector operation units and the main storage unit via output ports, wherein the memory access control unit comprises: n (n ≥ 1) input registers that hold vector elements input in pipeline fashion from the vector operation units; m sets (m ≥ 2) of buffers, each set storing the contents of the input registers in units of n vector elements and in element-number order; m × n read registers that hold vector elements read from the m sets of buffers in response to a clock; a port conflict detection circuit that detects, for the vector elements held in the read registers, output port conflicts that occur when the bit width of a vector element is equal to or smaller than the minimum access unit; an output element detection circuit that, when a conflict is detected, detects the conflicting vector element with the smallest element number; and selectors, equal in number to the output ports, that select that vector element from the read registers in response to detection of the output element.

2. The vector processing device according to claim 1, further comprising an element align circuit that shifts, from the buffers to the read registers, the remaining vector elements that lost the output port conflict, so as to provide a storage area for subsequently input vector elements.