JP2000029703A

JP2000029703A - DSP architecture optimized for memory access

Info

Publication number: JP2000029703A
Application number: JP11094975A
Authority: JP
Inventors: Didier Fuin; フュワンディディエ
Original assignee: STMicroelectronics SA
Current assignee: STMicroelectronics SA
Priority date: 1998-04-09
Filing date: 1999-04-01
Publication date: 2000-01-28
Also published as: FR2777370A1; FR2777370B1; EP0949565A1; US6564309B1

Abstract

PROBLEM TO BE SOLVED: To provide a superscalar processor with maximum efficiency when executing loops that include memory access instructions. SOLUTION: The processor includes at least one memory access unit (MEMU) 10, which supplies a read or write address on the address bus of a memory 16 in response to the execution of a read or write instruction; an arithmetic and logic unit (ALU) 12, which operates in parallel with the memory access unit and is arranged at least to supply data on the data bus of the memory while the memory access unit supplies a write address; and a store address queue (STAQ), in which each write address supplied by the memory access unit is stored while waiting for the data to be written to become available.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to digital signal processors (DSPs), and more particularly to an architecture that avoids problems arising from memory latency.

[0002]

[Prior Art] FIG. 1 schematically shows part of a conventional DSP architecture. The DSP includes four processing units operating in parallel. Two of these units are memory access units (MEMU) 10. An arithmetic and logic unit (ALU) 12 and a branch unit (BRU) 14 are also provided. Each MEMU is connected to a memory 16 through an independent bus.

[0003] The branch unit BRU receives, from an instruction memory (not shown), a compound instruction INST that may contain four elementary instructions, one for each unit, to be supplied to the units in parallel. The BRU extracts the instruction intended for itself and distributes the three remaining instructions I1, I2 and I3 in parallel to the ALU and MEMU units.

[0004] Each of the ALU and MEMU units typically includes a FIFO instruction queue 18 in which instructions wait before being processed by the corresponding unit.

[0005] A DSP of the type shown in FIG. 1 is optimized for vector operations of the form X[i] OP Y[j], where i and j usually vary within a loop and OP denotes any operation to be performed by the ALU 12. Indeed, the operands X[i] and Y[j] can be fetched from the memory 16 together over the two buses and, in theory, processed by the ALU 12 in the same cycle.

[0006]

[Problem to Be Solved by the Invention] In practice, difficulties arise from the structure of the memories currently used, which are generally SRAMs. Although a memory access can be issued in every cycle, reading data from a conventional SRAM usually has a latency of two cycles. When a read instruction is executed, the address is presented to the memory. One further cycle is needed to supply the memory with the read access signal, and a final cycle is needed for the memory to place the data on its data bus.
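The two-cycle read latency can be illustrated with a very small timing helper. This is only a sketch: the function name `read_timeline` and the convention that a read issued in cycle c delivers its data in cycle c + 2 are assumptions made for illustration, chosen to agree with the cycle-by-cycle description that follows in the text.

```python
def read_timeline(issue_cycles, latency=2):
    """Map each read-issue cycle to the cycle in which its data is available.

    Assumption (illustrative): reads are pipelined, so a new address can be
    presented every cycle, and each read returns its data `latency` cycles
    after the address was presented.
    """
    return {cycle: cycle + latency for cycle in issue_cycles}


# Three back-to-back reads overlap in the memory pipeline.
timeline = read_timeline([1, 2, 3])
```

With this convention, a load executed in cycle 1 delivers its operand in cycle 3, so an instruction that depends on it cannot run before cycle 3.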

[0007] To illustrate the resulting problem, consider a common loop whose function is to increment, by a constant, successive values stored in memory. Written in its straightforward form, the loop is as follows.

[0008]
    LD: R1 = [i]            (1)
    OP: R1 = R1 + R2
    ST: [i] = R1
    BR: test i, i++, loop

[0009] This loop explicitly uses only one of the MEMU units. It loads (LD) into register R1 the value stored in memory at address i, increments (OP) the content of register R1 by the value held in register R2, stores (ST) the new content of register R1 at address i, and finally increments and tests (BR) the address i in order to restart the loop. The loop is exited when the branch unit BRU detects that address i has reached a predetermined value. In a DSP there is generally no BR-type instruction as such: the loop is programmed by first setting registers provided for this purpose in the BRU, which then performs the test, increment and branch autonomously.
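For readers more used to high-level code, the effect of loop (1) can be sketched in Python as follows. The function name `increment_loop` and the `start`/`end` parameters are illustrative assumptions; the patent only says that the loop ends when address i reaches a predetermined value.

```python
def increment_loop(mem, r2, start, end):
    """Add the constant r2 to mem[start:end], one element per iteration.

    Each statement is annotated with the operation of loop (1) it mirrors.
    """
    mem = list(mem)              # work on a copy of the memory contents
    for i in range(start, end):  # BR: test i, i++, loop
        r1 = mem[i]              # LD: R1 = [i]
        r1 = r1 + r2             # OP: R1 = R1 + R2
        mem[i] = r1              # ST: [i] = R1
    return mem


result = increment_loop([1, 2, 3, 4], 10, 0, 4)
```

The sketch captures only what the loop computes; it says nothing about timing, which is the subject of the discussion that follows.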

[0010] Register R1 is a working register of the ALU, while address i is stored in a register of the branch unit BRU. Operations LD and ST are executed by one of the MEMU units, operation OP by the ALU, and operation BR by the BRU. Operations LD and OP are supplied in parallel to the MEMU and ALU units in one compound instruction, while operations ST and BR are supplied in parallel to the MEMU and BRU units in a second compound instruction.

[0011] In fact, some compound instructions contain fields that are supplied to several units at once. For example, a load instruction LD, which is intended for a MEMU, also contains a field intended for the ALU, so that the ALU can prepare one of its registers (R1) to receive the data supplied by the memory. Likewise, a store instruction ST contains a field intended for the ALU that selects the register whose content is to be placed on the memory bus. Thus, as shown in FIG. 1, a field f of each of the instructions I2 and I3 supplied to the MEMU units is supplied to the instruction queue 18 of the ALU in parallel with a normal instruction I1, and the ALU can execute a normal instruction and an operation indicated by a field f in one cycle.

[0012] The following table shows, for several iterations of the loop, the operations executed by one memory access unit MEMU and by the ALU. The branch instructions BR raise no problem and are not shown in the table.

[0013] Each column of the table corresponds to one instruction cycle, and each operation in the table carries a number identifying the loop iteration to which it belongs.

[0014]

[Table 1]

[0015] In the first cycle, the MEMU and ALU units receive the first LD and OP instructions (LD1, OP1). The MEMU executes LD1 immediately, starting a read cycle for the value stored in memory at address i, and LD1 is removed from the MEMU instruction queue. Instruction OP1 needs the value fetched by LD1 and cannot yet be executed; it waits in the ALU instruction queue.

[0016] In the second cycle, the MEMU receives the first ST instruction (ST1). ST1 needs the result of operation OP1, so it cannot yet be executed and waits in the MEMU instruction queue. OP1 still waits in the ALU queue because the memory has not yet returned the operand it requires.

[0017] In the third cycle, the MEMU and ALU units receive instructions LD2 and OP2. These are placed in the queues behind the not-yet-executed instructions ST1 and OP1. The memory finally returns the operand required by OP1, which is therefore executed and removed from the instruction queue.

[0018] In cycle 4, the MEMU receives instruction ST2, which is placed in the MEMU queue behind LD2. Since OP1 was executed in the previous cycle, its result is available; ST1 is therefore executed and removed from the queue. Although OP2 is now alone in the ALU queue, it cannot yet be executed because it requires the operand to be fetched by LD2.

[0019] In cycle 5, the MEMU and ALU units receive instructions LD3 and OP3. Instruction LD2 is executed and removed from the queue.

[0020] Instruction OP2 must continue to wait in the queue, because the operand it requires will only be returned by the memory two cycles later, in response to LD2.

[0021] From the fifth cycle onward, execution of the instructions of the second loop iteration proceeds in the same way as for the first iteration, which began in cycle 1.

[0022] As the table shows, although the processor can perform one memory access in every cycle, it performs memory accesses in only two out of every four cycles. The loop execution efficiency is thus only 50%.

[0023] Moreover, with each new iteration of the loop, the MEMU instruction queue receives additional instructions and will eventually overflow. To avoid overflow, the supply of instructions must be halted periodically so that the queue can drain, which further reduces efficiency.
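The stall pattern described in paragraphs [0015] to [0023] can be reproduced with a small cycle-level model. The sketch below is not taken from the patent: the function name `simulate_conventional` and its bookkeeping are illustrative, and it assumes in-order FIFO queues, one LD/OP pair dispatched on odd cycles and one ST on even cycles, an OP that can run two cycles after its LD, and an ST that can run one cycle after its OP.

```python
from collections import deque


def simulate_conventional(cycles, read_latency=2):
    """Return (cycles with a memory access, MEMU queue depth per cycle)."""
    memu_q = deque()   # in-order MEMU queue of ("LD", k) / ("ST", k)
    alu_q = deque()    # in-order ALU queue of iterations awaiting their OP
    ld_cycle = {}      # iteration -> cycle in which its LD executed
    op_cycle = {}      # iteration -> cycle in which its OP executed
    access_cycles, depth = [], []
    for cycle in range(1, cycles + 1):
        k = (cycle + 1) // 2                  # iteration number, from 1
        if cycle % 2 == 1:                    # compound instruction LD + OP
            memu_q.append(("LD", k))
            alu_q.append(k)
        else:                                 # compound instruction ST (+ BR)
            memu_q.append(("ST", k))
        # MEMU: only the head of the queue may execute.
        kind, it = memu_q[0]
        if kind == "LD" or op_cycle.get(it, cycle) < cycle:
            memu_q.popleft()
            access_cycles.append(cycle)
            if kind == "LD":
                ld_cycle[it] = cycle
        # ALU: the head OP runs once its operand has come back from memory.
        if alu_q and alu_q[0] in ld_cycle:
            if ld_cycle[alu_q[0]] + read_latency <= cycle:
                op_cycle[alu_q.popleft()] = cycle
        depth.append(len(memu_q))
    return access_cycles, depth


accesses, depth = simulate_conventional(40)
efficiency = len(accesses) / 40
```

Under these assumptions the model accesses memory in cycles 1, 4, 5, 8, 9 and so on, which is two accesses per four cycles, and the MEMU queue depth keeps growing because instructions arrive faster than they are retired.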

[0024] In fact, programming the loop in its straightforward form is far from optimal because of memory latency.

[0025] To improve efficiency while taking memory latency into account, a technique known as loop unrolling is often used. It consists of programming a macro-loop, each iteration of which corresponds to several iterations of the normal loop. The preceding loop (1) is then written as follows.

[0026]
    LDa: R1 = [i]           (2)
    OPa: R1 = R1 + R2
    LDb: R3 = [i+1]
    OPb: R3 = R3 + R2
    STa: [i] = R1
    STb: [i+1] = R3
    BR: test i, i += 2, loop

[0027] In this loop, the value at address i is loaded into register R1 (LDa), the content of register R1 is incremented by the content of register R2 (OPa), the value at address i+1 is loaded into register R3 (LDb), the content of register R3 is incremented by the value held in register R2 (OPb), the content of register R1 is stored at address i (STa), the content of register R3 is stored at address i+1 (STb), and the variable i is incremented by 2 to restart the loop.
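The unrolled loop (2) can be sketched in Python in the same spirit. The function name `increment_loop_unrolled`, the `start`/`end` parameters and the requirement that the range have even length are illustrative assumptions; the patent does not discuss how an odd number of elements would be handled.

```python
def increment_loop_unrolled(mem, r2, start, end):
    """Add r2 to mem[start:end], processing two elements per iteration."""
    if (end - start) % 2 != 0:
        raise ValueError("this sketch assumes an even number of elements")
    mem = list(mem)                  # work on a copy of the memory contents
    for i in range(start, end, 2):   # BR: test i, i += 2, loop
        r1 = mem[i]                  # LDa: R1 = [i]
        r1 = r1 + r2                 # OPa: R1 = R1 + R2
        r3 = mem[i + 1]              # LDb: R3 = [i+1]
        r3 = r3 + r2                 # OPb: R3 = R3 + R2
        mem[i] = r1                  # STa: [i] = R1
        mem[i + 1] = r3              # STb: [i+1] = R3
    return mem


unrolled_result = increment_loop_unrolled([1, 2, 3, 4], 10, 0, 4)
```

The result is the same as for the straightforward loop; the point of unrolling is that the two loads are issued back to back, so the second load's latency overlaps with the first.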

[0028] This loop is programmed with four compound instructions. The first is formed of operations LDa and OPa, the second of LDb and OPb, the third of STa, and the fourth of STb and BR. The following table shows the sequence of operations for several iterations of the loop.

[0029]

[Table 2]

[0030] In the first cycle, the MEMU and ALU units receive the first LDa and OPa instructions (LDa1, OPa1). LDa1 is executed immediately and removed from the queue.

[0031] In the second cycle, the MEMU and ALU units receive instructions LDb1 and OPb1. LDb1 is executed immediately and removed from the queue. Instructions OPa1 and OPb1 remain in the ALU queue, waiting for the operands that the memory must return in response to LDa1 and LDb1.

[0032] In the third cycle, the MEMU receives instruction STa1. The memory returns the operand requested by LDa1 and required by OPa1. OPa1 is therefore executed and removed from the queue.

[0033] In the fourth cycle, the MEMU receives instruction STb1. The memory returns the operand requested by LDb1 and required by OPb1. OPb1 can therefore be executed.

[0034] In the fifth cycle, the MEMU and ALU units receive instructions LDa2 and OPa2. STa1 can be executed because the value it requires was computed by OPa1 two cycles earlier.

[0035] In the sixth cycle, the MEMU and ALU units receive instructions LDb2 and OPb2. STb1 can be executed because the value it requires was computed by OPb1 two cycles earlier.

[0036] In the seventh cycle, the MEMU receives instruction STa2. A new iteration of the loop begins with the execution of LDa2.

[0037] The table shows that the processor performs four memory accesses every six cycles, which amounts to an efficiency of 66%, a gain of 33% over the preceding solution.
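The two efficiency figures and the quoted gain can be checked with exact arithmetic. In this sketch the helper name `efficiency` is illustrative, and the 33% figure is read as a relative gain (unrolled efficiency divided by straightforward efficiency, minus one), which is an interpretation rather than something the text states explicitly.

```python
from fractions import Fraction


def efficiency(accesses, cycles):
    """Fraction of cycles in which a memory access is performed."""
    return Fraction(accesses, cycles)


straight = efficiency(2, 4)              # straightforward loop: 2 of 4 cycles
unrolled = efficiency(4, 6)              # unrolled loop: 4 of 6 cycles
relative_gain = unrolled / straight - 1  # improvement relative to 50%
```

Here `straight` is 1/2, `unrolled` is 2/3 (about 66.7%, quoted as 66% in the text) and `relative_gain` is 1/3 (about 33%).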

[0038] However, the MEMU queue still fills up progressively, so the supply of instructions must again be halted periodically. Instead of filling by two instructions every four cycles, as in the preceding solution, the queue fills by two instructions every six cycles.

[0039] Although the loop unrolling technique substantially improves efficiency, it is not an optimal solution for superscalar processors. In fact, it works even better on a superscalar processor.

[0040] An object of the present invention is to provide a superscalar processor with maximum efficiency when executing loops that include memory access instructions.

[0041]

[Means for Solving the Problem] This and other objects are achieved by a processor including at least one memory access unit, which supplies a read or write address on the address bus of a memory in response to the execution of a read or write instruction, and an arithmetic and logic unit, which operates in parallel with the memory access unit and is arranged at least to supply data on the data bus of the memory while the memory access unit supplies a write address. The processor includes a write address queue in which each write address supplied by the memory access unit is stored while waiting for the data to be written to become available.

[0042] According to an embodiment of the invention, the arithmetic and logic unit includes two independent instruction queues for receiving instructions awaiting execution. The first instruction queue receives logic and arithmetic instructions. The second instruction queue receives the fields of instructions supplied to the memory access unit that identify the registers of the arithmetic and logic unit involved in read or write operations.

[0043] According to an embodiment of the invention, the arithmetic and logic unit includes a store data queue in which each datum to be written to memory waits for a write address to be present in the write address queue.

[0044] According to an embodiment of the invention, the arithmetic and logic unit includes a load data queue into which each datum from the memory intended for the arithmetic and logic unit is written, and in which it waits until the arithmetic and logic unit is available.

[0045] According to an embodiment of the invention, the processor includes a branch unit that receives instructions and distributes them in parallel among itself, the memory access unit and the arithmetic and logic unit.

[0046] According to an embodiment of the invention, each of the units includes a store data queue in which each datum to be written to memory waits until a write address is present in the write address queue.

[0047] According to an embodiment of the invention, each of the units includes a load data queue into which each datum from the memory intended for that unit is written, and in which it waits until the unit is available.

[0048]

[Embodiments of the Invention] The foregoing objects, features and advantages of the present invention are described in detail in the following non-limiting description of specific embodiments, given with reference to the accompanying drawings.

[0049] In FIG. 2, a processor according to the invention includes, as in FIG. 1, two memory access units 10 (MEMU) and one arithmetic and logic unit 12 (ALU), all coupled to a branch unit 14. Units 10 and 12 include instruction queues into which the branch unit BRU stacks instructions awaiting execution.

[0050] According to the invention, each memory access unit MEMU includes a store address queue STAQ, from which the addresses used for write accesses to the memory 16 are supplied. Read addresses are supplied to the memory 16 in the conventional way, as indicated by the dashed line passing through queue STAQ.

[0051] In addition, the ALU includes a store data queue STDQ and a load data queue LDDQ, through which the data exchanged with the memory pass. In practice, every unit of the processor may exchange data with the memory, as is conventional, and therefore includes queues STDQ and LDDQ, as shown. Queues STAQ, STDQ and LDDQ are all of FIFO type.

[0052] Each time a MEMU executes a store instruction, the write address is stacked in the corresponding queue STAQ. The data to be written by this instruction are generally stacked in the corresponding store data queue STDQ of the ALU. For example, if the content of register R1 of the ALU is to be written at address i, the MEMU executing the write instruction stacks address i in its queue STAQ, while the ALU stacks the content of register R1 in its queue STDQ. Note that the data to be written need not be stacked in queue STDQ at the same time as the address is stacked in queue STAQ. This is an essential aspect of the invention.

[0053] In each cycle, the contents of queues STDQ and STAQ are polled. If a queue STAQ contains an address and one of the queues STDQ contains data, the data are written at the address held in queue STAQ, and queues STDQ and STAQ are updated accordingly. If these conditions are not met, that is, if queue STAQ is empty or all queues STDQ are empty, no write is performed. The situation in which two queues STDQ contain data at the same time never occurs.
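The polling rule of paragraph [0053] can be sketched as a pair of FIFOs that are checked once per cycle. The class name `WritePort` and its method names are illustrative assumptions, and the sketch models a single STAQ and a single STDQ only (one MEMU and one data-producing unit), ignoring memory bus arbitration.

```python
from collections import deque


class WritePort:
    """Pairs queued write addresses (STAQ) with queued write data (STDQ)."""

    def __init__(self, memory):
        self.memory = memory
        self.staq = deque()   # write addresses, stacked by the MEMU
        self.stdq = deque()   # write data, stacked by the ALU

    def push_address(self, address):
        self.staq.append(address)

    def push_data(self, value):
        self.stdq.append(value)

    def poll(self):
        """Perform at most one write; return True if a write happened."""
        if self.staq and self.stdq:
            self.memory[self.staq.popleft()] = self.stdq.popleft()
            return True
        return False          # STAQ or STDQ empty: no write this cycle


port = WritePort([0, 0, 0, 0])
port.push_address(2)          # the address may arrive first...
first = port.poll()           # ...and simply waits: no data yet
port.push_data(7)             # the data arrive in a later cycle
second = port.poll()          # both queues non-empty: the write happens
```

Because both queues are FIFOs, the oldest pending address is always paired with the oldest pending datum, which is what lets the address be stacked before its data exist.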

[0054] In a DSP with two memory access units MEMU, each location of a queue STDQ holds two data words, one for each MEMU. For a write to take place, the data word held in the STDQ location must additionally correspond to a MEMU whose queue STAQ is not empty.

[0055] This mechanism allows a write instruction to be executed immediately, regardless of the availability of the memory bus or of the data to be written. The data are actually written later, as soon as both the data and the memory bus are available. The mechanism is now described in detail with an example.

[0056] Consider again the example of loop (1), programmed in its straightforward form as follows.

[0057]
    LD: R1 = [i]            (1)
    OP: R1 = R1 + R2
    ST: [i] = R1
    BR: test i, i++, loop

[0058] As stated above, a load instruction LD or a store instruction ST is formed of two fields. One field, intended for the MEMU, indicates the read or write address. The other, intended for the ALU, indicates the register that is to receive the data read or whose content is to be written. The instruction LD above thus decomposes into an ALU instruction, LDA: R1 = LDDQ, which loads the first element of load queue LDDQ into register R1, and a MEMU instruction, LDM: R(i), which supplies address i in read mode.

[0059] The store instruction ST above decomposes into an ALU instruction, STA: STDQ = R1, which stacks the content of register R1 in store queue STDQ, and a MEMU instruction, STM: STAQ = i, which stacks the write address in address queue STAQ.
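The decomposition of LD and ST into a MEMU part and an ALU part can be written down as a small lookup. The function name `decompose` and the dictionary layout are illustrative assumptions; only the mnemonics LDM, LDA, STM and STA and their right-hand sides come from the text.

```python
def decompose(op, register, address):
    """Split a load or store into its MEMU and ALU sub-instructions."""
    if op == "LD":
        return {
            "MEMU": f"LDM: R({address})",        # supply the read address
            "ALU": f"LDA: {register} = LDDQ",    # take the datum from LDDQ
        }
    if op == "ST":
        return {
            "MEMU": f"STM: STAQ = {address}",    # stack the write address
            "ALU": f"STA: STDQ = {register}",    # stack the datum to write
        }
    raise ValueError(f"not a memory access instruction: {op!r}")


load_parts = decompose("LD", "R1", "i")
store_parts = decompose("ST", "R1", "i")
```

Splitting each memory instruction this way is what allows the address half to run ahead in the MEMU while the data half waits in the ALU for its operand.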

[0060] According to the invention, the fields f of read/write instructions intended for the ALU are stacked in an instruction queue 19 that is independent of the queue 18 in which the normal instructions I1 for the ALU (such as instruction OP) are received.

[0061] The following table shows, for several iterations of loop (1), the operations executed by one of the MEMU units and by the ALU, the contents of the instruction queues of the MEMU and ALU units, the content of queue STAQ of the MEMU, and finally the contents of queues STDQ and LDDQ of the ALU. In the table, each operation carries the number of the iteration to which it belongs.

[0062]

[Table 3]

[0063] In the first cycle, the MEMU and ALU units receive instructions LD1 and OP1. LD1 is decomposed into instruction LDM1, supplied to the MEMU, and instruction LDA1, supplied to the ALU. LDM1, which causes the read address to be issued, is executed immediately. LDA1 is made to wait because it needs a value from load queue LDDQ, which is empty. OP1, which needs the same value, is also made to wait.

[0064] In the second cycle, the MEMU receives instruction ST1, which is decomposed into instruction STM1, supplied to the MEMU, and instruction STA1, supplied to the ALU. STM1 is executed and stacks in queue STAQ the address i at which the result of operation OP1 is to be written. STA1 is made to wait behind LDA1.

[0065] In the third cycle, the MEMU and ALU units receive instructions LD2 (LDM2, LDA2) and OP2. The memory supplies queue LDDQ with the value [i] requested by LDM1. LDA1 is executed, transferring the value [i] from queue LDDQ to register R1. OP1 is executed, updating the content of register R1. Since the MEMU is free, LDM2 is also executed.

[0066] In the fourth cycle, the MEMU receives instruction ST2 (STM2, STA2). Instructions STM2 and STA1 are executed. STA1 copies the content R1_1 of register R1 into queue STDQ. STM2 stacks in queue STAQ the address i+1 at which the result of operation OP2 is to be written. In the same cycle, queue STAQ is found to contain address i and queue STDQ of the ALU is found to contain data R1_1. Since the memory bus is available, data R1_1 are written immediately at address i, and queues STAQ and STDQ are updated.

[0067] In the fifth cycle, the operations of the third cycle are repeated for a new loop iteration.

[0068] The table shows that the memory access unit MEMU executes an access instruction in every cycle, so efficiency is maximal. Moreover, the instruction queues do not fill up, and there is no need to halt the supply of instructions periodically to avoid queue overflow.
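The behaviour described in paragraphs [0063] to [0068] can be sketched as a cycle-level simulation. This is a simplified model, not the patented circuit: the function name `simulate_store_queues` and all bookkeeping variables are illustrative, it models one MEMU and one ALU, and it assumes that LD/OP pairs are dispatched on odd cycles and ST on even cycles, that read data reach LDDQ two cycles after LDM, that the ALU can execute one field operation and one OP per cycle, that STA runs one cycle after its OP, and that a write uses the memory bus only in a cycle in which no read address is issued.

```python
from collections import deque


def simulate_store_queues(initial, r2, read_latency=2):
    """Run loop (1) over `initial`; return (memory, access cycles, cycles)."""
    mem = list(initial)
    n = len(mem)
    memu_q, field_q, op_q = deque(), deque(), deque()  # MEMU, queue 19, queue 18
    staq, stdq, lddq = deque(), deque(), deque()
    in_flight = deque()       # (arrival cycle, value) of pending reads
    r1 = 0
    loads = ops = stores = written = 0
    op_cycle = {}             # iteration -> cycle in which its OP executed
    access_cycles = []
    for cycle in range(1, 4 * n + 8):
        it = (cycle - 1) // 2
        if it < n:            # dispatch by the branch unit
            if cycle % 2 == 1:
                memu_q.append(("LDM", it))
                field_q.append("LDA")
                op_q.append("OP")
            else:
                memu_q.append(("STM", it))
                field_q.append("STA")
        while in_flight and in_flight[0][0] <= cycle:
            lddq.append(in_flight.popleft()[1])        # memory fills LDDQ
        bus_busy = False
        if memu_q:            # the MEMU never has to wait
            kind, addr = memu_q.popleft()
            if kind == "LDM":
                in_flight.append((cycle + read_latency, mem[addr]))
                bus_busy = True
                access_cycles.append(cycle)
            else:
                staq.append(addr)                      # STM: STAQ = i
        if field_q:           # ALU queue 19: fields of LD/ST instructions
            if field_q[0] == "LDA" and lddq:
                field_q.popleft()
                r1 = lddq.popleft()                    # LDA: R1 = LDDQ
                loads += 1
            elif field_q[0] == "STA" and op_cycle.get(stores, cycle) < cycle:
                field_q.popleft()
                stdq.append(r1)                        # STA: STDQ = R1
                stores += 1
        if op_q and loads > ops:                       # ALU queue 18
            op_q.popleft()
            r1 = r1 + r2                               # OP: R1 = R1 + R2
            op_cycle[ops] = cycle
            ops += 1
        if staq and stdq and not bus_busy:             # poll STAQ and STDQ
            mem[staq.popleft()] = stdq.popleft()
            access_cycles.append(cycle)
            written += 1
        if written == n:
            return mem, access_cycles, cycle
    raise RuntimeError("simulation did not finish")


final_mem, accesses, total_cycles = simulate_store_queues([1, 2, 3, 4, 5, 6, 7, 8], 5)
```

Under these assumptions, once the pipeline has filled (from cycle 3 until the last load has been issued), every cycle contains either a read or a write, and the MEMU instruction queue is emptied each cycle instead of growing.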

[0069] In this example, queues STDQ and LDDQ never hold more than one element. They can therefore be reduced to simple registers, which may be the registers normally provided in conventional units to latch incoming and outgoing data.

[0070] In other cases, with more complex loops, queues STDQ and LDDQ may have to hold more than one element. The number of locations in queues STDQ and LDDQ is then increased to avoid any risk of overflow.

[0071] Of course, the present invention is open to various alterations, modifications and improvements that will readily occur to those skilled in the art. Such alterations, modifications and improvements are intended to be part of this disclosure and to fall within the spirit and scope of the invention. Accordingly, the foregoing description is by way of example only and is not intended to be limiting. The invention is limited only as defined in the claims and their equivalents.

[Brief description of the drawings]

FIG. 1 shows a conventional DSP architecture.

FIG. 2 shows a DSP architecture according to the present invention.

[Explanation of symbols]

10: memory access unit (MEMU)
12: arithmetic and logic unit (ALU)
14: branch unit (BRU)
16: memory
18: first instruction queue
19: second instruction queue

Claims

[Claims]

1. A processor comprising: at least one memory access unit (MEMU) that supplies a read or write address on an address bus of a memory (16) in response to the execution of a read or write instruction; an arithmetic and logic unit (ALU) that operates in parallel with the memory access unit and is arranged at least to supply data on a data bus of the memory while the memory access unit supplies a write address; and a write address queue (STAQ) in which each write address supplied by the memory access unit is stored while waiting for the data to be written to become available.

2. The processor according to claim 1, wherein the arithmetic and logic unit (ALU) includes two independent instruction queues (18, 19) for receiving instructions awaiting execution, the first instruction queue (18) receiving logic and arithmetic instructions, and the second instruction queue (19) receiving the fields (f) of instructions supplied to the memory access unit (MEMU) that identify the registers of the arithmetic and logic unit (ALU) involved in read or write operations.

3. The processor according to claim 1, wherein the arithmetic and logic unit (ALU) includes a store data queue (STDQ) in which each datum to be written into the memory (16) waits until a write address is present in the write address queue (STAQ).

4. The processor according to claim 1 or 3, wherein the arithmetic and logic unit (ALU) includes a load data queue (LDDQ) into which each datum from the memory (16) intended for the arithmetic and logic unit is written, and in which it waits until the arithmetic and logic unit is available.

5. The processor according to claim 1, including a branch unit (BRU) that receives instructions and distributes them in parallel among itself, the memory access unit (MEMU) and the arithmetic and logic unit (ALU).

6. The processor according to claim 5, wherein each of the units includes a store data queue (STDQ) in which each datum to be written to the memory waits until a write address is present in the write address queue (STAQ).

7. The processor according to claim 5 or 6, wherein each of the units includes a load data queue (LDDQ) into which each datum from the memory intended for that unit is written, and in which it waits until the unit is available.