JP4020500B2

JP4020500B2 - Memory access instruction reduction device and recording medium

Info

Publication number: JP4020500B2
Application number: JP18873498A
Authority: JP
Inventors: 政人森島
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1998-07-03
Filing date: 1998-07-03
Publication date: 2007-12-12
Anticipated expiration: 2018-07-03
Also published as: JP2000020318A

Description

【０００１】
【発明の属する技術分野】
本発明は、ソースプログラムを解析してメモリアクセス命令を削減するメモリアクセス命令削減装置および記録媒体に関するものである。
【０００２】
【従来の技術】
従来、コンピュータにおいて、演算、例えば加算
Ａ（１）＝Ｂ（１）＋Ｂ（２）（１）
を実行する場合、
ＬＷＧＲ２，Ｂ
ＬＷＧＲ３，Ｂ＋４
ＡＤＤＧＲ４，ＧＲ２，ＧＲ３
ＳＷＡ，ＧＲ４
と展開して実行していた。ここで、ＬＷＧＲ２，Ｂは、メモリのＢ番地のデータをレジスタＧＲ２にロードするという命令である。ＬＷＧＲ３，Ｂ＋４は、メモリの（Ｂ＋４）番地のデータをレジスタＧＲ３にロードするという命令である。ＡＤＤＧＲ４，ＧＲ２，ＧＲ３は、レジスタＧＲ２の内容とレジスタＧＲ３の内容を加算してその結果をレジスタＧＲ４に代入するという命令である。ＳＷＡ，ＧＲ４は、レジスタＧＲ４の内容（加算結果）をメモリのＡ番地にストアするという命令である。以上のように、式（１）を実行するのに、４回の実行回数（マシンサイクル）が必要であった。
【０００３】
また、ＤＯループである
ＤＯＩ＝１，Ｎ，４（２）
Ａ（Ｉ）＝Ｂ（Ｉ）＋Ｃ（Ｉ）
Ａ（Ｉ＋１）＝Ｂ（Ｉ＋１）＋Ｃ（Ｉ＋１）
Ａ（Ｉ＋２）＝Ｂ（Ｉ＋２）＋Ｃ（Ｉ＋２）
Ａ（Ｉ＋３）＝Ｂ（Ｉ＋３）＋Ｃ（Ｉ＋３）
ＥＮＤＯ
の場合には、下記のようにして実行していた。
【０００４】
LOAD T.1 , (B,BASE=B,INDEX=I,DISP=0 )
LOAD T.2 , (B,BASE=B,INDEX=I,DISP=8 )
LOAD T.3 , (B,BASE=B,INDEX=I,DISP=16)
LOAD T.4 , (B,BASE=B,INDEX=I,DISP=24)
LOAD T.5 , (C,BASE=C,INDEX=I,DISP=0 )
LOAD T.6 , (C,BASE=C,INDEX=I,DISP=8 )
LOAD T.7 , (C,BASE=C,INDEX=I,DISP=16)
LOAD T.8 , (C,BASE=C,INDEX=I,DISP=24)
ADD T.9 , T.1, T.5
ADD T.10, T.2, T.6
ADD T.11, T.3, T.7
ADD T.12, T.4, T.8
STORE (A,BASE=A,INDEX=I,DISP=0 ),T.9
STORE (A,BASE=A,INDEX=I,DISP=8 ),T.10
STORE (A,BASE=A,INDEX=I,DISP=16),T.11
STORE (A,BASE=A,INDEX=I,DISP=24),T.12
この場合には、
・ロード命令が８回
・加算命令が４回
・ストア命令が４回
の合計１６回の実行回数（マシンサイクル）が必要であった。
【０００５】
【発明が解決しようとする課題】
上述した式（１）、式（２）を実行する場合にメモリアクセスに要する実行回数が多く高速に実行し得ないという問題があった。このため、複数のメモリアクセス命令をまとめて高速実行することが望まれている。
【０００６】
本発明は、これらの問題を解決するため、ソースプログラムを解析してメモリアクセスについて所定条件を満たす隣接あるいは近傍にある複数のメモリアクセス命令をまとめて連続実行メモリアクセス命令で置換し、アクセス回数を削減してメモリアクセスの高速化を図ることを目的としている。
【０００７】
【課題を解決するための手段】
図１を参照して課題を解決するための手段を説明する。
図１において、ソースプログラム解析手段３は、ソースプログラム１を形態素および構文解析などを行うものである。
【０００８】
最適化・レジスタ割付け手段４は、ソースプログラムを解析した結果をもとにレジスタ割り付けを最適化するものであって、当該レジスタ割り付けの最適時に本発明に係るメモリ命令を削減する命令削減手段５などから構成されるものである。
【０００９】
命令削減手段５は、レジスタ割り付けを最適化する際に、隣接あるいは近傍の連続したメモリアクセス命令をまとめて実行できる条件を満たしているか判別したり、条件を満たしていると判別したときに隣接あるいは近傍の連続したメモリアクセス命令を連続実行メモリアクセル命令（例えばペアメモリアクセス命令）に置換したりなどするものである。
【００１０】
尚、最適化・レジスタ割付け手段４は、最適化は主に中間テキストを削減し、レジスタ割付けはその名の通りレジスタ割付けを行い、実際には最適化した後にレジスタ割付けを行っています。本発明では主に最適化の部分で行い、ＭＯＶを利用した分解回避のときはレジスタ割付けで行っています。以下同様です。
【００１１】
次に、動作を説明する。
ソースプログラム１を解析してあるいはレジスタ割り付けの最適化時に、命令削減手段５が検出した隣接あるいは近傍の連続したメモリアクセス命令をまとめて実行できる条件を満たしているか判別し、条件を満たしていると判別されたときに隣接あるいは近傍の連続したメモリアクセス命令を連続実行メモリアクセス命令（例えばペアメモリアクセス命令）に置換し、置換後のオブジェクトプログラムを生成させるようにしている。
【００１２】
この際、検出した隣接あるいは近傍の連続したメモリアクセス命令を連続実行メモリアクセス命令（例えばペアメモリアクセス命令）に置換する場合に、メモリアクセス命令の実行順番が逆のときにＭＯＶ命令を使用してメモリアクセス命令の実行順番を入れ換えた後に連続実行メモリアクセス命令（例えばペアメモリアクセス命令）で置換するようにしている。
【００１３】
また、近傍の連続したメモリアクセス命令をまとめて実行できる条件として、メモリアクセス命令と次のメモリアクセス命令との間の命令数が予め設定した所定数以下とし、レジスタが不当に長く専有されることを防止するようにしている。
【００１４】
また、条件として、メモリアクセス命令の変数の型、基底、インデックス、基底からのオフセットが等しい、および連続するメモリアクセス命令間にデータ依存関係のある命令が存在しないとするようにしている。
【００１５】
従って、ソースプログラム１を解析してあるいはレジスタ割り付けの最適化時に、メモリアクセスについて所定条件を満たす隣接あるいは近傍にある複数のメモリアクセス命令をまとめて連続実行メモリアクセス命令（例えばペアメモリアクセル命令）で置換することにより、実行回数を削減してメモリアクセスの高速化を図ることが可能となる。
【００１６】
【発明の実施の形態】
次に、図１から図９を用いて本発明の実施の形態および動作を順次詳細に説明する。
【００１７】
図１は、本発明のシステム構成図を示す。
図１において、ソースプログラム１は、メモリアクセス命令を削減する対象のソースプログラムである。
【００１８】
コンパイラ２は、ソースプログラム１を読み込んで解析し、最適化などを行い、実行可能形式のオブジェクトプログラム７を生成するものであって、ここでは、ソースプログラム解析手段３、最適化・レジスタ割付け手段４およびスケジューリング・コード生成手段６などから構成されるものである。
【００１９】
ソースプログラム解析手段３は、ソースプログラム１を形態素および構文解析して内部で処理に適した中間テキスト（中間言語）を生成するものである。
最適化・レジスタ割付け手段４は、ソースプログラム１を解析した結果をもとにレジスタ割り付けの最適化を図るものであって、当該レジスタ割り付けの最適時に本発明に係るメモリ命令を削減する命令削減手段５などから構成されるものである。
【００２０】
命令削減手段５は、レジスタ割り付けを最適化する際に、隣接あるいは近傍の連続したメモリアクセス命令をまとめて実行できる条件を満たしているか判別したり、条件を満たしていると判別したときに隣接あるいは近傍の連続したメモリアクセス命令をペアメモリアクセス命令に置換したりなどするものである。
【００２１】
スケージュリング・コード生成手段６は、最適化およびレジスタ割付け後にスケージュリングし、実行可能形式のコードを生成してオブジェクトプログラム７を出力するものである。
【００２２】
オブジェクトプログラム７は、実行可能形式のプログラムである。
次に、図２のフローチャートに示す順序に従い、図１の構成の本発明に係る命令削減手段５の動作を詳細に説明する。
【００２３】
図２は、本発明の動作説明フローチャート（その１）を示す。
図２において、Ｓ１は、処理単位情報を取り出す。これは、ソースプログラム１を解析した結果をもとに、まとまりのある処理単位、例えば後述する図４の（ｂ）のようにあるまとまりのある処理単位（図４の（ｂ）では、図４の（ａ）に記載したＡ（１）＝Ｂ（１）＋Ｂ（２）という加算を行う処理単位）の情報を取り出す。
【００２４】
Ｓ２は、処理単位情報から中間テキストを取り出す。これは、図４の（ｂ）から１行づつ中間テキストを取り出す。
Ｓ３は、ＬＯＡＤテキストか判別する。これは、Ｓ２で図４の（ｂ）から１行つづ取り出した中間テキストがＬＯＡＤ命令（メモリからレジスタにデータをロードする命令）か判別する。ＹＥＳの場合には、ＬＯＡＤテキストと判明したので、Ｓ４でＬＯＡＤテキストをテーブルに登録し、Ｓ７に進む。ＮＯの場合には、ＬＯＡＤテキストでないと判明したので、Ｓ５に進む。
【００２５】
Ｓ５は、ＳＴＯＲＥテキストか判別する。これは、Ｓ２で図４の（ｂ）から１行つづ取り出した中間テキストがＳＴＯＲＥ命令（レジスタからメモリにデータをストアする命令）か判別する。ＹＥＳの場合には、ＳＴＯＲＥテキストと判明したので、Ｓ６でＳＴＯＲＥテキストをテーブルに登録し、Ｓ７に進む。ＮＯの場合には、ＳＴＯＲＥテキストでないと判明したので、Ｓ７に進む。
【００２６】
Ｓ７は、次の中間テキストがあるか判別する。ＹＥＳの場合には、その中間テキストを取り出してＳ３に戻り繰り返す。ＮＯの場合には、Ｓ８に進む。
Ｓ８は、次の処理単位情報があるか判別する。ＹＥＳの場合には、その処理単位情報を取り出してＳ２に戻り繰り返す。ＮＯの場合には、Ｓ９で登録されたテーブル情報をもとに図３を呼び出す。
【００２７】
以上によって、ソースプログラム１を解析した処理単位情報からメモリアクセス命令（ＬＯＡＤテキストおよびＳＴＯＲＥテキスト）がテーブルに登録（行情報などと共に登録）されたこととなる。
【００２８】
図３は、本発明の動作説明フローチャート（その２）を示す。これは、図２のＳ９から呼び出される処理である。
図３において、Ｓ１１は、テーブルからテキストＡを取り出す。これは、テーブルの先頭から１つのテキストＡを取り出す。
【００２９】
Ｓ１２は、テーブルからテキストＢを取り出す、これは、テーブルから次の１つのテキストＢを取り出す。
Ｓ１３は、型が等しいか判別する。これは、Ｓ１１、Ｓ１２で取り出したテキストＡとテキストＢとの両者の型（例えば後述する図６の（ａ）のＲＥＡＬ＊８Ａ（１００），Ｂ（１００）の定義中のＲＥＡＬ（実数浮動小数点型））が等しいか判別する。ＹＥＳの場合には、Ｓ１４に進む。ＮＯの場合には、条件を満たさないので、Ｓ２０に進む。
【００３０】
Ｓ１５は、テキストＡとテキストＢのインデックスが等しいか判別する。これは、例えば後述する図６の（ａ）のテキストＡとテキストＢが
LOAD(R8) T.1,(B,BASE=B,INDEX=NULL,DISP=0)
LOAD(R8) T.2,(B,BASE=B,INDEX=NULL,DISP=8)
の場合に、インデックスがINDEX=NULLと両者が等しいか判別する。ＹＥＳの場合には、Ｓ１６に進む。ＮＯの場合には、条件を満たさないので、Ｓ２０に進む。
【００３１】
Ｓ１６は、テキストＡとテキストＢのディスプレースメントの差の絶対値が型の大きさと等しいか判別する。これは、例えば後述する図６の（ａ）のテキストＡとテキストＢが
LOAD(R8) T.1,(B,BASE=B,INDEX=NULL,DISP=0)
LOAD(R8) T.2,(B,BASE=B,INDEX=NULL,DISP=8)
の場合に、ディスプレースメントＤＩＳＰ＝０とＤＩＳＰ＝８の差の絶対値「８」が、型の大きさ（図６の（ａ）のＲＥＡＬ＊８Ａ（１００），Ｂ（１００）の定義中の「８」（バイト）と等しいか判別する。この例では、ＹＥＳとなり、Ｓ１７に進む。ＮＯの場合には、条件を満たさないので、Ｓ２０に進む。
【００３２】
Ｓ１７は、テキストＡとテキストＢの間にデータ依存関係のあるテキストが存在しないか判別する。これは、テキストＡとテキストＢの間にデキストＡあるいはテキストＢのデータに依存関係があってデータが変わってしまう可能性のある他のテキストがないか判別する。ＹＥＳの場合には、データに依存関係のある他のテキストが両者の間にないと判明したので、Ｓ１８に進む。ＮＯの場合には、条件を満たさないので、Ｓ２０に進む。
【００３３】
Ｓ１８は、テキストＡとテキストＢの距離がＮ以下か判別する。これは、テキストＡとテキストＢとの間にある他のテキストの数がＮ以下か判別（Ｎの間、レジスタが専有されて他に使用できなくなるので、通常はＮ＝１０位に制限して無闇にレジスタが専有されてしまう事態を制限し、当該制限の範囲内か判別）する。ＹＥＳの場合には、Ｓ１９に進む。ＮＯの場合には、条件を満たさないので、Ｓ２０に進む。
【００３４】
Ｓ１９は、テキストＡとテキストＢが全ての条件を満たしたので、テキストＡとテキストＢをペア化（ペアメモリアクセス命令で置換）し、テーブルから削除し、Ｓ２０に進む。ここで、テキストＡとテキストＢをペア化する条件をまとめると下記のようになる。
【００３５】
・型が一致
・インデックスが一致
・ベースアドレス（基底）が一致
・両者のディスプレースメント（ＤＩＳＰ）の差の絶対値が型の大きさと一致
・両者の間にデータ依存関係がない。
【００３６】
・その他
Ｓ２０は、次のテキストＢがあるか判別する。ＹＥＳの場合には、そのテキストＢを取り出し、Ｓ１４以降を繰り返す（これはより、テキストＡを固定に、テーブルから他の全てのテキストをテキストＢとして取り出して条件判定を繰り返すことが可能となる）。ＮＯの場合には、Ｓ２１に進む。
【００３７】
Ｓ２１は、次のテキストＡがあるか判別する。ＹＥＳの場合には、そのテキストＡを取り出し、Ｓ１２以降を繰り返す。ＮＯの場合には、終了する。
以上によって、テキストＡとテキストＢをテーブルから取り出して条件を満たすペアを見つけてペア化（ペアメモリアクセス命令で置換）することが可能となる。以下具体例について詳細に説明する。
【００３８】
図４は、本発明の具体例（その１）を示す。
図４の（ａ）は、ソースプログラム１のイメージを示す。ここで
Ａ（１）＝Ｂ（１）＋Ｂ（２）
は、Ｂ（１）の内容と、Ｂ（２）の内容とを加算してＡ（１）に入れるという演算である。これを、実際に実行するためのテキストに展開すると、図４の（ｂ）に示す下記のようになる。
【００３９】
LW GR2,B /メモリBの内容をレジスタGR2にロード
LW GR3,B+4 /メモリ(B+4)の内容をレジスタGR3にロード
ADD GR4,GR2,GR3 /GR2とGR3の内容を加算して結果をGR4に入れる
SW A,GR4 /GR4の内容をメモリAにストア
これらテキスト中の第１行目のLW GR3,Bと第２行目のLW GR3,B+4とは、既述した図3のフローチャートで説明した条件を満たすので、ペア化し、ペアメモリアクセス命令LWP GR2,Bの１つに置換し、２サイクル必要であってものを１サイクルで実行するようにし、図４の（ｃ）に示す下記のようにする。
【００４０】
LWP GR2,B /メモリBとB+4の内容をレジスタGR2、GR3にロード
ADD GR4,GR2,GR3 /GR2とGR3の内容を加算して結果をGR4に入れる
SW A,GR4 /GR4の内容をメモリAにストア
以上のように、隣接する２つのロード命令が図３で既述した条件を満たすのでペア化し、サイクルを１サイクル削減し、全体として４サイクルから３サイクルに削減できたこととなる。
【００４１】
図５は、本発明の具体例（その２）を示す。これは、いわゆるループアンローリングというループの回転数を展開（ここでは、４つに展開）して、分岐を少なくして最適化（高速化）を図る手法であり、この手法の際に、本願発明を適用したものであり、４つの連続するロードおよびストアをそれぞれペア化したものである。
【００４２】
図５の（ａ）は、ループの例を示す。ここでは、Ａ（Ｉ）＝Ｂ（Ｉ）＋Ｃ（Ｉ）という演算をＩ＝１から１０２４回ループして求めるというものである。
図５の（ｂ）は、ループの回転数を展開、ここでは、４つに展開し、図示の下記のようにしたものである。
【００４３】
ＤＯＩ＝１，１０２４，４
Ａ（Ｉ）＝Ｂ（Ｉ）＋Ｃ（Ｉ）
Ａ（Ｉ＋１）＝Ｂ（Ｉ＋１）＋Ｃ（Ｉ＋１）
Ａ（Ｉ＋２）＝Ｂ（Ｉ＋２）＋Ｃ（Ｉ＋２）
Ａ（Ｉ＋３）＝Ｂ（Ｉ＋３）＋Ｃ（Ｉ＋３）
ＥＮＤＯ
この展開した後の、第１番目と第２番目を１つにペア化し（図４で説明したと同様）、第３番目と第４番目を１つにペア化し、２つのペア命令で実行するようにする。これらの結果、
・Ａのストア４つがペア化され、２つ
・Ｂのロード４つがペア化され、２つ
・Ｃのロード４つがペア化され、２つ
以上により、メモリアクセス命令に関しては、半分になり、高速化を図ることが可能となる。
【００４４】
図６は、本発明の具体例（その３）を示す。
図６の（ａ）は、テキストのイメージ例を示す。ここでは、
REAL*8 A(100),B(100) /実数型浮動少数点8バイト、A、Bは要素が100
・・・
A(1)=B(1)+B(2) /B(1)の内容とB(2)の内容を加算してA(1)に入れる
LOAD(R8) T.1,(B,BASE=B,INDEX=NULL,DISP=0)
LOAD(R8) T.2,(B,BASE=B,INDEX=NULL,DISP=8)
ADD(R8) T.3,T.1,T.2
STORE(R8) (A,BASE=A,DISP=0),T.3
ここで、LOAD(R8) T.1,(B,BASE=B,INDEX=NULL,DISP=0)は、メモリ上の(B,BASE=B,INDEX=NULL,DISP=0)のデータを、レジスタT.1にロードする。ここで、BASEは基底であり(ここではBアドレス)、INDEXはインデックスレジスタで示す内容(ここはNULLをBASEに加算))であり、DISPはBASE=Bに加算するディスプレイスメント(オフセット)である。以下同様である。ここで、第１番目のロード命令と第２番目のロード命令が既述した図３の条件を満たすので、これら２つのロード命令をペア化し、図６の（ｂ）の矢印で示すペアメモリアクセス命令に置換する。
【００４５】
図６の（ｂ）は、２つのロード命令をペア化した後のテキストのイメージを示す。ここでは、第１行目の
LOAD(R8) (T.1,T.2),(B,BASE=B,INDEX=NULL,DISP=0)
がペアメモリアクセス命令であって、1サイクルでメモリ上の基底から0および8のアドレスからデータを、レジスタT.1とT.2にそれぞれストアする。これにより、２つのロード命令を１つのペア化した命令に置換し、メモリアクセスを半分にすることが可能となる。
【００４６】
図７は、本発明の具体例（その４）を示す。
図７の（ａ）は、テキストのイメージの例を示す。ここで、第２行目の
REAL*8 A(100),B(100)
が変数Ａ、Ｂが要素が１００で、浮動小数点型で８バイトの定義を表す。下の４行がＬＯＡＤ命令の間にＳＴＯＲＥ命令が入った場合を示し、これを矢印で示すよう２つのＬＯＡＤ命令をペア化したものが、図７の（ｂ）である。
【００４７】
図７の（ｂ）は、ペア化後のテキストのイメージの例を示す。これは、図７の（ａ）の２つの離れたＬＯＡＤ命令を１つにペア化、かつ２つの離れたＳＴＯＲＥ命令を１つにペア化したものである。
【００４８】
図７の（ｃ）は、図７の（ｂ）のペア化した後、ＬＯＡＤ命令を優先してレジスタ割付けした後のテキストのイメージの例を示す。これは、図７の（ｅ）に示すように最初のＬＯＡＤ命令はペア命令でメモリＢ（１）、Ｂ（２）からレジスタＮ番目、（Ｎ＋１）番目（連続した２つのレジスタを表す）にロードするが、ストアが交差するために分解して元の２つのストア命令で、レジスタＮ番目、（Ｎ＋１）番目からメモリＡ（２）、Ａ（１）にストアする。このため、ペア化してもレジスタ割付け時に、１つのペア化したストア命令を分解して元のストア命令を２つ使用せざるを得ない。
【００４９】
図７の（ｄ）は、図７の（ｂ）のペア化した後、レジスタが余裕がある場合のレジスタ割付けした後のテキストのイメージの例を示す。これは、図７の（ｆ）に示すように最初のＬＯＡＤ命令はペア命令でメモリＢ（１）、Ｂ（２）からレジスタＮ番目、（Ｎ＋１）番目（連続した２つのレジスタを表す）にロードするが、ストアが交差するために、ここでは、ＭＯＶ命令を使用してレジスタ（Ｎ＋３）（連続する３つ目のレジスタを表す）に移動した後、ペア化したストア命令でレジスタ（Ｎ＋１）、（Ｎ＋２）からメモリＡ（１）、Ａ（２）にストアする。この場合には、ペア化した命令を分解することなく、ＭＯＶ命令を１つ追加するのみで対処できる。
【００５０】
図７の（ｅ）は、図７の（ｃ）のＬＯＡＤとＳＴＯＲＥの様子の模式図を示す。相互の関係は、矢印で示した通りである。
図７の（ｆ）は、図７の（ｅ）のＬＯＡＤとＳＴＯＲＥの間に挿入したＭＯＶの様子の模式図を示す。相互の関係は、矢印で示した通りである。
【００５１】
以上のように、論理的には図７の（ｂ）のようにペア化できるが、ハードウェアの制限でできない場合（ペア化したロード命令とストア命令の要素がそれぞれ交差しているため、片方を優先すると他方を分解せざるを得ない場合）、空レジスタがある場合には、ＬＯＡＤとＳＴＯＲＥの間にＭＯＶを挿入（レジスタ間転送）することで、ペア化した命令の分解を防ぐことができる。また、一般には、ＳＴＯＲＥより、ＭＯＶの方が実際の実行速度が大幅に速く、実行速度を向上できる。
【００５２】
図８は、本発明の具体例（その５）を示す。これは、初期化を行うイメージであって、ループアンローリングした後のイメージであり、ストア対象が同一要素（Ｔ．１）であると、そのままではペア化してもレジスタ割付け時に分解して元に戻さざるを得ないが、ＭＯＶ命令でストア対象が同一要素でなく（Ｔ．１、Ｔ．２）にすることでペア化した後のレジスタ割付けで分解しなくてもよいようにした例である。
【００５３】
図８の（ａ）は、ループアンローリングにより４つに展開したソースプログラムのイメージの例を示す。
図８の（ｂ）は、図８の（ａ）の展開後のテキストのイメージの例を示す。
【００５４】
図８の（ｃ）は、図８の（ｂ）のＳＴＯＲＥを２つをそれぞれペア化した後のテキストのイメージの例を示す。ここでは、図８の（ｂ）の２つのＳＴＯＲＥ命令を矢印で示すように、ペア化したＳＴＯＲＥ命令に置換している。
【００５５】
図８の（ｄ）は、図８の（ｃ）でペア化してもレジスタ割付け時にペア化したＳＴＯＲＥを分解しないでもよいように、ＭＯＶ命令でレジスタＴ．１、Ｔ．２を使用したテキストのイメージ例を示す。この場合には、ＭＯＶＥＴ．２，１．０を追加し、２つのレジスタＴ．１、Ｔ．２を使用してレジスタ割付け時にＳＴＯＲＥ命令が同一要素のレジスタを使わなくてもよいようにした例を示し、図中の矢印で示すようにペア化されたこととなる。
【００５６】
以上のように、２つのＳＴＯＲＥ命令の複数をペア化してレジスタ割付け時に同一要素のレジスタを使う場合には、ＭＯＶ命令を挿入して同一要素のレジスタを使わないようにして、レジスタ割付け時のペア化した命令の分解を無くすことが可能となる。
【００５７】
図９は、本発明の効果説明図を示す。
図９の（ａ）は、ソースプログラムのイメージの例を示す。ここでは、ループアンローリングによって４つに展開されている。
【００５８】
図９の（ｂ）は、ＬＯＡＤ／ＡＤＤ／ＳＴＯＲＥ命令の実行数を、それぞれ１τとした場合のペア化した場合と、ペア化しない場合の総実行数を示す。
・ペア化しない場合には、１６τとなる。
【００５９】
・ペア化した場合には、１０τとなる。
尚、τとはマシンサイクルのことで、例えば１００ＭＨｚで動作するＣＰＵであれば、１０ｎｓ（ナノセカンド）が１τとなる。
【００６０】
従って、１６／１０＝１．６倍、本発明のペア化によって、図９の（ａ）のループアンローリングした後のソースプログラムは高速実行可能なオブジェクトプログラムとなる。以下詳細に説明する。
【００６１】
図９の（ｃ）は、ペア化しない場合（従来）のテキストのイメージ例を示す。これは、図９の（ａ）のループアンローリングにより４つに展開したソースプログラムを実行する場合のテキストのイメージを示す。ここでは、図９の（ａ）の該当ソースプログラムから矢印でテキストの関係を示す。このペア化しない場合（従来）では、
・ＬＯＡＤ命令が４つ
・ＬＯＡＤ命令が４つ
・ＡＤＤ命令が４つ
・ＳＴＯＲＥ命令が４つ
の合計１６個（１６τ）となる。
【００６２】
図９の（ｄ）は、ペア化した場合（本発明）のテキストのイメージ例を示す。これは、図９の（ａ）のループアンローリングにより４つに展開したソースプログラムを実行する場合のテキストのイメージを示す。ここでは、図９の（ｂ）の従来のテキスト中の２つの連続するＬＯＡＤ命令を１つのペア化したＬＯＡＤ命令、および２つの連続するＳＴＯＲＥ命令を１つのペア化したＳＴＯＲＥ命令に矢印を用いて示すように置換したものである。このペア化した場合（本願発明）では、
・ＬＯＡＤ命令が２つ
・ＬＯＡＤ命令が２つ
・ＡＤＤ命令が４つ
・ＳＴＯＲＥ命令が２つ
の合計１０個（１０τ）となる。これにより、図９の（ａ）で上述したように、従来のペア化しない場合は１６τ、本発明のペア化した場合は１０τとなり、１６／１０＝１．６倍、本願発明のペア化した場合が速く実行できるオブジェクトプログラムを生成することが可能となる。
【００６３】
【発明の効果】
以上説明したように、本発明によれば、ソースプログラム１を解析してあるいはレジスタ割り付けの最適化時に、メモリアクセスについて所定条件を満たす隣接あるいは近傍にある複数のメモリアクセス命令をまとめて連続実行メモリアクセス命令（例えばペアメモリアクセル命令）で置換する構成を採用しているため、実行回数を削減してメモリアクセスの高速化を図るオブジェクトプログラムを生成することができ、実行性能を向上させることが可能となる。
【図面の簡単な説明】
【図１】本発明のシステム構成図である。
【図２】本発明の動作説明フローチャート（その１）である。
【図３】本発明の動作説明フローチャート（その２）である。
【図４】本発明の具体例（その１）である。
【図５】本発明の具体例（その２）である。
【図６】本発明の具体例（その３）である。
【図７】本発明の具体例（その４）である。
【図８】本発明の具体例（その５）である。
【図９】本発明の効果説明図である。
【符号の説明】
１：ソースプログラム
２：コンパイラ
３：ソースプログラム解析手段
４：最適化・レジスタ割付け手段
５：命令削減手段
６：スケジューリング・コード生成手段
７：オブジェクトプログラム
[0001]
[Technical field of the invention]
The present invention relates to a memory access instruction reduction device and a recording medium that analyze a source program and reduce memory access instructions.
[0002]
[Prior art]
Conventionally, when a computer executes an operation, for example the addition
A (1) = B (1) + B (2)    (1)
the operation is expanded into the following instructions and executed:
LW GR2, B
LW GR3, B + 4
ADD GR4, GR2, GR3
SW A, GR4
Here, LW GR2, B is an instruction that loads the data at address B of the memory into the register GR2. LW GR3, B + 4 is an instruction that loads the data at address (B + 4) of the memory into the register GR3. ADD GR4, GR2, GR3 is an instruction that adds the contents of the register GR2 and the contents of the register GR3 and assigns the result to the register GR4. SW A, GR4 is an instruction that stores the contents of the register GR4 (the addition result) at address A of the memory. As described above, four executions (machine cycles) are required to execute expression (1).
[0003]
Further, the DO loop
DO I = 1, N, 4    (2)
A (I) = B (I) + C (I)
A (I + 1) = B (I + 1) + C (I + 1)
A (I + 2) = B (I + 2) + C (I + 2)
A (I + 3) = B (I + 3) + C (I + 3)
ENDO
was executed as follows.
[0004]
LOAD T.1, (B, BASE = B, INDEX = I, DISP = 0)
LOAD T.2, (B, BASE = B, INDEX = I, DISP = 8)
LOAD T.3, (B, BASE = B, INDEX = I, DISP = 16)
LOAD T.4, (B, BASE = B, INDEX = I, DISP = 24)
LOAD T.5, (C, BASE = C, INDEX = I, DISP = 0)
LOAD T.6, (C, BASE = C, INDEX = I, DISP = 8)
LOAD T.7, (C, BASE = C, INDEX = I, DISP = 16)
LOAD T.8, (C, BASE = C, INDEX = I, DISP = 24)
ADD T.9, T.1, T.5
ADD T.10, T.2, T.6
ADD T.11, T.3, T.7
ADD T.12, T.4, T.8
STORE (A, BASE = A, INDEX = I, DISP = 0), T.9
STORE (A, BASE = A, INDEX = I, DISP = 8), T.10
STORE (A, BASE = A, INDEX = I, DISP = 16), T.11
STORE (A, BASE = A, INDEX = I, DISP = 24), T.12
In this case, a total of 16 executions (machine cycles) were required: 8 load instructions, 4 add instructions, and 4 store instructions.
[0005]
[Problems to be solved by the invention]
When expressions (1) and (2) above are executed, the number of executions required for memory access is large, so they cannot be executed at high speed. For this reason, it is desirable to combine a plurality of memory access instructions and execute them at high speed.
[0006]
In order to solve these problems, the present invention analyzes a source program and replaces a plurality of adjacent or nearby memory access instructions that satisfy predetermined conditions for memory access with a single continuous execution memory access instruction. The purpose is to reduce the number of accesses and thereby speed up memory access.
[0007]
[Means for Solving the Problems]
Means for solving the problem will be described with reference to FIG.
In FIG. 1, the source program analysis means 3 performs lexical (morpheme) analysis, syntax analysis, and the like on the source program 1.
[0008]
The optimization / register allocation means 4 optimizes register allocation based on the result of analyzing the source program. It comprises, among other things, the instruction reduction means 5, which reduces memory instructions according to the present invention when register allocation is optimized.
[0009]
When register allocation is optimized, the instruction reduction means 5 determines whether adjacent or nearby consecutive memory access instructions satisfy the conditions under which they can be executed together and, when it determines that the conditions are satisfied, replaces those instructions with a continuous execution memory access instruction (for example, a pair memory access instruction).
[0010]
In the optimization / register allocation means 4, optimization mainly reduces the intermediate text, and register allocation, as the name suggests, allocates registers; in practice, register allocation is performed after optimization. In the present invention, the processing is performed mainly in the optimization part, while the avoidance of decomposition using MOV is performed in register allocation. The same applies below.
[0011]
Next, the operation will be described.
When the source program 1 is analyzed, or when register allocation is optimized, the instruction reduction means 5 determines whether the adjacent or nearby consecutive memory access instructions it has detected satisfy the conditions under which they can be executed together. When the conditions are satisfied, those instructions are replaced with a continuous execution memory access instruction (for example, a pair memory access instruction), and the object program after replacement is generated.
[0012]
At this time, when the detected adjacent or nearby consecutive memory access instructions are to be replaced with a continuous execution memory access instruction (for example, a pair memory access instruction) and the execution order of the memory access instructions is reversed, a MOV instruction is used to swap the execution order, after which the instructions are replaced with the continuous execution memory access instruction.
[0013]
In addition, as a condition for executing nearby consecutive memory access instructions together, the number of instructions between one memory access instruction and the next must be no more than a preset number. This prevents registers from being occupied for an unduly long time.
[0014]
In addition, the conditions are that the variable type, base, index, and offset from the base of the memory access instructions are equal, and that no instruction with a data dependency exists between the consecutive memory access instructions.
[0015]
Therefore, when the source program 1 is analyzed, or when register allocation is optimized, a plurality of adjacent or nearby memory access instructions that satisfy predetermined conditions for memory access are replaced together with a continuous execution memory access instruction (for example, a pair memory access instruction). This reduces the number of executions and speeds up memory access.
[0016]
DETAILED DESCRIPTION OF THE INVENTION
Next, embodiments and operations of the present invention will be described in detail, in order, with reference to FIGS. 1 to 9.
[0017]
FIG. 1 shows a system configuration diagram of the present invention.
In FIG. 1, a source program 1 is a source program that is a target for reducing memory access instructions.
[0018]
The compiler 2 reads and analyzes the source program 1, performs optimization and the like, and generates an executable object program 7. Here it comprises the source program analysis means 3, the optimization / register allocation means 4, the scheduling / code generation means 6, and the like.
[0019]
The source program analysis means 3 performs lexical (morpheme) and syntax analysis on the source program 1 and generates an intermediate text (intermediate language) suitable for internal processing.
The optimization / register allocation means 4 optimizes register allocation based on the result of analyzing the source program 1. It comprises, among other things, the instruction reduction means 5, which reduces memory instructions according to the present invention when register allocation is optimized.
[0020]
When register allocation is optimized, the instruction reduction means 5 determines whether adjacent or nearby consecutive memory access instructions satisfy the conditions under which they can be executed together and, when it determines that the conditions are satisfied, replaces those instructions with a pair memory access instruction.
[0021]
The scheduling code generation means 6 performs scheduling after optimization and register allocation, generates an executable code, and outputs an object program 7.
[0022]
The object program 7 is an executable program.
Next, the operation of the instruction reducing unit 5 according to the present invention having the configuration shown in FIG. 1 will be described in detail in the order shown in the flowchart of FIG.
[0023]
FIG. 2 is a flowchart for explaining the operation of the present invention (part 1).
In FIG. 2, S1 extracts processing unit information. Based on the result of analyzing the source program 1, it extracts the information of a coherent processing unit, for example the processing unit shown in FIG. 4(b) described later (in FIG. 4(b), the processing unit that performs the addition A (1) = B (1) + B (2) shown in FIG. 4(a)).
[0024]
S2 extracts intermediate text from the processing unit information, taking out the intermediate text of FIG. 4(b) one line at a time.
S3 determines whether the text is a LOAD text, that is, whether the intermediate text taken out line by line from FIG. 4(b) in S2 is a LOAD instruction (an instruction that loads data from memory into a register). If YES, the text is a LOAD text, so it is registered in the table in S4 and the process proceeds to S7. If NO, the text is not a LOAD text, so the process proceeds to S5.
[0025]
S5 determines whether the text is a STORE text, that is, whether the intermediate text taken out line by line from FIG. 4(b) in S2 is a STORE instruction (an instruction that stores data from a register into memory). If YES, the text is a STORE text, so it is registered in the table in S6 and the process proceeds to S7. If NO, the text is not a STORE text, so the process proceeds to S7.
[0026]
S7 determines whether there is a next intermediate text. If YES, the intermediate text is taken out and the process returns to S3 and is repeated. If no, the process proceeds to S8.
S8 determines whether there is next processing unit information. If YES, that processing unit information is taken out and the process returns to S2 and repeats. If NO, S9 calls the process of FIG. 3 based on the registered table information.
[0027]
As described above, the memory access instruction (LOAD text and STORE text) is registered in the table (registered together with the line information) from the processing unit information obtained by analyzing the source program 1.
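The table-building pass of FIG. 2 can be sketched as follows. This is only an illustrative sketch: the `Text` record, the function name `collect_memory_accesses`, and the representation of a processing unit as a list of strings are assumptions made here, not part of the patent.

```python
from dataclasses import dataclass


@dataclass
class Text:
    line: int      # position in the intermediate text (the "line information")
    op: str        # "LOAD" or "STORE"
    operands: str  # rest of the line, left unparsed in this sketch


def collect_memory_accesses(units):
    """Scan every processing unit (S1/S8) and every intermediate text in it
    (S2/S7); register LOAD texts (S3/S4) and STORE texts (S5/S6) in a table."""
    table = []
    line = 0
    for unit in units:
        for raw in unit:
            line += 1
            op, _, operands = raw.strip().partition(" ")
            op = op.split("(")[0]  # drop a type suffix such as "(R8)"
            if op in ("LOAD", "STORE"):
                table.append(Text(line, op, operands.strip()))
    return table


# Hypothetical processing unit, modeled on the text image of FIG. 6(a).
unit = [
    "LOAD(R8) T.1,(B,BASE=B,INDEX=NULL,DISP=0)",
    "LOAD(R8) T.2,(B,BASE=B,INDEX=NULL,DISP=8)",
    "ADD(R8) T.3,T.1,T.2",
    "STORE(R8) (A,BASE=A,DISP=0),T.3",
]
table = collect_memory_accesses([unit])
```

The ADD text is skipped, so the table holds the two LOAD texts and the STORE text together with their line positions, which the later distance check relies on.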
[0028]
FIG. 3 shows a flowchart (part 2) for explaining the operation of the present invention. This is a process called from S9 of FIG.
In FIG. 3, S11 takes out the text A from the table. This retrieves one text A from the top of the table.
[0029]
S12 retrieves text B from the table, which retrieves the next single text B from the table.
S13 determines whether the types are equal. It determines whether the types of text A and text B taken out in S11 and S12 (for example, REAL (real floating-point type) in the definition REAL * 8 A (100), B (100) in FIG. 6(a) described later) are equal. If YES, the process proceeds to S14. If NO, the condition is not satisfied, so the process proceeds to S20.
[0030]
S15 determines whether the indexes of text A and text B are equal. For example, when text A and text B of FIG. 6(a) described later are
LOAD (R8) T.1, (B, BASE = B, INDEX = NULL, DISP = 0)
LOAD (R8) T.2, (B, BASE = B, INDEX = NULL, DISP = 8)
it determines whether the two indexes, here both INDEX = NULL, are equal. If YES, the process proceeds to S16. If NO, the condition is not satisfied, so the process proceeds to S20.
[0031]
S16 determines whether the absolute value of the difference between the displacements of text A and text B is equal to the size of the type. For example, when text A and text B of FIG. 6(a) described later are
LOAD (R8) T.1, (B, BASE = B, INDEX = NULL, DISP = 0)
LOAD (R8) T.2, (B, BASE = B, INDEX = NULL, DISP = 8)
it determines whether the absolute value "8" of the difference between the displacements DISP = 0 and DISP = 8 is equal to the size of the type ("8" (bytes) in the definition REAL * 8 A (100), B (100) in FIG. 6(a)). In this example the result is YES, and the process proceeds to S17. If NO, the condition is not satisfied, so the process proceeds to S20.
[0032]
S17 determines whether no text with a data dependency exists between text A and text B. It checks that there is no other text between text A and text B that has a dependency on the data of text A or text B and could change that data. If YES, no other text with a data dependency lies between the two, so the process proceeds to S18. If NO, the condition is not satisfied, so the process proceeds to S20.
[0033]
S18 determines whether the distance between text A and text B is N or less. It determines whether the number of other texts between text A and text B is N or less (a register is occupied and cannot be used for anything else over this span, so N is normally limited to about 10 to keep registers from being occupied needlessly, and S18 checks whether the distance is within that limit). If YES, the process proceeds to S19. If NO, the condition is not satisfied, so the process proceeds to S20.
[0034]
In S19, since the text A and the text B satisfy all the conditions, the text A and the text B are paired (replaced by a pair memory access instruction), deleted from the table, and the process proceeds to S20. Here, the conditions for pairing the text A and the text B are summarized as follows.
[0035]
- The types match
- The indexes match
- The base addresses (bases) match
- The absolute value of the difference between the two displacements (DISP) matches the size of the type
- There is no data dependency between the two
[0036]
- Others
S20 determines whether there is a next text B. If YES, that text B is taken out and S14 and the subsequent steps are repeated (this makes it possible to keep text A fixed, take out every other text in the table as text B, and repeat the condition check). If NO, the process proceeds to S21.
[0037]
In S21, it is determined whether or not there is the next text A. In the case of YES, the text A is taken out and S12 and subsequent steps are repeated. If NO, the process ends.
As described above, it is possible to take out the text A and the text B from the table, find a pair satisfying the condition, and make a pair (replace with the pair memory access instruction). Specific examples will be described in detail below.
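The pairing search of FIG. 3 can be sketched roughly as below. The `Access` record, the function names, and the `has_dependency` callback are hypothetical names chosen for illustration. The data dependency check of S17 is left as a caller-supplied stub because the text does not spell out how it is computed, and requiring both texts to be the same kind of access (LOAD with LOAD, STORE with STORE) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Access:
    line: int    # position in the intermediate text
    op: str      # "LOAD" or "STORE"
    type: str    # e.g. "R8"
    size: int    # size of the type in bytes
    base: str
    index: str
    disp: int
    reg: str


def can_pair(a, b, has_dependency, max_distance=10):
    """Pairing conditions for text A and text B as summarized for FIG. 3."""
    return (
        a.op == b.op                            # assumption: same kind of access
        and a.type == b.type                    # S13: types match
        and a.base == b.base                    # bases match
        and a.index == b.index                  # S15: indexes match
        and abs(a.disp - b.disp) == a.size      # S16: |DISP difference| == type size
        and not has_dependency(a, b)            # S17: no dependent text in between
        and abs(b.line - a.line) - 1 <= max_distance  # S18: at most N texts between
    )


def pair_accesses(table, has_dependency=lambda a, b: False):
    """Fix text A, try every later entry as text B, and remove each pair
    that satisfies the conditions from the table (S19)."""
    remaining = list(table)
    pairs = []
    i = 0
    while i < len(remaining):
        a = remaining[i]
        partner = next(
            (b for b in remaining[i + 1:] if can_pair(a, b, has_dependency)), None
        )
        if partner is None:
            i += 1
        else:
            pairs.append((a, partner))
            remaining.remove(a)
            remaining.remove(partner)
    return pairs, remaining


table = [
    Access(1, "LOAD", "R8", 8, "B", "NULL", 0, "T.1"),
    Access(2, "LOAD", "R8", 8, "B", "NULL", 8, "T.2"),
    Access(4, "STORE", "R8", 8, "A", "NULL", 0, "T.3"),
]
pairs, leftover = pair_accesses(table)
```

With this table, the two LOAD texts form one pair and the single STORE text is left unpaired.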
[0038]
FIG. 4 shows a specific example (part 1) of the present invention.
FIG. 4(a) shows an image of the source program 1. Here,
A (1) = B (1) + B (2)
is an operation that adds the contents of B (1) and the contents of B (2) and puts the result into A (1). Expanding this into text for actual execution gives the following, shown in FIG. 4(b).
[0039]
LW GR2, B / Load the contents of memory B into register GR2
LW GR3, B + 4 / Load the contents of memory (B + 4) into register GR3
ADD GR4, GR2, GR3 / Add the contents of GR2 and GR3 and put the result in GR4
SW A, GR4 / Store the contents of GR4 in memory A
The first line, LW GR2, B, and the second line, LW GR3, B + 4, of this text satisfy the conditions described with the flowchart of FIG. 3. They are therefore paired and replaced with the single pair memory access instruction LWP GR2, B, so that what required two cycles is executed in one cycle, as shown below in FIG. 4(c).
[0040]
LWP GR2, B / Load the contents of memory B and B + 4 into registers GR2 and GR3
ADD GR4, GR2, GR3 / Add the contents of GR2 and GR3 and put the result in GR4
SW A, GR4 / Store the contents of GR4 in memory A
As described above, the two adjacent load instructions satisfy the conditions described for FIG. 3, so they are paired. This saves one cycle, reducing the total from 4 cycles to 3 cycles.
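As a rough check that the paired form computes the same result with one instruction fewer, the two instruction sequences can be run through a toy interpreter. The interpreter, the tuple encoding of instructions, and the concrete addresses are assumptions for illustration only; LWP is modeled as loading addresses B and B + 4 into a register and the next-numbered register, as the comment in the text describes.

```python
def run(program, memory):
    """Execute a list of (opcode, operands...) tuples and return the
    number of instructions executed (one per machine cycle in the text)."""
    regs = {}
    for op, *args in program:
        if op == "LW":       # LW reg, addr
            reg, addr = args
            regs[reg] = memory[addr]
        elif op == "LWP":    # LWP reg, addr: fills reg and the next register
            reg, addr = args
            nxt = f"GR{int(reg[2:]) + 1}"
            regs[reg], regs[nxt] = memory[addr], memory[addr + 4]
        elif op == "ADD":    # ADD dst, src1, src2
            dst, s1, s2 = args
            regs[dst] = regs[s1] + regs[s2]
        elif op == "SW":     # SW addr, reg
            addr, reg = args
            memory[addr] = regs[reg]
        else:
            raise ValueError(f"unknown opcode {op}")
    return len(program)


A, B = 0, 100  # hypothetical addresses of A(1) and B(1)
before = [("LW", "GR2", B), ("LW", "GR3", B + 4),
          ("ADD", "GR4", "GR2", "GR3"), ("SW", A, "GR4")]
after = [("LWP", "GR2", B),
         ("ADD", "GR4", "GR2", "GR3"), ("SW", A, "GR4")]

mem_before = {A: 0, B: 3, B + 4: 4}
mem_after = dict(mem_before)
n_before = run(before, mem_before)
n_after = run(after, mem_after)
```

Both runs leave the same value at address A, while the instruction count drops from 4 to 3.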
[0041]
FIG. 5 shows a specific example (No. 2) of the present invention. Here the present invention is applied together with so-called loop unrolling, a technique that unrolls the iterations of a loop (here by a factor of four) to reduce branches and thereby optimize (speed up) execution. The four consecutive loads and the four consecutive stores are each paired.
[0042]
FIG. 5(a) shows an example of a loop. Here, the operation A (I) = B (I) + C (I) is computed by looping 1024 times starting from I = 1.
FIG. 5(b) shows the loop unrolled, here by a factor of four, as follows.
[0043]
DO I = 1,1024,4
A (I) = B (I) + C (I)
A (I + 1) = B (I + 1) + C (I + 1)
A (I + 2) = B (I + 2) + C (I + 2)
A (I + 3) = B (I + 3) + C (I + 3)
ENDO
After this unrolling, the first and second statements are paired into one (in the same way as described for FIG. 4), and the third and fourth statements are paired into one, so that they are executed with two pair instructions. As a result:
- The four stores to A are paired into two
- The four loads from B are paired into two
- The four loads from C are paired into two
Consequently, the number of memory access instructions is halved and higher speed can be achieved.
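The counts for the unrolled loop body can be tallied with a few lines of arithmetic. The function name and the assumption that every instruction costs one machine cycle (the counting used in the prior-art discussion) are illustrative simplifications.

```python
def instruction_counts(unroll=4, pair=False):
    """Instructions in one pass of the unrolled body A(I+k) = B(I+k) + C(I+k).
    Assumes an even unroll factor so that every access finds a partner."""
    loads = 2 * unroll   # one load of B and one load of C per statement
    adds = unroll
    stores = unroll
    if pair:
        loads //= 2      # two consecutive loads become one pair load
        stores //= 2     # two consecutive stores become one pair store
    return {"load": loads, "add": adds, "store": stores}


plain = instruction_counts(pair=False)
paired = instruction_counts(pair=True)
memory_plain = plain["load"] + plain["store"]
memory_paired = paired["load"] + paired["store"]
total_plain = sum(plain.values())
total_paired = sum(paired.values())
```

Under these assumptions the memory access instructions drop from 12 to 6, and the whole body drops from 16 to 10 instructions per pass.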
[0044]
FIG. 6 shows a specific example (part 3) of the present invention.
FIG. 6(a) shows an example of a text image. Here,
REAL * 8 A (100), B (100) / 8-byte real floating-point type; A and B each have 100 elements
...
A (1) = B (1) + B (2) / Add the contents of B (1) and B (2) and put the result in A (1)
LOAD (R8) T.1, (B, BASE = B, INDEX = NULL, DISP = 0)
LOAD (R8) T.2, (B, BASE = B, INDEX = NULL, DISP = 8)
ADD (R8) T.3, T.1, T.2
STORE (R8) (A, BASE = A, DISP = 0), T.3
Here, LOAD (R8) T.1, (B, BASE = B, INDEX = NULL, DISP = 0) loads the data at (B, BASE = B, INDEX = NULL, DISP = 0) in memory into register T.1. BASE is the base (here, the address of B), INDEX is the contents indicated by the index register (here NULL, which is added to BASE), and DISP is the displacement (offset) added to BASE = B. The same applies hereinafter. Since the first load instruction and the second load instruction satisfy the conditions of FIG. 3 described above, these two load instructions are paired and replaced with the pair memory access instruction indicated by the arrow in FIG. 6(b).
[0045]
FIG. 6(b) shows an image of the text after the two load instructions are paired. Here, the first line
LOAD (R8) (T.1, T.2), (B, BASE = B, INDEX = NULL, DISP = 0)
is a pair memory access instruction that, in one cycle, loads the data at offsets 0 and 8 from the base in memory into registers T.1 and T.2, respectively. This replaces two load instructions with one paired instruction and halves the number of memory accesses.
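The addressing described here (BASE plus INDEX plus DISP) and the effect of the paired load can be illustrated with a byte buffer standing in for memory. The helper names, the little-endian layout, and the sample values are assumptions made for this sketch.

```python
import struct

ELEM = 8  # REAL*8: each element occupies 8 bytes


def load(memory, base, index, disp):
    """Load one REAL*8 value from address base + index + disp."""
    return struct.unpack_from("<d", memory, base + index + disp)[0]


def load_pair(memory, base, index, disp):
    """Load two consecutive REAL*8 values starting at base + index + disp,
    modeling LOAD(R8) (T.1,T.2),(...) as a single access."""
    return struct.unpack_from("<2d", memory, base + index + disp)


# Hypothetical contents of B(1)..B(4), stored at base address 0.
memory = bytearray(struct.pack("<4d", 1.5, 2.5, 3.5, 4.5))
B, NULL = 0, 0  # base address of B; a NULL index contributes nothing

t1 = load(memory, B, NULL, 0)
t2 = load(memory, B, NULL, ELEM)
pair = load_pair(memory, B, NULL, 0)
```

The two displacements differ by exactly the element size, which is why the single paired access returns the same two values as the two separate loads.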
[0046]
FIG. 7 shows a specific example (No. 4) of the present invention.
FIG. 7(a) shows an example of a text image. Here, the second line
REAL * 8 A (100), B (100)
defines the variables A and B as 8-byte floating-point arrays of 100 elements each. The four lines below it show a case in which a STORE instruction lies between the LOAD instructions. FIG. 7(b) shows the result of pairing the two LOAD instructions as indicated by the arrows.
[0047]
FIG. 7(b) shows an example of a text image after pairing. The two separated LOAD instructions of FIG. 7(a) are paired into one, and the two separated STORE instructions are paired into one.
[0048]
FIG. 7(c) shows an example of a text image after the pairing of FIG. 7(b) followed by register allocation that gives priority to the LOAD instruction. As shown in FIG. 7(e), the first LOAD instruction, a pair instruction, loads from memory B (1), B (2) into the Nth and (N + 1)th registers (two consecutive registers). Because the stores cross, however, the paired store is decomposed back into the two original store instructions, which store from the Nth and (N + 1)th registers to memory A (2) and A (1). Thus, even after pairing, one paired store instruction has to be decomposed at register allocation and the two original store instructions have to be used.
[0049]
FIG. 7(d) shows an example of a text image after the pairing of FIG. 7(b) followed by register allocation when spare registers are available. As shown in FIG. 7(f), the first LOAD instruction, a pair instruction, loads from memory B (1), B (2) into the Nth and (N + 1)th registers (two consecutive registers). Because the stores cross, a MOV instruction is used here to move the value to register (N + 3) (representing the third consecutive register), after which the paired store instruction stores from registers (N + 1), (N + 2) to memory A (1), A (2). In this case the paired instruction does not have to be decomposed; adding a single MOV instruction is enough.
[0050]
FIG. 7E shows a schematic diagram of the state of the LOAD and STORE in FIG. 7C. Their mutual relationship is as shown by the arrows.
FIG. 7F shows a schematic diagram of the state of the MOV inserted between the LOAD and STORE in FIG. 7D. Their mutual relationship is as shown by the arrows.
[0051]
As described above, pairing is logically possible as shown in FIG. 7B, but it may not be realizable because of hardware limitations (the elements of the paired load instruction and the paired store instruction cross each other). In that case, if a free register is available, a MOV (register-to-register transfer) is inserted between the LOAD and the STORE so that the paired instruction does not have to be decomposed. In general, a MOV executes significantly faster than a STORE, so execution speed can be improved.
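The crossing case of FIG. 7 can be illustrated with a small simulation. This is only a sketch under stated assumptions: the instruction names LOAD2 and STORE2, the tuple encoding, the integer addresses, and the function `run` are hypothetical and do not come from the patent. It assumes the source computes A (1) = B (2) and A (2) = B (1), as implied by the crossing stores described for FIG. 7C.

```python
def run(program, memory):
    """Execute a tiny instruction list.

    Returns the resulting memory and the number of memory-access
    instructions executed (MOV is register-to-register and not counted).
    """
    regs = {}
    mem = dict(memory)
    accesses = 0
    for op, *args in program:
        if op == "LOAD2":    # paired load: 2 consecutive addresses -> 2 consecutive registers
            reg, addr = args
            regs[reg], regs[reg + 1] = mem[addr], mem[addr + 1]
            accesses += 1
        elif op == "STORE":  # ordinary store of one register
            addr, reg = args
            mem[addr] = regs[reg]
            accesses += 1
        elif op == "STORE2":  # paired store: 2 consecutive registers -> 2 consecutive addresses
            addr, reg = args
            mem[addr], mem[addr + 1] = regs[reg], regs[reg + 1]
            accesses += 1
        elif op == "MOV":    # register-to-register transfer
            dst, src = args
            regs[dst] = regs[src]
        else:
            raise ValueError(f"unknown instruction: {op}")
    return mem, accesses


# Hypothetical layout: B(1), B(2) at addresses 0, 1 and A(1), A(2) at 10, 11.
B, A, N = 0, 10, 0
memory = {B: 1.5, B + 1: 2.5, A: 0.0, A + 1: 0.0}

# FIG. 7C style: the paired store is disassembled into two stores.
decomposed = [("LOAD2", N, B), ("STORE", A + 1, N), ("STORE", A, N + 1)]

# FIG. 7D style: one MOV into a free register keeps the paired store.
with_mov = [("LOAD2", N, B), ("MOV", N + 2, N), ("STORE2", A, N + 1)]

mem_decomposed, accesses_decomposed = run(decomposed, memory)
mem_with_mov, accesses_with_mov = run(with_mov, memory)
```

Both sequences leave the same values in A (1) and A (2); the variant with the MOV issues two memory-access instructions instead of three, at the cost of one extra register.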
[0052]
FIG. 8 shows a specific example (No. 5) of the present invention. This is an example of array initialization, shown after loop unrolling. If the stores take their value from the same element (T.1), the pair is decomposed at register allocation even if the stores are paired as they are. In this example, a MOV instruction is used so that the stores take their values from different elements (T.1, T.2), and the pair no longer needs to be disassembled at register allocation after pairing.
[0053]
FIG. 8A shows an example of an image of a source program unrolled four times by loop unrolling.
FIG. 8B shows an example of an image of the text after the expansion of FIG. 8A.
[0054]
FIG. 8C shows an example of a text image after pairing two STOREs in FIG. 8B. Here, the two STORE instructions in FIG. 8B are replaced with paired STORE instructions as indicated by arrows.
[0055]
FIG. 8D shows an example of a text image in which a MOV instruction is used to provide the registers T.1 and T.2, so that the paired STORE of FIG. 8C is not decomposed at register allocation. Here, MOVE T.2, 1.0 is added so that the two registers T.1 and T.2 are available. The STORE instructions then do not have to use the register of the same element at register allocation, and they are paired as indicated by the arrows in the figure.
[0056]
As described above, when two STORE instructions are paired and would use the register of the same element at register allocation, a MOV instruction is inserted so that the register of the same element is not used. This eliminates the disassembly of the paired instruction.
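The following sketch illustrates this transformation on a list of store instructions. It is not the patent's implementation: the tuple encoding `(op, address, register)`, the STORE2 form with two register operands, the fresh-register numbering, and the function name `pair_stores` are all assumptions made for illustration. It also copies from a register with MOV, whereas FIG. 8D moves the constant 1.0 directly.

```python
def pair_stores(text):
    """Pair adjacent STOREs to consecutive addresses into one STORE2.

    If both stores read the same register, a MOV into a fresh register
    is inserted first so that the pair uses two different registers.
    The copy is reused for later pairs, which is valid here because the
    text contains only stores and never overwrites a register.
    """
    out = []
    next_free = 1 + max((reg for _, _, reg in text), default=0)
    copy_of = {}  # source register -> register holding its copy
    i = 0
    while i < len(text):
        if i + 1 < len(text):
            (op1, addr1, reg1), (op2, addr2, reg2) = text[i], text[i + 1]
            if op1 == op2 == "STORE" and addr2 == addr1 + 1:
                if reg1 == reg2:
                    if reg1 not in copy_of:
                        copy_of[reg1] = next_free
                        next_free += 1
                        out.append(("MOV", copy_of[reg1], reg1))
                    reg2 = copy_of[reg1]
                out.append(("STORE2", addr1, reg1, reg2))
                i += 2
                continue
        out.append(text[i])
        i += 1
    return out


# Initialization unrolled four times: A(I) .. A(I+3) all receive T.1 (register 1).
unrolled = [("STORE", 0, 1), ("STORE", 1, 1), ("STORE", 2, 1), ("STORE", 3, 1)]
paired = pair_stores(unrolled)
```

For the four stores above, the result is one MOV followed by two STORE2 instructions, each reading registers 1 and 2.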
[0057]
FIG. 9 is a diagram for explaining the effect of the present invention.
FIG. 9A shows an example of an image of a source program. Here, the loop has been unrolled four times.
[0058]
FIG. 9B shows the total execution counts with and without pairing, taking the execution of each LOAD, ADD, and STORE instruction as 1τ.
・ Without pairing, the total is 16τ.
[0059]
・ With pairing, the total is 10τ.
Note that τ is one machine cycle. For example, if the CPU operates at 100 MHz, 1τ is 10 ns (nanoseconds).
[0060]
The speedup is therefore 16/10 = 1.6 times. With the pairing of the present invention, the loop-unrolled source program of FIG. 9A becomes an object program that can be executed at high speed. This is described in detail below.
[0061]
FIG. 9C shows an example of a text image without pairing (conventional). This is the text obtained when the source program of FIG. 9A, unrolled four times, is executed. The arrows indicate which part of the source program in FIG. 9A each part of the text corresponds to. Without pairing (conventional), the text consists of
・ 4 LOAD instructions
・ 4 LOAD instructions
・ 4 ADD instructions
・ 4 STORE instructions
for a total of 16 instructions (16τ).
[0062]
FIG. 9D shows an example of a text image with pairing (the present invention). This is the text obtained when the source program of FIG. 9A, unrolled four times, is executed. As indicated by the arrows, each two consecutive LOAD instructions in the conventional text of FIG. 9C are replaced with one paired LOAD instruction, and each two consecutive STORE instructions with one paired STORE instruction. With this pairing (the present invention), the text consists of
・ 2 paired LOAD instructions
・ 2 paired LOAD instructions
・ 4 ADD instructions
・ 2 paired STORE instructions
for a total of 10 instructions (10τ). As a result, as described above with reference to FIG. 9B, execution takes 16τ without pairing (conventional) and 10τ with the pairing of the present invention, so an object program that executes faster can be generated.
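The arithmetic behind FIG. 9 can be checked with a few lines. This is a minimal sketch; the dictionary keys and variable names are illustrative only, and the labels B and C for the two groups of loads are taken from the loop A(I) = B(I) + C(I) shown earlier in the description. Each instruction is assumed to take 1τ, as stated for FIG. 9B.

```python
CLOCK_HZ = 100e6          # example clock from the text: 100 MHz
TAU_NS = 1e9 / CLOCK_HZ   # one machine cycle in nanoseconds

# Instruction counts per unrolled loop body, one cycle each.
unpaired = {"LOAD B": 4, "LOAD C": 4, "ADD": 4, "STORE A": 4}
paired = {"LOAD2 B": 2, "LOAD2 C": 2, "ADD": 4, "STORE2 A": 2}

total_unpaired = sum(unpaired.values())   # cycles without pairing
total_paired = sum(paired.values())       # cycles with pairing
speedup = total_unpaired / total_paired

print(f"unpaired: {total_unpaired} tau = {total_unpaired * TAU_NS:.0f} ns")
print(f"paired:   {total_paired} tau = {total_paired * TAU_NS:.0f} ns")
print(f"speedup:  {speedup:.1f}x")
```

Only the memory-access instructions shrink (12 down to 6); the four ADD instructions are unchanged, which is why the overall ratio is 1.6 rather than 2.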
[0063]
[Effect of the invention]
As described above, the present invention adopts a configuration in which, when the source program 1 is analyzed or register allocation is optimized, a plurality of adjacent or nearby memory access instructions that satisfy a predetermined condition for memory access are collected and replaced with a continuous-execution memory access instruction (for example, a pair memory access instruction). This makes it possible to generate an object program in which the number of executions is reduced and memory access is faster, and thus to improve execution performance.
[Brief description of the drawings]
FIG. 1 is a system configuration diagram of the present invention.
FIG. 2 is a flowchart (part 1) illustrating the operation of the present invention.
FIG. 3 is a flowchart (part 2) illustrating the operation of the present invention.
FIG. 4 is a specific example (part 1) of the present invention.
FIG. 5 is a specific example (part 2) of the present invention.
FIG. 6 is a specific example (part 3) of the present invention.
FIG. 7 is a specific example (part 4) of the present invention.
FIG. 8 is a specific example (part 5) of the present invention.
FIG. 9 is an explanatory diagram of effects of the present invention.
[Explanation of symbols]
1: source program
2: compiler
3: source program analysis means
4: optimization / register allocation means
5: instruction reduction means
6: scheduling code generation means
7: object program

Claims

A memory access instruction reduction device that analyzes a source program and reduces memory access instructions, comprising, as means provided in a computer:
means for detecting a first store instruction that transfers the contents of a first register to a first memory address and a second store instruction that transfers the contents of the first register to a second memory address, the first store instruction and the second store instruction being consecutive or separated within a predetermined range;
means for determining whether or not a condition under which the detected first store instruction and second store instruction can be executed collectively is satisfied; and
means for replacing the first store instruction and the second store instruction determined to satisfy the condition with a continuous store instruction that transfers, with one instruction, the contents of the first register and the contents of a second register different from the first register to the first memory address and the second memory address, and for inserting an instruction that transfers the contents of the first register to the second register before the continuous store instruction.

A computer-readable recording medium on which is recorded a program that causes a computer to function as:
means for detecting a first store instruction that transfers the contents of a first register to a first memory address and a second store instruction that transfers the contents of the first register to a second memory address, the first store instruction and the second store instruction being consecutive or separated within a predetermined range;
means for determining whether or not a condition under which the detected first store instruction and second store instruction can be executed collectively is satisfied; and
means for replacing the first store instruction and the second store instruction determined to satisfy the condition with a continuous store instruction that transfers, with one instruction, the contents of the first register and the contents of a second register different from the first register to the first memory address and the second memory address, and for inserting an instruction that transfers the contents of the first register to the second register before the continuous store instruction.