JPWO2017056427A1

JPWO2017056427A1 - Program rewriting device, method and storage medium

Info

Publication number: JPWO2017056427A1
Application number: JP2017542715A
Authority: JP
Inventors: 悠記小林
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2015-09-30
Filing date: 2016-09-14
Publication date: 2018-07-19
Also published as: WO2017056427A1

Abstract

The present invention provides a technique for rewriting a program that is to be executed by a computer system including a processor and an accelerator so that data is supplied efficiently to the accelerator. A program analysis unit 11 identifies an offload target process in the program to be executed by the computer system including the processor and the accelerator. A high-level synthesis unit 12 generates design content information for an arithmetic device that implements the offload target process on the accelerator, and also generates structure information of the arithmetic device. Based on the structure information, a program code fragment generation unit 13 generates a program code fragment describing a process that places a set of input data to be supplied simultaneously to the arithmetic device in a contiguous area of memory and calls the offload target process using that contiguous area. A program rewriting unit 14 embeds the program code fragment at the location in the program where the offload target process is called.

Description

The present invention relates to a technique for rewriting a program to be executed by a computer system including a processor and an accelerator.

Computer systems that include a processor and an accelerator are known. When writing a program to be executed by such a computer system, it is necessary to decide which processes are assigned to the accelerator. For example, Patent Document 1 describes a technique for dynamically determining, at run time, the functions to be assigned to an accelerator. Hereinafter, assigning a process to an accelerator is also referred to as offloading.

Patent Document 1: JP 2012-133778 A

An accelerator on which an offloaded process is implemented must be supplied with input data efficiently. For example, when the offloaded process is implemented as a pipeline arithmetic unit, it is desirable that input data be supplied to the pipeline arithmetic unit every cycle. In this case, the input data required at each stage of the pipeline operation must be supplied to the pipeline arithmetic unit simultaneously. In other words, the input data belonging to one set must be supplied to the pipeline arithmetic unit at different timings that depend on the structure of the pipeline arithmetic unit. One conceivable way to achieve this is to add multi-stage pipeline registers that delay the input data. However, multi-stage pipeline registers consume a large circuit area and are therefore not acceptable.

Thus, an accelerator on which an offloaded process is implemented requires, in a given cycle, a plurality of input data belonging to different sets. In an ordinary memory system, however, only one address can be requested in a given cycle. If the memory bus is wide, a plurality of input data placed in a contiguous area can be read at once. A plurality of input data placed in non-contiguous areas, however, cannot be read at once. Therefore, for the accelerator to efficiently read the plurality of input data it requires in a given cycle, those input data must be placed in a contiguous area.

Thus, to supply input data efficiently to an accelerator on which an offloaded process is implemented, it is necessary to write a program that places a plurality of input data from different sets in a contiguous area while taking into account the timing at which each input datum is required. Writing such a program, however, is difficult and inefficient for a programmer.

Patent Document 1 describes dynamically determining, at run time, the process to be offloaded to the accelerator, but does not describe supplying input data efficiently to the accelerator.

The present invention has been made to solve the above problem. That is, an object of the present invention is to provide a technique for rewriting a program to be executed by a computer system including a processor and an accelerator so that data is supplied efficiently to the accelerator.

To achieve the above object, a program rewriting device of the present invention includes: program analysis means for identifying, in a program to be executed by a computer system including a processor and an accelerator, a process to be offloaded to the accelerator (an offload target process); high-level synthesis means for generating design content information representing the design content of an arithmetic device that implements the offload target process on the accelerator, and for generating, based on the generated design content information, structure information representing the structure of the arithmetic device; program code fragment generation means for generating, based on the structure information, a program code fragment describing a process that reads a set of input data to be supplied simultaneously to the arithmetic device, places the set in a contiguous area of memory, and calls the offload target process using the contiguous area; and program rewriting means for embedding the program code fragment at the location in the program where the offload target process is called.

In a method of the present invention, a computer device: identifies, in a program to be executed by a computer system including a processor and an accelerator, a process to be offloaded to the accelerator (an offload target process); generates design content information representing the design content of an arithmetic device that implements the offload target process on the accelerator; generates, based on the generated design content information, structure information representing the structure of the arithmetic device; generates, based on the structure information, a program code fragment describing a process that reads a set of input data to be supplied simultaneously to the arithmetic device, places the set in a contiguous area of memory, and calls the offload target process using the contiguous area; and embeds the program code fragment at the location in the program where the offload target process is called.

A storage medium of the present invention stores a program that causes a computer device to execute: a program analysis step of identifying, in a program to be executed by a computer system including a processor and an accelerator, a process to be offloaded to the accelerator (an offload target process); a high-level synthesis step of generating design content information representing the design content of an arithmetic device that implements the offload target process on the accelerator, and generating, based on the generated design content information, structure information representing the structure of the arithmetic device; a program code fragment generation step of generating, based on the structure information, a program code fragment describing a process that reads a set of input data to be supplied simultaneously to the arithmetic device, places the set in a contiguous area of memory, and calls the offload target process using the contiguous area; and a program rewriting step of embedding the program code fragment at the location in the program where the offload target process is called.

The present invention can provide a technique for rewriting a program to be executed by a computer system including a processor and an accelerator so that data is supplied efficiently to the accelerator.

FIG. 1 is a block diagram showing the configuration of a program rewriting device as an embodiment of the present invention.
FIG. 2 is a diagram showing an example of the hardware configuration of the program rewriting device as the embodiment of the present invention.
FIG. 3 is a flowchart explaining the operation of the program rewriting device as the embodiment of the present invention.
FIG. 4 is a diagram showing an example of the program input in a specific example of the embodiment of the present invention, together with its offload target process and offload non-target process.
FIG. 5 is a flowchart explaining an example of the operation of the program analysis unit in the specific example of the embodiment of the present invention.
FIG. 6 is a flowchart explaining an example of the operation of the high-level synthesis unit in the specific example of the embodiment of the present invention.
FIG. 7 is a diagram showing an example of the configuration of the pipeline arithmetic unit determined in the specific example of the embodiment of the present invention.
FIG. 8 is a diagram showing an example of the design content information generated in the specific example of the embodiment of the present invention.
FIG. 9 is a diagram showing an example of the structure information generated in the specific example of the embodiment of the present invention.
FIG. 10 is a flowchart explaining an example of the operation of the program code fragment generation unit in the specific example of the embodiment of the present invention.
FIG. 11 is a diagram showing an example of the program code fragment generated in the specific example of the embodiment of the present invention.
FIG. 12 is a diagram explaining an example of the sets of input data supplied simultaneously to the pipeline arithmetic unit in the specific example of the embodiment of the present invention.
FIG. 13 is a diagram showing an example of the prologue and epilogue program code fragments in the specific example of the embodiment of the present invention.
FIG. 14 is a diagram showing an example of the program output in the specific example of the embodiment of the present invention.

Hereinafter, an embodiment of the present invention will be described in detail with reference to the drawings.

FIG. 1 shows the functional block configuration of a program rewriting device 1 as an embodiment of the present invention. In FIG. 1, the program rewriting device 1 includes a program analysis unit 11, a high-level synthesis unit 12, a program code fragment generation unit 13, and a program rewriting unit 14.

The program rewriting device 1 can be configured with hardware elements such as those shown in FIG. 2. In FIG. 2, the program rewriting device 1 includes a CPU (Central Processing Unit) 1001 and a memory 1002. The memory 1002 is composed of a RAM (Random Access Memory), a ROM (Read Only Memory), an auxiliary storage device (such as a hard disk), and the like. In this case, each functional block of the program rewriting device 1 is realized by the CPU 1001, which reads and executes a computer program stored in the memory 1002 and controls the other units. Note that the hardware configuration of the program rewriting device 1 and of its functional blocks is not limited to the configuration described above.

The program analysis unit 11 identifies, in a program to be executed by a computer system including a processor and an accelerator, the process to be offloaded to the accelerator (the offload target process). For example, the program analysis unit 11 may identify the offload target process by acquiring, as input, information representing the program and information designating which part of it is the offload target process. Alternatively, the program analysis unit 11 may acquire information representing the program as input and identify the offload target process using a known technique for determining which part of a program to offload.

The high-level synthesis unit 12 generates design content information representing the design content of an arithmetic device that implements the offload target process on the accelerator. The design content information may be, for example, an RTL (Register Transfer Level) description. The high-level synthesis unit 12 also generates, based on the generated design content information, structure information representing the structure of the arithmetic device. The structure information may include, for example, information on the delay of each input terminal.

The program code fragment generation unit 13 generates a program code fragment based on the structure information of the arithmetic device. The program code fragment includes a description of a process that places a set of input data to be supplied simultaneously to the arithmetic device implemented on the accelerator in a contiguous area of the memory 1002, and a description of a process that calls the offload target process using that contiguous area. For example, when the structure information includes information on the delay of each input terminal, the program code fragment generation unit 13 describes a process that reads the set of input data to be supplied simultaneously, determined by taking the delay of each input terminal into account, and places the read set in the contiguous area. The program code fragment generation unit 13 also describes, as the call statement of the offload target process, a call statement that takes information indicating the contiguous area as an argument.

The program rewriting unit 14 embeds the program code fragment at the location in the program where the offload target process is called. Specifically, for the portion of the program other than the offload target process, the program rewriting unit 14 may output a program in which the location containing the call statement of the offload target process has been replaced with the program code fragment.

FIG. 3 shows the operation of the program rewriting device 1 configured as described above.

In FIG. 3, the program analysis unit 11 first acquires a program to be executed by a computer system including a processor and an accelerator. The program analysis unit 11 then identifies the offload target process in the acquired program (step S1).

For example, as described above, the program analysis unit 11 may acquire information designating the offload target process as input together with the program.

Next, the high-level synthesis unit 12 generates design content information for the arithmetic device that implements the offload target process on the accelerator (step S2).

Next, the high-level synthesis unit 12 generates structure information representing the structure of the arithmetic device based on the design content information (step S3).

Next, the program code fragment generation unit 13 generates a program code fragment based on the structure information (step S4).

As described above, the program code fragment includes a description of a process that places a set of input data to be supplied simultaneously to the arithmetic device in a contiguous area of the memory 1002, and a description of a process that calls the offload target process using that contiguous area.

Next, the program rewriting unit 14 embeds the program code fragment at the location in the program where the offload target process is called (step S5).

As described above, for the portion of the program other than the offload target process, the program rewriting unit 14 may output a program in which the location containing the call statement of the offload target process has been replaced with the program code fragment.

This completes the operation of the program rewriting device 1.

In this way, for the program acquired in step S1, the program rewriting device 1 outputs the design content information generated in step S2 and the program, rewritten in step S5, for the portion other than the offload target process.

Next, the operation of the program rewriting device 1 is described using a specific example. In this example, the offload target process is assumed to be implemented on the accelerator as a pipeline arithmetic unit.

First, in step S1, the program analysis unit 11 acquires information representing the input program shown in FIG. 4 and information designating the offload target process. In this example, the input program contains two functions, "funcA" and "main". The program analysis unit 11 is assumed to have acquired the function name "funcA" as the information designating the offload target process.

In this specific example, the program analysis unit 11 also splits the acquired input program, in this step, into an offload target program and an offload non-target program (the part not subject to offloading) and outputs them.

FIG. 5 shows the details of the operation of the program analysis unit 11 in step S1 in the specific example.

In FIG. 5, the program analysis unit 11 performs structural analysis of the input program and extracts function name symbols (step S11). Here, "funcA" and "main" are extracted.

Next, the program analysis unit 11 searches the extracted function name symbols for one that matches the information designating the offload target process (step S12). Here, "funcA" is found.

Next, the program analysis unit 11 outputs the range of the program corresponding to the found function name symbol as the offload target program (step S13).

Next, the program analysis unit 11 outputs the range of the program corresponding to the function name symbols that do not match the designating information "funcA" as the offload non-target program (step S14).

This concludes the detailed description of the operation of step S1 in the specific example.

As a result, as shown in FIG. 4, the body of the function funcA is output as the offload target program, and the body of the function main is output as the offload non-target program.
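
The flow of steps S11 to S14 can be sketched in C as follows. This is only an illustration under stated assumptions, not the implementation of the program analysis unit 11: the record type func_entry and the helper classify_functions are hypothetical names introduced here, and the function name symbols are assumed to have been extracted already by the structural analysis of step S11.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record for one function found in step S11. */
struct func_entry {
    const char *name;  /* function name symbol, e.g. "funcA" */
    int is_offload;    /* 1: offload target program, 0: offload non-target program */
};

/* Steps S12 to S14: compare each symbol with the designated name and
 * mark the entry accordingly. Returns the number of offload targets. */
static size_t classify_functions(struct func_entry *funcs, size_t count,
                                 const char *target)
{
    size_t hits = 0;

    assert(funcs != NULL && target != NULL);
    for (size_t i = 0; i < count; i++) {
        funcs[i].is_offload = (strcmp(funcs[i].name, target) == 0);
        if (funcs[i].is_offload)
            hits++;
    }
    return hits;
}
```

For the two symbols "funcA" and "main" with the designated name "funcA", only the first entry is marked as an offload target.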

Next, in steps S2 to S3, the high-level synthesis unit 12 outputs an RTL description of the pipeline arithmetic unit based on the offload target program. The high-level synthesis unit 12 also generates, as the structure information, information representing, for each input terminal of the pipeline arithmetic unit, the number of cycles (the delay number) from the earliest-stage input terminal.

FIG. 6 shows the details of the operation of the high-level synthesis unit 12 in steps S2 to S3 in the specific example.

In FIG. 6, the high-level synthesis unit 12 performs high-level synthesis of the offload target program and generates the structure of a pipeline arithmetic unit (step S21).

In this example, as shown in FIG. 4, the offload target program computes (((a * b) + c) * d) for four inputs a, b, c, and d and outputs the result. The high-level synthesis unit 12 therefore generates the structure shown in FIG. 7 as the structure of a pipeline arithmetic unit that implements this offload target program. This pipeline arithmetic unit multiplies inputs a and b in the first pipeline stage (stage 1). Assuming the multiplication has a delay of two cycles, the product is output at stage 3. At stage 3, the pipeline arithmetic unit adds input c to the product from stage 1. Assuming the addition has a delay of two cycles, the sum is output at stage 5. At stage 5, the pipeline arithmetic unit multiplies the sum from stage 3 by input d. The result of this multiplication is output at stage 7.
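
The stage numbers in this example follow from chaining the operator delays, which the short C sketch below reproduces. The names func_a_reference, stage_plan and plan_stages are hypothetical, the int operand type is an assumption (FIG. 4 is not reproduced here), and the second multiplier is assumed to have the same two-cycle delay as the first, which is consistent with the result appearing at stage 7.

```c
#include <assert.h>

/* Reference result of the offload target: (((a * b) + c) * d).
 * The int operand type is an assumption made for this sketch. */
static int func_a_reference(int a, int b, int c, int d)
{
    return ((a * b) + c) * d;
}

/* Stage at which each operator receives its inputs, and the output stage. */
struct stage_plan {
    int mul1_stage;  /* a * b */
    int add_stage;   /* (a * b) + c */
    int mul2_stage;  /* ((a * b) + c) * d */
    int out_stage;   /* final result */
};

/* Chain the operators starting at stage 1, as in the example above. */
static struct stage_plan plan_stages(int mul_latency, int add_latency)
{
    struct stage_plan plan;

    assert(mul_latency > 0 && add_latency > 0);
    plan.mul1_stage = 1;
    plan.add_stage = plan.mul1_stage + mul_latency;
    plan.mul2_stage = plan.add_stage + add_latency;
    plan.out_stage = plan.mul2_stage + mul_latency;
    return plan;
}
```

With both delays set to two cycles, the plan places the addition at stage 3, the second multiplication at stage 5, and the output at stage 7.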

Next, the high-level synthesis unit 12 converts the generated structure into an RTL description and outputs it (step S22).

Here, it is assumed that a Verilog RTL description such as that shown in FIG. 8 is output for the pipeline arithmetic unit structure shown in FIG. 7. In FIG. 8, MULTIPLIER denotes a multiplier and ADDER denotes an adder.

Next, for each input terminal in the generated structure, the high-level synthesis unit 12 obtains the number of cycles (the delay number) from the earliest-stage input terminal and outputs it as the structure information (step S23).

Here, structure information such as that shown in FIG. 9 is output for the pipeline arithmetic unit structure shown in FIG. 7. In the example of FIG. 9, the structure information is expressed as pairs of an input and its delay number. For example, input a enters at stage 1. The delay number of input a is its difference from stage 1, the earliest stage, so it is 1 - 1 = 0. The delay number of input b is 0, the same as for input a. Input c enters at stage 3, so its delay number is 3 - 1 = 2. Input d enters at stage 5, so its delay number is 5 - 1 = 4.
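
The delay numbers of FIG. 9 can be derived mechanically from the stage at which each input enters. The following C sketch shows one way to do so; the input_port record and the compute_delays helper are hypothetical and are not taken from the figures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical row of the structure information. */
struct input_port {
    const char *name;  /* input name, e.g. "a" */
    int stage;         /* stage at which the input enters the pipeline */
    int delay;         /* output: stage minus the earliest input stage */
};

/* Step S23: delay = stage of this input - stage of the earliest input. */
static void compute_delays(struct input_port *ports, size_t count)
{
    int first_stage;

    assert(ports != NULL && count > 0);
    first_stage = ports[0].stage;
    for (size_t i = 1; i < count; i++) {
        if (ports[i].stage < first_stage)
            first_stage = ports[i].stage;
    }
    for (size_t i = 0; i < count; i++)
        ports[i].delay = ports[i].stage - first_stage;
}
```

For inputs a, b, c, and d entering at stages 1, 1, 3, and 5, the computed delay numbers are 0, 0, 2, and 4.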

This concludes the detailed description of the operation of steps S2 to S3 in the specific example.

Next, in step S4, the program code fragment generation unit 13 refers to the structure information of the pipeline arithmetic unit and writes a description that groups the input data selected according to their delay numbers and places them in a contiguous area, and a description that calls the offload target process using the contiguous area.

FIG. 10 shows the details of the operation of the program code fragment generation unit 13 in step S4 in the specific example.

In FIG. 10, the program code fragment generation unit 13 executes the following steps S31 to S32 for each argument in the call statement of the offload target process. Here, the arguments are denoted argj (j = 0 to M, where M is an integer of 0 or more). In this example, arg0 corresponds to "A + i", arg1 to "B + i", arg2 to "C + i", and arg3 to "D + i", so M = 3.

First, the program code fragment generation unit 13 refers to the structure information and looks up the delay number dj corresponding to argj (step S31).

For example, referring to FIG. 9, 0 is found as the delay number d0 corresponding to arg0.

Next, the program code fragment generation unit 13 generates the statement "pack[j] = *(argj + Z - dj);" as the description that places the argument selected according to the delay number in the contiguous area (step S32).

Here, Z denotes the delay number from the input to the output of the pipeline arithmetic unit. Referring to FIG. 7, Z = 6 in this specific example. For example, the code fragment corresponding to arg0 is "pack[0] = *(A + i + 6);". The program code fragment generation unit 13 takes the description generated for this argument as the program code fragment. If a program code fragment has already been generated for other arguments, the program code fragment generation unit 13 appends the description generated for this argument to that program code fragment.
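
As a rough illustration of step S32, the following C sketch formats one such statement with the constant Z - dj already folded into a single offset. The helper emit_pack_statement and its parameters are hypothetical; the actual generator may represent code fragments differently.

```c
#include <assert.h>
#include <stdio.h>

/* Step S32: write "pack[j] = *(<arg> + <Z - dj>);" into buf.
 * Returns the value returned by snprintf. */
static int emit_pack_statement(char *buf, size_t size, int j,
                               const char *arg_expr, int z, int delay)
{
    assert(buf != NULL && arg_expr != NULL);
    assert(j >= 0 && delay >= 0 && delay <= z);
    return snprintf(buf, size, "pack[%d] = *(%s + %d);", j, arg_expr, z - delay);
}
```

With Z = 6 and the delay numbers 0, 0, 2, and 4, the argument expressions "A + i", "B + i", "C + i", and "D + i" receive the offsets 6, 6, 4, and 2, respectively.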

When steps S31 to S32 have been completed for all arguments, a program code fragment 901 that places the set of arguments selected according to their delay numbers in the contiguous area is generated, as shown in FIG. 11.

Next, the program code fragment generation unit 13 generates a description that calls the offload target process using "pack", which represents the contiguous area (step S33). As a result, a program code fragment 902 serving as the call statement of the offload target process is output, as shown in FIG. 11.
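
The run-time effect of the program code fragments 901 and 902 for one loop iteration might look like the following C sketch. FIG. 11 is not reproduced here, so the int element type, the offload_fn signature, and the helper names call_with_pack and pack_checksum are all assumptions; pack_checksum is merely a software stand-in for the accelerator call.

```c
#include <assert.h>
#include <stddef.h>

enum { NUM_ARGS = 4, Z = 6 };  /* values of the specific example */

/* Delay number dj of each argument, in the order A, B, C, D. */
static const int delays[NUM_ARGS] = { 0, 0, 2, 4 };

/* Hypothetical entry point that receives the contiguous area. */
typedef int (*offload_fn)(const int pack[NUM_ARGS]);

/* Stand-in for the accelerator call, used only to make the sketch
 * observable: it encodes the four packed values into one number. */
static int pack_checksum(const int pack[NUM_ARGS])
{
    return pack[0] + 10 * pack[1] + 100 * pack[2] + 1000 * pack[3];
}

/* One steady-state iteration: fill pack (cf. fragment 901), then call
 * the offload target with pack (cf. fragment 902).
 * The caller must ensure that index i + Z is valid for A and B. */
static int call_with_pack(offload_fn fn, const int *A, const int *B,
                          const int *C, const int *D, size_t i)
{
    const int *args[NUM_ARGS] = { A + i, B + i, C + i, D + i };
    int pack[NUM_ARGS];

    assert(fn != NULL);
    for (int j = 0; j < NUM_ARGS; j++)
        pack[j] = *(args[j] + Z - delays[j]);
    return fn(pack);
}
```

Passing pack as the single argument means the accelerator side only has to read one contiguous block per call, which is the point of the rewriting.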

Next, the program code fragment generation unit 13 generates program code fragments that serve as the prologue and the epilogue of the process that places the set of arguments selected according to their delay numbers in the contiguous area and calls the offload target process (step S34).

Here, the program code fragments serving as the prologue and the epilogue are described. To that end, the combinations of arguments supplied simultaneously to the pipeline arithmetic unit shown in FIG. 7 are described first. Taking the number of delays of each argument into account, the combinations of arguments supplied simultaneously are as shown in FIG. 12. In FIG. 12, each row represents a combination of arguments supplied simultaneously. The "output" column represents the information output from the pipeline arithmetic unit in the cycle in which the arguments of that row are input. In the period 801, which runs from the first input to the pipeline arithmetic unit through the Z-th cycle, some of the arguments are unnecessary or the pipeline arithmetic unit produces no output, so the program code fragments 901 and 902 described above cannot be applied. Likewise, in the period 803, which consists of the Z cycles leading up to the last output from the pipeline arithmetic unit, some or all of the arguments are unnecessary, so the program code fragments 901 and 902 cannot be applied. In other words, the program code fragments 901 and 902 are applicable only in the period 802, from the (Z+1)-th cycle through the N-th cycle. Therefore, as shown in FIG. 13, the program code fragment generation unit 13 generates a program code fragment 903 serving as the prologue for the period 801, and a program code fragment 904 serving as the epilogue for the period 803. The program code fragment generation unit 13 also adds a description that declares the continuous area to the program code fragments as necessary. In the examples of FIGS. 11 and 13, the declaration of the continuous area pack is added at the head of the program code fragment 903 serving as the prologue.
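The split into the three periods can be sketched as a small classification function. This is an illustration only: the names `classify_cycle` and `steady_iterations` are hypothetical, cycles are counted from 1, and the sketch reads period 803 as cycles N+1 through N+Z, which is an interpretation of "the Z cycles leading up to the last output" rather than something stated explicitly. It also assumes N >= Z.

```c
/* Period labels reuse the reference numerals of FIG. 12. */
enum period {
    PERIOD_NONE = 0,   /* cycle outside the run */
    PERIOD_801 = 801,  /* prologue: pipeline is filling */
    PERIOD_802 = 802,  /* steady state: fragments 901 and 902 apply */
    PERIOD_803 = 803   /* epilogue: pipeline is draining */
};

/* cycle is 1-based, z is the pipeline latency Z, n is the element count N.
 * Assumes n >= z. */
enum period classify_cycle(int cycle, int z, int n)
{
    if (cycle < 1 || cycle > n + z)
        return PERIOD_NONE;
    if (cycle <= z)
        return PERIOD_801;
    if (cycle <= n)
        return PERIOD_802;
    return PERIOD_803;
}

/* Number of cycles in period 802, i.e. how often fragments 901/902 repeat. */
int steady_iterations(int z, int n)
{
    return n - z;
}
```

With Z = 6 as in the specific example and a hypothetical N = 100, cycles 1 to 6 fall in period 801, cycles 7 to 100 in period 802, and cycles 101 to 106 in period 803.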

This concludes the detailed description of the operation of step S4 in the specific example.

Next, in step S5, the program rewriting unit 14 embeds the program code fragments 901 to 904 described above at the call location of the offload target process in the offload non-target program, and outputs the result. The output program is as shown in FIG. 14. In this specific example, the program rewriting unit 14 inserts the program code fragment 901 shown in FIG. 11 immediately before the call statement "R[i] = funcA(A + i, B + i, C + i, D + i)" in the input program shown in FIG. 4. The program rewriting unit 14 also replaces that call statement with the program code fragment 902 shown in FIG. 11, and changes the number of iterations of the loop containing the program code fragments 901 and 902 to N - Z. The program rewriting unit 14 then inserts the program code fragments 903 and 904 before and after that loop, respectively. Details of the program code fragments 903 and 904 are omitted in FIG. 14.
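The shape of the rewritten program (prologue, a loop of N - Z iterations, epilogue) can be checked with a pure-software model. Everything below is a sketch under stated assumptions, not the program of FIG. 14: the `pipeline` struct and `pipeline_step` stand in for the accelerator, `func_a` is an invented computation, and the delays in `DELAY` are made up except d0 = 0, which follows from "pack[0] = *(A + i + 6);". The model requires N >= Z.

```c
#include <assert.h>

#define Z 6      /* pipeline latency, as in the specific example */
#define NARGS 4

/* Hypothetical per-terminal delays; only DELAY[0] = 0 follows from the text. */
static const int DELAY[NARGS] = {0, 0, 2, 4};

/* Software stand-in for the pipeline arithmetic unit. */
struct pipeline {
    int hist[NARGS][Z + 1];  /* last Z+1 values latched on each terminal */
    long cycle;              /* cycles executed so far */
};

/* Invented computation standing in for the offload target process. */
static int func_a(int a, int b, int c, int d) { return (a + b) * c - d; }

/* One cycle: latch pack[], then return the result for the element that
 * entered Z cycles earlier (not meaningful during the first Z cycles). */
int pipeline_step(struct pipeline *p, const int pack[NARGS])
{
    int in[NARGS];
    for (int j = 0; j < NARGS; j++) {
        p->hist[j][p->cycle % (Z + 1)] = pack[j];
        /* terminal j received the wanted element Z - DELAY[j] cycles ago */
        long src = p->cycle - (Z - DELAY[j]);
        in[j] = (src >= 0) ? p->hist[j][src % (Z + 1)] : 0;
    }
    p->cycle++;
    return func_a(in[0], in[1], in[2], in[3]);
}

void run_rewritten(const int *A, const int *B, const int *C, const int *D,
                   int *R, int n)
{
    const int *arg[NARGS] = {A, B, C, D};
    struct pipeline p = {0};
    int pack[NARGS];

    assert(n >= Z);  /* the model does not handle runs shorter than Z */

    /* Prologue (role of fragment 903): fill the pipeline, discard output. */
    for (int t = 0; t < Z; t++) {
        for (int j = 0; j < NARGS; j++)
            pack[j] = (t - DELAY[j] >= 0) ? arg[j][t - DELAY[j]] : 0;
        (void)pipeline_step(&p, pack);
    }

    /* Steady state: fragments 901 and 902, repeated N - Z times. */
    for (int i = 0; i < n - Z; i++) {
        for (int j = 0; j < NARGS; j++)
            pack[j] = *(arg[j] + i + Z - DELAY[j]);  /* role of fragment 901 */
        R[i] = pipeline_step(&p, pack);              /* role of fragment 902 */
    }

    /* Epilogue (role of fragment 904): drain the last Z results. */
    for (int t = n; t < n + Z; t++) {
        for (int j = 0; j < NARGS; j++)
            pack[j] = (t - DELAY[j] < n) ? arg[j][t - DELAY[j]] : 0;
        R[t - Z] = pipeline_step(&p, pack);
    }
}
```

Under these assumptions, `R[i]` ends up equal to `func_a(A[i], B[i], C[i], D[i])` for every i, which is the behavior of the original call statement. On real hardware the call would pass the address of the continuous area to the accelerator instead of invoking a software model.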

In this way, in this specific example, the program rewriting device 1 outputs the RTL description of FIG. 8 and the output program of FIG. 14 for the input program of FIG. 4.

This concludes the description of the specific example.

Next, the effects of the embodiment of the present invention are described.

The program rewriting device according to the present embodiment can rewrite a program to be executed by a computer system including a processor and an accelerator so that data is supplied to the accelerator efficiently.

The reason is as follows. In the present embodiment, the program analysis unit specifies the offload target process in a program to be executed by a computer system including a processor and an accelerator. The high-level synthesis unit generates design content information representing the design content of an arithmetic device that implements the offload target process on the accelerator. The high-level synthesis unit also generates structure information representing the structure of the arithmetic device based on the generated design content information. The program code fragment generation unit then generates a program code fragment based on the structure information of the arithmetic device. The program code fragment contains a description of a process that places the set of input data to be supplied simultaneously to the arithmetic device in a continuous area on the memory, and a description of a process that calls the offload target process using the continuous area. Finally, the program rewriting unit embeds the program code fragment at the call location of the offload target process in the program.

As a result, when a program rewritten according to the present embodiment is executed by a computer system including a processor and an accelerator, input data from different sets that were placed in non-contiguous areas can be supplied to the accelerator efficiently at the same time.

Here, a specific example is used to explain how a program rewritten according to the present embodiment shortens the time the accelerator needs to read input data, compared with the original program.

For example, suppose that an offload non-target process in the program calls the offload target process and passes four 4-byte data items placed in non-contiguous areas as arguments. The offload target process is assumed to be implemented as a pipeline arithmetic unit.

First, consider the speed difference between the processor and the pipeline arithmetic unit on the accelerator. If the processor operates at 3 GHz and the pipeline arithmetic unit on the accelerator operates at 100 MHz, the speed difference is a factor of 30. The processor is assumed to be able to load one 4-byte data item per cycle, and the accelerator is likewise assumed to be able to load one 4-byte data item per cycle.

The original program, before rewriting, supplies the four input data items from different sets that are needed at the same time to the accelerator without placing them in a continuous area. In this case, sending the addresses from the processor to the accelerator requires 4 cycles (1 cycle × 4 items). Because the addresses sent are non-contiguous, loading the data at each address by the accelerator requires 4 cycles (1 cycle × 4 items). Taking the 30-fold speed difference into account, these 4 accelerator cycles correspond to 120 processor cycles. That is, in this case, reading the four input data items from different sets requires a total of 4 + 120 = 124 cycles in terms of processor cycles.

In contrast, a program rewritten according to the present embodiment places the four input data items from different sets that are needed at the same time in a continuous area before supplying them to the accelerator. In this case, collecting the input data by the processor requires 4 cycles (1 cycle × 4 items). Writing the collected input data to the continuous area requires another 4 cycles (1 cycle × 4 items). Sending the address from the processor to the accelerator requires 1 cycle (1 cycle × 1 item). Because the four data items are placed in the continuous area indicated by that address, the accelerator completes the load of the data at that address in 1 cycle (1 cycle × 1 item). Taking the 30-fold speed difference into account, this 1 accelerator cycle corresponds to 30 processor cycles. That is, in the present embodiment, reading the four input data items from different sets requires a total of 4 + 4 + 1 + 30 = 39 cycles in terms of processor cycles.

Thus, a program rewritten according to the present embodiment lets the accelerator read the input data 124 ÷ 39 ≈ 3.2 times faster than the original program.
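The cycle counts above can be reproduced with two small functions. This is only a restatement of the arithmetic in the text; the function and parameter names are hypothetical, and the cost model (one processor cycle per address or item, one accelerator cycle per load, a fixed clock ratio) is the simplified one the example assumes.

```c
/* All results are in processor cycles.
 * ratio = processor clock / accelerator clock (30 in the example). */

/* Original program: every item has its own, non-contiguous address. */
long cycles_before(long n_items, long ratio)
{
    long send_addresses = n_items;       /* one address per item */
    long accel_loads = n_items * ratio;  /* one accelerator cycle per item */
    return send_addresses + accel_loads;
}

/* Rewritten program: items are gathered into one continuous area first. */
long cycles_after(long n_items, long ratio)
{
    long gather = n_items;      /* processor reads each item */
    long write_pack = n_items;  /* processor writes each item to the area */
    long send_address = 1;      /* a single address for the whole area */
    long accel_load = 1 * ratio; /* a single accelerator load */
    return gather + write_pack + send_address + accel_load;
}
```

For four items and a ratio of 30, `cycles_before` gives 124 and `cycles_after` gives 39, matching the totals in the text.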

In the embodiment of the present invention, an FPGA (field-programmable gate array) or a GPU (Graphics Processing Unit), for example, can be used as the accelerator, but the accelerator is not limited to these.

In the embodiment of the present invention, an example has been described in which the offloaded process is implemented on the accelerator as a pipeline arithmetic unit. However, the arithmetic device that implements the offloaded process is not limited to a pipeline arithmetic unit.

The embodiment of the present invention has been described mainly using an example in which the design content information generated by the high-level synthesis unit is an RTL description in Verilog, but the invention is not limited to this. The design content information may be written in another hardware description language.

The embodiment of the present invention has been described mainly using an example in which each functional block of the program rewriting device is realized by a CPU that executes a computer program stored in a storage device or a ROM. However, the invention is not limited to this; some or all of the functional blocks, or a combination of them, may be realized by dedicated hardware.

In each embodiment of the present invention described above, the functional blocks of the program rewriting device may be distributed across a plurality of devices.

In the embodiment of the present invention, the operation of the program rewriting device described with reference to the flowcharts may be stored in a storage device (storage medium) of a computer device as the computer program of the present invention, and the CPU may read and execute that computer program. In such a case, the present invention is constituted by the code of the computer program or by the storage medium.

The present invention has been described above using the foregoing embodiment as an exemplary example. However, the present invention is not limited to that embodiment. That is, various aspects that can be understood by those skilled in the art may be applied to the present invention within its scope.

This application claims priority based on Japanese Patent Application No. 2015-193106, filed on September 30, 2015, the entire disclosure of which is incorporated herein.

Description of Symbols

1 Program rewriting device
11 Program analysis unit
12 High-level synthesis unit
13 Program code fragment generation unit
14 Program rewriting unit
1001 CPU
1002 Memory

Claims

1. A program rewriting device comprising:
program analysis means for specifying, in a program to be executed by a computer system including a processor and an accelerator, a process to be offloaded to the accelerator (offload target process);
high-level synthesis means for generating design content information representing the design content of an arithmetic device that implements the offload target process on the accelerator, and for generating structure information representing the structure of the arithmetic device based on the generated design content information;
program code fragment generation means for generating, based on the structure information, a program code fragment describing a process of reading a set of input data to be supplied simultaneously to the arithmetic device, placing the set in a continuous area on a memory, and calling the offload target process using the continuous area; and
program rewriting means for embedding the program code fragment at the call location of the offload target process in the program.

2. The program rewriting device according to claim 1, wherein
the high-level synthesis means generates, as the structure information, information including, for each input terminal of the arithmetic device, the number of cycles (number of delays) from the input terminal in the preceding stage, and
the program code fragment generation means includes, in the program code fragment, a description of a process of placing, in the continuous area, the input data corresponding to the numbers of delays as the set.

3. The program rewriting device according to claim 1, wherein the high-level synthesis means generates design content information of a pipeline arithmetic unit as the arithmetic device.

4. A method in which a computer device:
specifies, in a program to be executed by a computer system including a processor and an accelerator, a process to be offloaded to the accelerator (offload target process);
generates design content information representing the design content of an arithmetic device that implements the offload target process on the accelerator, and generates structure information representing the structure of the arithmetic device based on the generated design content information;
generates, based on the structure information, a program code fragment describing a process of reading a set of input data to be supplied simultaneously to the arithmetic device, placing the set in a continuous area on a memory, and calling the offload target process using the continuous area; and
embeds the program code fragment at the call location of the offload target process in the program.

5. A storage medium storing a program that causes a computer device to execute:
a program analysis step of specifying, in a program to be executed by a computer system including a processor and an accelerator, a process to be offloaded to the accelerator (offload target process);
a high-level synthesis step of generating design content information representing the design content of an arithmetic device that implements the offload target process on the accelerator, and generating structure information representing the structure of the arithmetic device based on the generated design content information;
a program code fragment generation step of generating, based on the structure information, a program code fragment describing a process of reading a set of input data to be supplied simultaneously to the arithmetic device, placing the set in a continuous area on a memory, and calling the offload target process using the continuous area; and
a program rewriting step of embedding the program code fragment at the call location of the offload target process in the program.