KR101737785B1

KR101737785B1 - Apparatus and method for program compilation

Info

Publication number: KR101737785B1
Application number: KR1020150190701A
Authority: KR
Inventors: 이재진; 조강원
Original assignee: 서울대학교산학협력단
Priority date: 2015-01-16
Filing date: 2015-12-31
Publication date: 2017-05-19
Also published as: KR20160088796A

Abstract

The present invention relates to a program compiling apparatus and a program compiling method. According to a first aspect of the present invention, there is provided a program compiling apparatus for compiling an OpenCL program, the program compiling apparatus comprising: a syntax separator for separating an OpenCL kernel into statements before a loop statement, statements inside the loop statement, and statements after the loop statement; a circuit generator for generating a circuit corresponding to each group of statements; and a language generator for expressing the generated circuit in a hardware description language.

Description

APPARATUS AND METHOD FOR PROGRAM COMPILATION

The present invention relates to a program compiling apparatus and a program compiling method, and more particularly, to an apparatus and method for compiling an OpenCL program for an FPGA.

OpenCL (Open Computing Language) is a standard programming model for programs that run in heterogeneous computing environments. It defines the OpenCL platform and how OpenCL applications execute on that platform.

The OpenCL platform consists of one host processor and one or more compute devices connected to it.

The computing device has one or more compute units (CUs), each of which is again comprised of one or more processing elements (PEs).

Typically, the host processor is a CPU, and the operating system runs on the host processor. The computing device corresponds to a multicore CPU or accelerator (GPU, Intel Xeon Phi coprocessor, FPGA, etc.).

If the system in which the OpenCL platform is implemented is a heterogeneous system equipped with a field programmable gate array (FPGA), the CPU becomes a host processor and the FPGA becomes a computing device. The FPGA processes multiple work items processed by the computing device in parallel according to the instructions of the host program executed by the CPU.

One way to run multiple work items in parallel is to create a circuit corresponding to one work item and then use pipelining to divide the circuit into multiple pipeline stages, so that several work items are processed at the same time, one in each stage.

As another method, there is a method of creating a plurality of circuits corresponding to one work item (hereinafter, referred to as "unit circuits" for convenience), and independently processing different work items in each unit circuit.

In general, both methods are used together to handle an OpenCL kernel. In other words, as far as the capacity of the FPGA permits, the unit circuit is replicated, and multiple work items are executed simultaneously by pipelining within each unit circuit.

However, if the kernel contains a loop, the number of work items that a unit circuit can process simultaneously is reduced from the total number of pipeline stages to the number of pipeline stages that execute the loop body. Until the work items in the loop's pipeline stages finish iterating, other work items stall without being processed, and the utilization of the FPGA is degraded.
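The reduction described above can be illustrated with a small calculation. This is only a sketch of the stated relationship; the function names and the stage counts used below are hypothetical and do not come from the specification.

```python
def max_in_flight(pre_stages, loop_stages, post_stages, has_loop):
    """Upper bound on work items processed at once in one unit circuit.

    Without a loop, every pipeline stage can hold a work item. With a loop,
    the text above says the bound drops to the number of loop-body stages.
    """
    total = pre_stages + loop_stages + post_stages
    return loop_stages if has_loop else total


def stage_occupancy(pre_stages, loop_stages, post_stages, has_loop):
    """Fraction of pipeline stages that can be busy at the same time."""
    total = pre_stages + loop_stages + post_stages
    return max_in_flight(pre_stages, loop_stages, post_stages, has_loop) / total
```

For a hypothetical unit circuit with two stages before the loop, two in the loop body, and two after it, the bound falls from six work items to two, so at most one third of the stages are busy.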

Korean Patent Laid-Open No. 10-2014-0097548, which is related prior art, concerns a software library for heterogeneous parallel processing platforms. In it, library source code in the OpenCL framework is compiled into an intermediate representation and distributed to the end user's computing system; the CPU of that computer system compiles the intermediate representation of the library into a binary that runs on the GPU, executes the host application that calls the kernel, and sends the kernel retrieved from the binary to the GPU. However, this prior art document does not solve the problem described above.

Therefore, a technique for solving the above-described problems is required.

Meanwhile, the background art described above is technical information that the inventor possessed for, or acquired in the course of, deriving the present invention, and it is not necessarily known technology disclosed to the general public before the filing of the present application.

An embodiment of the present invention is directed to an apparatus and method for compiling an OpenCL program for an FPGA.

In addition, an embodiment of the present invention aims at minimizing unnecessary circuits.

According to a first aspect of the present invention, there is provided a program compiling apparatus for compiling an OpenCL program, comprising: a syntax separator for separating an OpenCL kernel into statements before a loop statement, statements inside the loop statement, and statements after the loop statement; a circuit generator for generating a circuit corresponding to each group of statements; and a language generator for expressing the generated circuit in a hardware description language.

According to a second aspect of the present invention, there is provided a method of compiling an OpenCL program, the method comprising: separating an OpenCL kernel into statements before a loop statement, statements inside the loop statement, and statements after the loop statement; generating a circuit corresponding to each group of statements; and expressing the generated circuit in a hardware description language.

According to a third aspect of the present invention, there is provided a computer-readable recording medium on which a program for performing a program compiling method is recorded, the program compiling method comprising: separating an OpenCL kernel into statements before a loop statement, statements inside the loop statement, and statements after the loop statement; generating a circuit corresponding to each group of statements; and expressing the generated circuit in a hardware description language.

According to a fourth aspect of the present invention, there is provided a computer program stored in a recording medium for performing a program compiling method, the program compiling method comprising: separating an OpenCL kernel into statements before a loop statement, statements inside the loop statement, and statements after the loop statement; generating a circuit corresponding to each group of statements; and expressing the generated circuit in a hardware description language.

According to one of the above-mentioned objects of the present invention, an embodiment of the present invention is directed to an apparatus and a method for compiling an OpenCL program for an FPGA.

In addition, according to any one of the above-described means of the present invention, by replicating, on the FPGA, the circuit that processes a loop statement included in a kernel called by a host program using OpenCL, the performance bottleneck caused by the loop statement can be reduced and degradation of system performance can be prevented.

Further, according to any one of the above-described means of the present invention, for an OpenCL kernel that includes a loop statement, the circuit that processes the contents outside the loop is not replicated; only the circuit that processes the loop is replicated.

In addition, according to any one of the tasks of the present invention, unnecessary circuits are minimized, thereby efficiently using FPGA hardware, and consequently, power consumption can be lowered and performance can be improved.

The effects obtained by the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the following description.

FIG. 1 is a configuration diagram of an OpenCL platform system according to an embodiment of the present invention.
FIG. 2 is a block diagram illustrating a program compiling apparatus according to an embodiment of the present invention.
FIG. 3 is a flowchart illustrating a method of compiling a program according to an embodiment of the present invention.
FIGS. 4 and 5 are exemplary diagrams for explaining a program compiling method according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings, which will be readily apparent to those skilled in the art. The present invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. In order to clearly illustrate the present invention, parts not related to the description are omitted, and similar parts are denoted by like reference characters throughout the specification.

Throughout the specification, when a part is referred to as being "connected" to another part, this includes not only being "directly connected" but also being "electrically connected" with another element in between. Also, when a part is referred to as "comprising" an element, this means that it may further include other elements rather than excluding them, unless specifically stated otherwise.

Hereinafter, the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is a configuration diagram of an OpenCL platform system 100 according to an embodiment of the present invention.

The OpenCL platform system 100 can execute an OpenCL (Open Computing Language) application.

Such a system 100 may include a host processor 10 and one or more computing devices 20.

For example, the host processor 10 may correspond to a CPU and the computing device 20 may be a multicore CPU or accelerator (GPU, Intel Xeon Phi coprocessor, FPGA, etc.).

The host processor 10 can execute the operating system, execute the host program constituting the OpenCL application, and control the computing device 20 using the OpenCL API function according to the host program.

The computing device 20 may consist of one or more compute units (CUs) 21, and each compute unit may in turn consist of one or more processing elements (PEs) 22.

The computing device 20 may have three kinds of memories: a device memory 23, a local memory 24, and a private memory 25.

The device memory 23 consists of a global memory and a constant memory and can be shared by all the PEs 22. The local memory 24 can be allocated independently for each calculation unit 21, and the private memory 25 can be allocated independently for each PE 22.

The computing device 20 receives commands from the host program and, accordingly, executes a kernel of the OpenCL program, copies data from the main memory 11 to the device memory 23, or copies data from the device memory 23 to the main memory 11.

The OpenCL program that forms part of the OpenCL application consists of several kernel functions, is written in OpenCL C, a language similar to C, and can be executed on the computing device 20.

When the OpenCL application is executed in the OpenCL platform system 100, the host program executing in the host processor 10 can define an N-dimensional index space called NDRange, while issuing a command to execute the kernel of the OpenCL program.

At this time, each index of the NDRange is referred to as a work item, and the work items can be classified into a work group.

The computing device 20 may then create a kernel instance that is a thread that executes a kernel function for each work item according to a kernel execution command.

Each work group can be executed on one of the calculation units 21 that constitute the computing device 20, and the work items in each work group can be executed on the PEs 22 of that calculation unit 21.
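The relationship between the NDRange, work items, and work groups can be sketched as follows. This is an illustrative model only, not OpenCL API code; the function names are hypothetical, and it assumes that the global size is evenly divisible by the work-group (local) size in every dimension.

```python
from itertools import product


def work_items(global_size):
    """Enumerate every index of the N-dimensional index space (one per work item)."""
    return list(product(*(range(n) for n in global_size)))


def work_group_of(global_id, local_size):
    """Work-group index of a work item: integer division per dimension."""
    return tuple(g // l for g, l in zip(global_id, local_size))


def group_work_items(global_size, local_size):
    """Map each work-group index to the list of work items it contains."""
    groups = {}
    for gid in work_items(global_size):
        groups.setdefault(work_group_of(gid, local_size), []).append(gid)
    return groups
```

For example, `group_work_items((4, 2), (2, 2))` models a two-dimensional NDRange of eight work items split into two work groups of four work items each.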

On the other hand, when the computing device 20 of the OpenCL platform system 100 is an FPGA, an FPGA circuit capable of executing the OpenCL program must be implemented.

That is, it is necessary to implement an FPGA circuit that takes an OpenCL program, specifically, a kernel function of an OpenCL program as input, and parallelizes various work items of the kernel function.

One technique for implementing an FPGA circuit is to describe the circuit structure in a hardware description language such as Verilog or VHDL and then implement that structure on the FPGA from the description; this is called logic synthesis.

On the other hand, there is a way to write a program in a general high-level language such as the C language, and to change the hardware structure of the FPGA accordingly, which is called high-level synthesis.

High-level synthesis takes a program written in a high-level language as input, creates a circuit structure from it, and then changes the circuit structure of the FPGA by applying logic synthesis. These two processes are separated only conceptually; they do not necessarily have to be two independent steps.

For example, high-level synthesis may be performed by a single piece of software. In that case, the description of the circuit structure need not be in a human-readable hardware description language such as Verilog or VHDL; it may instead be an intermediate representation.

The present invention describes a method for implementing an FPGA circuit that processes an OpenCL kernel including a loop through a high-level synthesis.

The program compiling apparatus according to an embodiment of the present invention may be any component of the OpenCL platform system 100, or may be a component located outside the OpenCL platform system 100. In the following description, it is assumed that the host processor 10 serves as the program compiling apparatus.

FIG. 2 is a block diagram illustrating a program compiling apparatus 20 according to an embodiment of the present invention.

As shown in FIG. 2, the program compiling apparatus 20 according to an embodiment of the present invention may include a graph generating unit 210 for generating a control flow graph based on an OpenCL kernel.

That is, the graph generating unit 210 can express all the paths that the OpenCL kernel can traverse during execution as a control flow graph using graphical notation.

Meanwhile, the program compiling apparatus 20 according to an embodiment of the present invention may further include a syntax separating unit 220 for separating the OpenCL kernel into statements before the loop statement, statements inside the loop statement, and statements after the loop statement.

In addition, the syntax separator 220 can search for a loop in the control flow graph generated by the graph generator 210 and, based on the loop statement it finds, distinguish the statements before the loop statement, the statements inside the loop statement, and the statements after the loop statement.
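One way to picture this separation is the sketch below. The specification does not say which loop-detection algorithm is used; this example assumes the textbook approach of finding a back edge by depth-first search and collecting the natural loop of that edge. The graph encoding (a dictionary from block name to successor list), the block names, and the function names are all hypothetical, and only a single loop is handled.

```python
def find_back_edge(cfg, entry):
    """Return the first (tail, header) edge that targets a block on the DFS stack."""
    on_stack, visited = set(), set()

    def dfs(node):
        visited.add(node)
        on_stack.add(node)
        for succ in cfg[node]:
            if succ in on_stack:
                return (node, succ)
            if succ not in visited:
                found = dfs(succ)
                if found:
                    return found
        on_stack.discard(node)
        return None

    return dfs(entry)


def natural_loop(cfg, tail, header):
    """Blocks of the loop: the header plus everything that reaches tail without passing it."""
    preds = {n: [] for n in cfg}
    for n, succs in cfg.items():
        for s in succs:
            preds[s].append(n)
    body, work = {header}, [tail]
    while work:
        n = work.pop()
        if n not in body:
            body.add(n)
            work.extend(preds[n])
    return body


def reachable(cfg, start):
    """All blocks reachable from start (including start itself)."""
    seen, work = set(), [start]
    while work:
        n = work.pop()
        if n not in seen:
            seen.add(n)
            work.extend(cfg[n])
    return seen


def split_kernel(cfg, entry):
    """Split blocks into (before loop, inside loop, after loop), keeping cfg order."""
    edge = find_back_edge(cfg, entry)
    if edge is None:
        return list(cfg), [], []
    tail, header = edge
    loop = natural_loop(cfg, tail, header)
    pre = [n for n in cfg if n not in loop and header in reachable(cfg, n)]
    post = [n for n in cfg if n not in loop and header not in reachable(cfg, n)]
    return pre, [n for n in cfg if n in loop], post


# Hypothetical kernel: entry -> header <-> body, header -> exit
example_cfg = {"entry": ["header"], "header": ["body", "exit"], "body": ["header"], "exit": []}
```

Calling `split_kernel(example_cfg, "entry")` places `entry` before the loop, `header` and `body` inside it, and `exit` after it.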

Meanwhile, the program compiling apparatus 20 according to an embodiment of the present invention may further include a circuit generating unit 230 for generating a circuit corresponding to each of the statements separated by the syntax separating unit 220.

The circuit generating unit 230 may generate a first unit circuit corresponding to the statements before the loop statement, a second unit circuit corresponding to the statements inside the loop statement, and a third unit circuit corresponding to the statements after the loop statement.

At this time, in order to minimize the number of first and third unit circuits and maximize the number of second unit circuits, the circuit generating unit 230 can generate the second unit circuits by replicating the second unit circuit a predetermined number of times or more based on the FPGA capacity.
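The replication decision can be sketched as a simple budget calculation. The specification only says that replication is based on the FPGA capacity; the abstract "resource units", the per-circuit costs, and the function name below are assumptions made for illustration.

```python
def plan_replication(capacity, pre_cost, loop_cost, post_cost, control_cost_per_copy=0):
    """Keep one first and one third unit circuit; fit as many second unit circuits as possible.

    All costs are in the same hypothetical resource unit as `capacity`.
    """
    per_copy = loop_cost + control_cost_per_copy
    if per_copy <= 0:
        raise ValueError("the loop circuit must have a positive cost")
    copies = (capacity - pre_cost - post_cost) // per_copy
    if copies < 1:
        raise ValueError("the kernel does not fit in the given capacity")
    return {"first": 1, "second": copies, "third": 1}
```

With a hypothetical capacity of 100 units and costs of 10, 25, and 15 units for the three circuits, `plan_replication(100, 10, 25, 15)` yields three copies of the second unit circuit.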

Also, the circuit generating unit 230 may additionally generate a first control circuit between the first unit circuit and the second unit circuit, and may couple the first control circuit to the first and second unit circuits. The first control circuit may check a first signal value indicating whether a new work item can enter the second unit circuit. If it determines that a new work item can enter the second unit circuit, the calculation result produced by the first unit circuit can be transmitted to the second unit circuit together with the ID of the corresponding work item.

In addition, the circuit generating unit 230 may additionally generate a second control circuit between the second unit circuit and the third unit circuit, and may couple the second control circuit to the second and third unit circuits. The second control circuit can check a second signal value indicating that a work item should exit the loop and continue executing in the third unit circuit. If it determines that the work item should exit the second unit circuit, the calculation result produced by the second unit circuit can be transmitted to the third unit circuit together with the ID of the corresponding work item.

Meanwhile, the program compiling apparatus 20 according to an embodiment of the present invention may further include a language generating unit 240 that expresses the circuit generated by the circuit generating unit 230 in a hardware description language.

Although the program compiling apparatus according to an embodiment of the present invention has been described as being implemented by a host processor, the term "part" used in the present embodiment means a software component or a hardware component, such as an FPGA or an ASIC, that performs a certain role, and may also be a CPU, a GPU, or the like. However, a "part" is not limited to software or hardware. A "part" may be configured to reside on an addressable storage medium and may be configured to execute on one or more processors. Thus, by way of example, a "part" may include components such as software components, object-oriented software components, class components, and task components, as well as processes, functions, subroutines, segments of program code, drivers, firmware, microcode, circuitry, data, databases, data structures, tables, arrays, and variables.

The functions provided by the components and "parts" may be combined into a smaller number of components and "parts" or further separated into additional components and "parts".

In addition, the components and "parts" may be implemented to execute on one or more CPUs in a device or a secure multimedia card.

FIG. 3 is a flowchart illustrating a method of compiling a program according to an embodiment of the present invention.

The method for compiling a program according to the embodiment shown in FIG. 3 includes steps that are processed in a time-series manner in the program compiling apparatus 10 shown in FIG. Therefore, the contents described above with respect to the program compiling apparatus 10 shown in FIG. 2 can be applied to the program compiling method according to the embodiment shown in FIG.

FIG. 3 is described below with reference to FIGS. 4 and 5, which are exemplary diagrams for explaining a program compiling method according to an embodiment of the present invention. FIG. 4 is an exemplary diagram of an FPGA implementing an OpenCL kernel that includes one loop statement, and FIG. 5 is an exemplary diagram of an FPGA implementing an OpenCL kernel that includes a plurality of loop statements.

The program compiling apparatus 10 can separate the OpenCL kernel into the part 40 before the loop statement, the loop statement 41, and the part 42 after the loop statement (S310).

For this, the program compiling apparatus 10 can generate a control flow graph (CFG) from the OpenCL kernel, and can use the generated control flow graph to search for loops and separate the kernel.

Thereafter, the program compiling apparatus 10 may generate a circuit structure for each part of the separated kernel (S320).

Referring to FIG. 4, the part 40 before the loop statement of the OpenCL kernel may be implemented as one first unit circuit 43, the loop statement 41 may be implemented as three second unit circuits 44, and the part 42 after the loop statement may be implemented as one third unit circuit 45.

At this time, the first to third unit circuits 43, 44, and 45 may simultaneously execute a plurality of work items using a pipelining technique.

Each of the first to third unit circuits shown in FIG. 4 is assumed to have two pipeline stages, but in other embodiments each unit circuit may be implemented with one or more pipeline stages.

That is, each of the first to third unit circuits 43, 44, and 45 may consist of a plurality of pipeline stages. Each pipeline stage may store the ID of the work item it is currently executing and its calculation result in a register, and each pipeline stage can read the ID and calculation result of the previous pipeline stage and perform its calculation on that work item.
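The register behavior described above can be modeled in software as follows. This is a behavioral sketch only, not the generated hardware; the class name, the `clock` method, and the representation of a register as an `(ID, result)` tuple are assumptions for illustration.

```python
from typing import Callable, List, Optional, Tuple

# One pipeline register: (work-item ID, calculation result), or None when empty.
Slot = Optional[Tuple[int, int]]


class UnitCircuit:
    """Behavioral model of a unit circuit made of pipeline stages."""

    def __init__(self, stage_fns: List[Callable[[int], int]]):
        self.stage_fns = list(stage_fns)
        self.regs: List[Slot] = [None] * len(self.stage_fns)

    def clock(self, incoming: Slot) -> Slot:
        """Advance one clock cycle and return what leaves the last stage."""
        leaving = self.regs[-1]
        # Each stage reads the previous stage's register (or the circuit input),
        # applies its own calculation, and stores the ID with the new result.
        previous = [incoming] + self.regs[:-1]
        self.regs = [
            None if slot is None else (slot[0], fn(slot[1]))
            for fn, slot in zip(self.stage_fns, previous)
        ]
        return leaving
```

With two stages, for example `UnitCircuit([lambda x: x + 1, lambda x: x * 2])`, two work items fed on consecutive clocks occupy both stages at the same time, and each result leaves the circuit tagged with its work-item ID.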

The execution result of the last pipeline stage of the second unit circuit 44 may include the condition value of the loop statement. For example, if the value is true, the work item can be returned to the first pipeline stage of the second unit circuit 44 to execute the loop body again.

Conversely, if the condition value of the loop statement is false, the work item exits the loop statement, and the second unit circuit 44 can therefore accept a new work item that has finished executing in the first unit circuit 43.

According to the embodiment, a single copy of each of the first unit circuit 43 and the third unit circuit 45 can be generated without replication, while the second unit circuit 44 can be replicated as many times as the FPGA capacity allows.

That is, when each work item repeats the loop body tens to hundreds of times, minimizing the number of first unit circuits 43 and third unit circuits 45 and maximizing the number of second unit circuits 44 maximizes the benefit of the present invention.
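A rough, idealized estimate shows why replicating only the loop circuit pays off when iteration counts are high. The model below is an assumption made for illustration, not a result from the specification: it ignores stalls, control-circuit latency, and the time spent in the first and third unit circuits, and it assumes every work item iterates the same number of times.

```python
import math


def estimated_loop_cycles(work_items, iterations, stages, copies):
    """Idealized clock cycles spent in the loop for all work items."""
    in_flight = copies * stages                    # work items inside the loop at once
    batches = math.ceil(work_items / in_flight)    # groups that enter the loop together
    return batches * iterations * stages           # each group circulates `iterations` times
```

Under this model, 1200 work items that each iterate 100 times through a two-stage loop circuit need about 120000 cycles with one copy and about 40000 cycles with three copies.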

Thereafter, the first unit circuit 43 for the part before the loop statement, the second unit circuits 44 for the loop statement, and the third unit circuit 45 for the part after the loop statement are connected through the first and second control circuits 50 and 51 (S330).

That is, the first control circuit 50 may be placed between the first unit circuit 43 and the second unit circuits 44, and the second control circuit 51 between the second unit circuits 44 and the third unit circuit 45, and the first to third unit circuits 43, 44, and 45 may be coupled to the respective control circuits 50 and 51.

Referring to FIG. 4, the first unit circuit 43 and the second unit circuits 44 can be coupled one-to-many by the first control circuit 50, and the second unit circuits 44 and the third unit circuit 45 can be coupled many-to-one by the second control circuit 51.

The first and second control circuits 50 and 51, which are respectively coupled to the first to third unit circuits 43, 44 and 45, can be implemented to operate as follows.

That is, according to one embodiment of the present invention, the first control circuit 50 can receive a first signal from the second unit circuit 44 indicating that a new work item can enter.

In this case, for each second unit circuit 44, the first signal may be 1 when the condition value determined in its last pipeline stage is false and 0 when the condition value is true. The first signal may be 0 while the second unit circuit 44 is stalled, and 1 when the last pipeline stage of the second unit circuit 44 is empty.

For example, when a specific work item reaches the last pipeline stage of the first unit circuit 43, the first control circuit 50 can check the first signal value of all the second unit circuits 44 And can transmit the ID of the work item and the calculation result to the input of the second unit circuit 44 in which the value of the first signal is 1 (that is, a new work item can be executed). If there are a plurality of the second unit circuits 44 whose value of the first signal is 1, one of them can be selected according to a random order.

Alternatively, if the first signals of all the second unit circuits 44 are 0, the first control circuit 50 sends a stall signal to the first unit circuit 43 to suspend execution of the work item.
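The decision made by the first control circuit can be sketched as a small selection function. This is a software model only; the function name, the list encoding of the first signals, and the use of `None` to represent a stall are assumptions for illustration.

```python
import random


def dispatch(first_signals, rng=random):
    """Choose the second unit circuit that receives the next work item.

    first_signals[i] is the first signal (0 or 1) of second unit circuit i.
    Returns the chosen index, or None when every first signal is 0, in which
    case the first unit circuit would be stalled.
    """
    ready = [i for i, signal in enumerate(first_signals) if signal == 1]
    if not ready:
        return None
    return rng.choice(ready)  # random selection among the ready circuits
```

For example, `dispatch([0, 1, 0])` selects circuit 1, while `dispatch([0, 0, 0])` returns `None` to indicate that the first unit circuit must wait.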

According to still another embodiment of the present invention, the second control circuit 51 can receive from each second unit circuit 44 a second signal indicating that the work item in the last pipeline stage of that second unit circuit 44 has exited the loop, and can check the second signal values of all the second unit circuits 44.

For example, if exactly one second unit circuit 44 has a second signal value of 1, the second control circuit 51 can transmit the work item ID and calculation result held in the last pipeline stage of that second unit circuit 44 to the input of the third unit circuit 45. If a plurality of second unit circuits 44 have a second signal value of 1, one of them can be selected in random order, and a stall signal is sent to the remaining second unit circuits to stop them temporarily.
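The arbitration performed by the second control circuit can be modeled in the same style. The function name and return format are hypothetical, and the sketch assumes that "the remaining second unit circuits" means the other circuits whose second signal is 1.

```python
import random


def arbitrate(second_signals, rng=random):
    """Pick one exiting work item to forward to the third unit circuit.

    second_signals[i] is the second signal (0 or 1) of second unit circuit i.
    Returns (chosen, stalled): the index whose work item is forwarded (or None)
    and the indices of the other exiting circuits, which receive a stall signal.
    """
    exiting = [i for i, signal in enumerate(second_signals) if signal == 1]
    if not exiting:
        return None, []
    chosen = rng.choice(exiting)
    stalled = [i for i in exiting if i != chosen]
    return chosen, stalled
```

For example, `arbitrate([0, 1, 0])` forwards the work item from circuit 1 and stalls nothing, whereas with signals `[1, 0, 1]` one of circuits 0 and 2 is forwarded and the other is stalled.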

In this case, the second signal may be 1 when the condition value determined in the last pipeline stage of the second unit circuit 44 is false, and 0 when the condition value is true. While the second unit circuit 44 is stalled, the first signal is 0 and the second signal keeps its original value. The second signal may be 0 if the last pipeline stage of the second unit circuit 44 is empty.
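The signal values described for the first and second signals can be collected into two small functions. The function names and boolean parameters are hypothetical; the priority among the rules (stall first, then an empty last stage, then the loop condition) is an assumption, since the text lists the cases without ordering them.

```python
def first_signal(stalled, last_stage_empty, loop_condition):
    """1 when a new work item may enter the second unit circuit, else 0."""
    if stalled:
        return 0              # a stalled circuit accepts nothing
    if last_stage_empty:
        return 1              # nothing will re-enter the loop, so a slot is free
    return 0 if loop_condition else 1


def second_signal(last_stage_empty, loop_condition):
    """1 when the work item in the last stage exits the loop, else 0.

    A stall does not change this value, so `stalled` is not a parameter.
    """
    if last_stage_empty:
        return 0              # no work item to hand to the third unit circuit
    return 0 if loop_condition else 1
```

When the loop condition of the work item in the last stage is false and the circuit is not stalled, both signals are 1: the work item leaves toward the third unit circuit and a new work item may enter in its place.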

In FIG. 4, it is assumed that the OpenCL kernel includes one loop. However, if the OpenCL kernel includes more than one loop, the kernel may be divided into several parts based on each loop, and each part can be implemented as a unit circuit.

FIG. 5 shows an OpenCL kernel that includes first and second loop statements according to an embodiment of the present invention. Referring to FIG. 5, in step S310 the program compiling apparatus 10 may divide the kernel into parts delimited by the first loop statement and the second loop statement (the parts carry reference numerals including 501, 502, and 504 in FIG. 5). In particular, the control flow graph can be used to search for the first and second loops and to separate the kernel.

The program compiling apparatus 10 then implements each of the divided parts as an FPGA circuit structure in the same manner as described for step S320, and couples the resulting unit circuits through the first to fourth control circuits 505, 506, 507, and 508.

Finally, the program compiling apparatus 10 may express the circuit structure generated in step S330 in a hardware description language (S340).
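As a purely illustrative sketch of step S340, the function below emits a Verilog top-level skeleton for the FIG. 4 arrangement. The specification does not disclose the generated text; the module names (`pre_loop_unit`, `loop_unit`, `post_loop_unit`, `control_first`, `control_second`), the parameter `N`, and the port list are all hypothetical, and the data and handshake ports between the circuits are omitted.

```python
def emit_top(num_second):
    """Return Verilog text instantiating 1 first, N second, and 1 third unit circuit."""
    lines = ["module kernel_top(input wire clk, input wire rst);"]
    lines.append("  pre_loop_unit u_first (.clk(clk), .rst(rst));")
    lines.append(f"  control_first #(.N({num_second})) u_ctrl1 (.clk(clk), .rst(rst));")
    for i in range(num_second):
        # One instance per replicated second unit circuit.
        lines.append(f"  loop_unit u_second_{i} (.clk(clk), .rst(rst));")
    lines.append(f"  control_second #(.N({num_second})) u_ctrl2 (.clk(clk), .rst(rst));")
    lines.append("  post_loop_unit u_third (.clk(clk), .rst(rst));")
    lines.append("endmodule")
    return "\n".join(lines) + "\n"
```

Calling `emit_top(3)` produces a module with three `loop_unit` instances between the two control circuits, mirroring the one-to-many and many-to-one coupling of FIG. 4.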

The method for compiling a program according to the embodiment described with reference to FIG. 3 may also be implemented in the form of a recording medium including instructions executable by a computer, such as a program module executed by a computer. Computer readable media can be any available media that can be accessed by a computer and includes both volatile and nonvolatile media, removable and non-removable media. In addition, the computer-readable medium may include both computer storage media and communication media. Computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Communication media typically includes any information delivery media, including computer readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave, or other transport mechanism.

The method for compiling a program according to an embodiment of the present invention may also be implemented as a computer program (or a computer program product) including instructions executable by a computer. A computer program includes programmable machine instructions that are processed by a processor and can be implemented in a high-level programming language, an object-oriented programming language, an assembly language, or a machine language. The computer program may also be recorded on a tangible computer-readable recording medium (e.g., memory, hard disk, magnetic/optical medium, or solid-state drive).

Thus, a method of compiling a program according to an embodiment of the present invention can be implemented by a computer program as described above being executed by a computing device. The computing device may include a processor, a memory, a storage device, a high-speed interface connected to the memory and a high-speed expansion port, and a low-speed interface connected to the low-speed bus and the storage device. Each of these components is connected to each other using a variety of buses and can be mounted on a common motherboard or mounted in any other suitable manner.

Here, the processor can process instructions within the computing device, for example instructions stored in the memory or the storage device for displaying graphical information to provide a graphical user interface (GUI) on an external input/output device, such as a display connected to the high-speed interface. In other embodiments, multiple processors and/or multiple buses may be used, as appropriate, together with multiple memories and memory types. The processor may also be implemented as a chipset made up of chips that include multiple independent analog and/or digital processors.

The memory also stores information within the computing device. In one example, the memory may comprise volatile memory units or a collection thereof. In another example, the memory may be comprised of non-volatile memory units or a collection thereof. The memory may also be another type of computer readable medium such as, for example, a magnetic or optical disk.

The storage device can provide a large amount of storage space to the computing device. The storage device may be a computer-readable medium or a configuration including such a medium; it may include, for example, devices in a SAN (Storage Area Network) or other configurations, and may be a floppy disk device, a hard disk device, a tape device, flash memory, or another similar semiconductor memory device or device array.

It will be understood by those of ordinary skill in the art that the foregoing description of the present invention is for illustrative purposes only and that various changes and modifications may be made without departing from the spirit or essential characteristics of the present invention. It is therefore to be understood that the above-described embodiments are illustrative in all aspects and not restrictive. For example, each component described as a single entity may be distributed and implemented, and components described as being distributed may also be implemented in a combined form.

The scope of the present invention is defined by the appended claims rather than by the detailed description, and all changes or modifications derived from the meaning and scope of the claims and their equivalents are to be construed as being included within the scope of the present invention.

100: OpenCL platform system
10: Host processor
20: computing device
21: calculation unit
22: PE
23: Device memory
24: local memory
25: Private memory
43: first unit circuit
44: second unit circuit
45: third unit circuit
50: first control circuit
51: second control circuit

Claims

A program compiling device for compiling an OpenCL program, comprising:
a syntax separator that separates the OpenCL kernel into statements before the loop statement, statements inside the loop statement, and statements after the loop statement;
a circuit generator for generating a circuit corresponding to each group of statements; and
a language generator for expressing the generated circuit in a hardware description language,
wherein, so that the number of circuits corresponding to the statements inside the loop statement is maximized and the number of circuits corresponding to the statements before the loop statement and to the statements after the loop statement is minimized, the circuit generator replicates the circuit corresponding to the statements inside the loop statement a predetermined number of times or more based on the FPGA capacity, thereby generating a plurality of circuits corresponding to the statements inside the loop statement.

2. The apparatus according to claim 1,
Further comprising a graph generator that generates a control flow graph based on the OpenCL kernel,
Wherein the syntax separator separates the OpenCL kernel into the statements before the loop statement, the loop statement, and the statements after the loop statement based on the control flow graph.

3. The apparatus according to claim 1,
Wherein the circuit generator generates a first unit circuit corresponding to the statements before the loop statement, a second unit circuit corresponding to the loop statement, and a third unit circuit corresponding to the statements after the loop statement.

4. The apparatus according to claim 3,
Wherein the circuit generator replicates the second unit circuit a predetermined number of times or more to generate one or more second unit circuits.

5. The apparatus according to claim 3,
Wherein each of the first to third unit circuits is composed of one or more pipeline stages so that one or more work items can be executed simultaneously.

6. The apparatus according to claim 3,
Wherein the circuit generator further generates a first control circuit between the first unit circuit and the second unit circuit, and a second control circuit between the second unit circuit and the third unit circuit.

7. The apparatus according to claim 6,
Wherein the first control circuit receives from the second unit circuit, and checks, a first signal value indicating that a new work item can enter the second unit circuit, and
The second control circuit receives from the second unit circuit a second signal value indicating that a work item in the second unit circuit should exit the second unit circuit.

8. A method by which a program compiling apparatus compiles an OpenCL program, the method comprising:
Separating an OpenCL kernel into statements before a loop statement, the loop statement, and statements after the loop statement;
Generating a circuit corresponding to each of the separated statements; and
Expressing the generated circuits in a hardware description language,
Wherein the generating of the circuit comprises replicating the circuit corresponding to the loop statement a predetermined number of times or more, based on the capacity of the computing device that is to execute the OpenCL kernel, so that the number of circuits corresponding to the loop statement is maximized and the number of circuits corresponding to the statements before the loop statement and to the statements after the loop statement is minimized.

9. The method of claim 8,
Further comprising generating a control flow graph based on the OpenCL kernel,
Wherein the separating comprises separating the OpenCL kernel into the statements before the loop statement, the loop statement, and the statements after the loop statement based on the control flow graph.

10. The method of claim 8,
Wherein the generating of the circuit comprises generating a first unit circuit corresponding to the statements before the loop statement, a second unit circuit corresponding to the loop statement, and a third unit circuit corresponding to the statements after the loop statement.

11. The method of claim 10,
Wherein the generating of the circuit further comprises replicating the second unit circuit a predetermined number of times or more to generate one or more second unit circuits.

12. The method of claim 10,
Wherein each of the first to third unit circuits is composed of one or more pipeline stages so that one or more work items can be executed simultaneously.

13. The method of claim 10,
Wherein the generating of the circuit further comprises generating and combining a first control circuit between the first unit circuit and the second unit circuit and a second control circuit between the second unit circuit and the third unit circuit.

14. The method of claim 13,
Wherein the first control circuit receives from the second unit circuit, and checks, a first signal value indicating that a new work item can enter the second unit circuit, and
The second control circuit receives from the second unit circuit a second signal value indicating that a work item in the second unit circuit should exit the second unit circuit.

15. A computer-readable recording medium on which a program for performing the method according to claim 8 is recorded.

16. A computer program stored in a recording medium and executed by a program compiling apparatus to perform the method according to claim 8.