WO2021156954A1 - Offload server, offload control method, and offload program - Google Patents
Offload server, offload control method, and offload program
- Publication number
- WO2021156954A1 (PCT/JP2020/004201)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- processing
- loop
- gpu
- offload
- parallel
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T1/00—General purpose image data processing
- G06T1/20—Processor architectures; Processor configuration, e.g. pipelining
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/40—Transformation of program code
- G06F8/41—Compilation
- G06F8/45—Exploiting coarse grain parallelism in compilation, i.e. parallelism between groups of instructions
- G06F8/451—Code distribution
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/40—Transformation of program code
- G06F8/41—Compilation
- G06F8/44—Encoding
- G06F8/443—Optimisation
- G06F8/4441—Reducing the execution time required by the program code
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/40—Transformation of program code
- G06F8/41—Compilation
- G06F8/45—Exploiting coarse grain parallelism in compilation, i.e. parallelism between groups of instructions
- G06F8/451—Code distribution
- G06F8/452—Loops
Definitions
- the present invention relates to an offload server, an offload control method, and an offload program that automatically offloads functional processing to a GPU (Graphics Processing Unit) or the like.
- GPU: Graphics Processing Unit
- FPGA is a programmable gate array whose configuration can be set by a designer or the like after manufacturing, and is a kind of PLD (Programmable Logic Device).
- AWS: Amazon Web Services
- Azure (registered trademark)
- Azure provides GPU instances and FPGA instances, and these resources can also be used on demand.
- Microsoft® uses FPGAs to streamline searches.
- OpenIoT (IoT: Internet of Things)
- CUDA: Compute Unified Device Architecture
- OpenCL: Open Computing Language
- the GPU and FPGA can be easily used in the user's IoT application. That is, when deploying a general-purpose application such as image processing or encryption processing to be operated in an OpenIoT environment, it is desired that the OpenIoT platform analyze the application logic and automatically offload the processing to the GPU or FPGA.
- CUDA, a development environment for GPGPU that applies the computing power of GPUs to tasks other than image processing, continues to develop.
- GPGPU: General Purpose GPU
- OpenCL has also appeared as a standard for handling hetero hardware such as GPU, FPGA, and manycore CPU in a unified manner.
- Non-Patent Documents 1 and 2 can be cited as efforts to automate the trial-and-error search for parallel processing points.
- In those documents, loop statements suitable for GPU offload are extracted by repeating performance measurement in a verification environment using an evolutionary computation method, and the CPU-GPU transfer of variables used in nested loop statements is consolidated at as high a loop level as possible, so that speed-up is achieved automatically.
- In Non-Patent Documents 1 and 2, the loop statements that can be processed in parallel are first identified in general-purpose code written for the CPU, and a GA (Genetic Algorithm) is then used to search that group of loop statements for a more appropriate parallel processing area, thereby realizing automatic offloading to the GPU.
- GA: Genetic Algorithm
- the techniques of Patent Documents 1 and 2 are premised on automatic speed-up using OpenACC, and there is a problem that performance improvement is insufficient as compared with manual speed-up using CUDA.
- The present invention has been made in view of this point, and its object is to expand the range of offload targets and increase the number of applicable applications.
- To solve the above problem, the present invention is an offload server that offloads specific processing of an application to a GPU, the offload server comprising:
- an application code analysis unit that analyzes the source code of the application;
- a data transfer designation unit that, based on the result of the code analysis, designates GPU processing for loop statements by using at least one selected from the group consisting of the kernels directive, the parallel loop directive, and the parallel loop vector directive of OpenACC;
- a parallel processing designation unit that identifies the loop statements of the application and, for each loop statement, specifies a parallel processing designation statement for the GPU and compiles it;
- a parallel processing pattern creation unit that excludes loop statements that cause a compile error from the offload target and creates parallel processing patterns that specify whether or not to perform parallel processing for loop statements that do not cause a compile error;
- a performance measurement unit that compiles the application of each parallel processing pattern, places it on an accelerator verification device, and executes performance measurement processing when offloaded to the GPU; and
- an executable file creation unit that selects the parallel processing pattern with the highest processing performance from the plurality of parallel processing patterns based on the performance measurement results, and compiles that pattern to create an executable file.
- the offload target range can be expanded to increase the applicable applications.
- the offload server in the embodiment for carrying out the present invention (hereinafter, referred to as “the present embodiment”) will be described with reference to the drawings.
- There are many different applications that one may want to offload. In applications that require a large amount of computation time, such as image analysis for video processing and machine learning for analyzing sensor data, iterative processing by loop statements accounts for much of the run time. The present embodiment therefore targets speed-up by automatically offloading loop statements to the GPU.
- Non-Patent Document 1 proposes automatically finding appropriate loop statements to offload to the GPU by means of a GA (Genetic Algorithm). That is, starting from a general-purpose program that was not written with parallelization in mind, the parallelizable loop statements are first identified. Each parallelizable loop statement is then encoded as a gene whose value is 1 when the loop is executed on the GPU and 0 when it is executed on the CPU, and performance verification trials are repeated in the verification environment to search for an appropriate area. By narrowing the search to parallelizable loop statements and by retaining and recombining, in the form of gene parts, the parallel processing patterns that yield a speed-up, patterns that can be accelerated efficiently are found among the huge number of possible parallel processing patterns.
- In Non-Patent Document 1, the variables used in a nested loop statement are transferred between the CPU and the GPU when the loop statement is offloaded to the GPU.
- When loops are nested, however, the transfer is performed at each lower-level loop, which is inefficient.
- Non-Patent Document 2 proposes that variables that do not cause a problem even if CPU-GPU transfer is performed at the higher level are collectively transferred at the higher level. Loops that take a long time to process and have a large number of loops are often nested, so this is a method that shows a certain effect on speeding up by reducing the number of transfers.
- With the methods of Non-Patent Document 1 and Non-Patent Document 2, automatic speed-up has been confirmed even for medium-scale applications with more than 100 loop statements. For practical use, however, the scope of application must be expanded.
- Figure 1 explains the parameters and the data transfer between the CPU and the GPU in each of three cases: <case where data copy and present are not used>, <case where data copy and present are used>, and <case where data copy and present are used and a temporary area is used as the data storage location>.
- A one-way arrow in FIG. 1 indicates data transfer from the CPU to the GPU or from the GPU to the CPU, and a two-way arrow indicates bidirectional data transfer between the CPU and the GPU.
- The <case where data copy and present are not used> in FIG. 1 is the case of the PGI compiler, which is well known as an OpenACC compiler. That is, to secure the parameter area and initialize the parameters, the CPU transfers the parameter data to the GPU, and the GPU receives the initialization data.
- During loop processing, parameter data is transferred in both directions between the CPU and the GPU. That is, the CPU sends a loop start notification to the GPU and receives a loop end notification from the GPU. As a result, the GPU synchronizes with the host (here, the CPU) in loop units.
- The <case where data copy or present is used> in FIG. 1 is the case of Non-Patent Document 2. In this case, in addition to securing the parameter area and initializing the parameters, the CPU sends a data area start notification and transfers the parameter data to the GPU, and the GPU receives the initial data. The loop start notification is a synchronous transfer between the CPU and the GPU; from the GPU's point of view, synchronization is performed automatically by the loop structure. The CPU then receives the loop end notification from the GPU and, in response, notifies the GPU of the end of the data area. The GPU sends the final result to the host (here, the CPU) and synchronizes with the host.
- In Non-Patent Document 2, even when data copy or present is specified in OpenACC, the compiler may still transfer variables between the CPU and the GPU automatically. Because compilers basically err on the safe side, transfers can occur even when they are unnecessary, depending on compiler-specific handling of conditions such as whether a variable is global or local, where it is initialized, whether it is obtained from another function containing a loop, and whether it is only referenced or is updated within a loop.
- The <case where data copy or present is used and a temporary area is used as the data storage location> in FIG. 1 is the case of the present invention.
- In the present invention, unnecessary CPU-GPU transfers are blocked by creating a temporary area, initializing the parameters in the temporary area, and using it for CPU-GPU transfer. That is, as shown in <case where data copy or present is used and a temporary area is used as the data storage location> in FIG. 1, a temporary area is created on the GPU and the parameters are created in that temporary area. The key point is that the parameters are created in this temporary area.
- The CPU receives a data area start notification from the GPU, and the final result is sent to the host (here, the CPU) to synchronize with the host.
- the [higher speed] of the present invention has been described above.
- the directives are expanded in order to increase the applicable applications.
- As directives for designating GPU processing, the parallel directive is supported in addition to the kernels directive described in Non-Patent Document 2.
- kernels is used for single loops and tightly nested loops.
- parallel loop is used for loops including non-tightly nested loops.
- parallel loop vector is used for loops that cannot be parallelized but can be vectorized.
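The three-way rule above can be sketched as a small lookup. This is a minimal illustration, not the patent's implementation: the `LoopKind` enum and `choose_directive` function are hypothetical names, and the sketch assumes each loop has already been classified by some earlier code analysis step.

```python
from enum import Enum


class LoopKind(Enum):
    """Hypothetical loop classification produced by code analysis."""
    SINGLE = "single"
    TIGHTLY_NESTED = "tightly_nested"
    NON_TIGHTLY_NESTED = "non_tightly_nested"
    VECTORIZABLE_ONLY = "vectorizable_only"  # cannot parallelize, can vectorize


def choose_directive(kind: LoopKind) -> str:
    """Return the OpenACC directive text for a loop of the given kind."""
    if kind in (LoopKind.SINGLE, LoopKind.TIGHTLY_NESTED):
        return "#pragma acc kernels"
    if kind is LoopKind.NON_TIGHTLY_NESTED:
        return "#pragma acc parallel loop"
    if kind is LoopKind.VECTORIZABLE_ONLY:
        return "#pragma acc parallel loop vector"
    raise ValueError(f"unknown loop kind: {kind}")
```

For example, `choose_directive(LoopKind.NON_TIGHTLY_NESTED)` yields the `parallel loop` directive, which would then be written on the line above the loop in the C source.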
- A tightly nested loop is a simple nested loop in which, for example, when two loops that increment i and j are nested, processing using i and j is performed only in the lower loop and not in the upper loop.
- The difference is that with kernels the compiler judges whether parallelization is possible, whereas with parallel the programmer makes that judgment (incorrect logic produces incorrect results, and this is the programmer's responsibility).
- In Non-Patent Document 2, only simple loops were examined. Loop statements that cause errors with kernels, such as non-tightly nested loops and loops that cannot be parallelized, were out of scope, so the scope of application was narrow.
- In the present embodiment, kernels is used for single or tightly nested loops, and parallel loop is used for non-tightly nested loops. parallel loop vector is used for loops that cannot be parallelized but can be vectorized. Using the parallel directive raises the concern that results may be less reliable than with kernels. This concern is addressed as follows: a sample test is run on the final offload program, the difference between its results and the CPU results is checked, and the outcome is shown to the user for confirmation. Because the CPU and GPU are different hardware, there are differences in the number of significant digits and in rounding error, so the difference from the CPU results needs to be checked even when only kernels is used. In this check, the user may also check the result difference with the CPU.
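The result-difference check described above could look like the following minimal sketch. The function names and the tolerance values are assumptions chosen for illustration; the source does not specify how the difference is measured or what threshold is acceptable.

```python
import math


def results_match(cpu_out, gpu_out, rel_tol=1e-6, abs_tol=1e-9):
    """Return True if every GPU value is within tolerance of the CPU value.

    The tolerances are hypothetical defaults; real thresholds depend on
    the application and on the precision used on each device.
    """
    if len(cpu_out) != len(gpu_out):
        return False
    return all(
        math.isclose(c, g, rel_tol=rel_tol, abs_tol=abs_tol)
        for c, g in zip(cpu_out, gpu_out)
    )


def max_abs_difference(cpu_out, gpu_out):
    """Largest absolute element-wise difference, for reporting to the user."""
    return max((abs(c - g) for c, g in zip(cpu_out, gpu_out)), default=0.0)
```

A report shown to the user might include both the pass/fail outcome from `results_match` and the value from `max_abs_difference`, so the user can judge whether rounding differences are acceptable.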
- FIG. 2 is a diagram showing an environment-adapted software system including an offload server 1 according to the present embodiment to which the basic idea of the present invention is applied.
- the environment-adaptive software system according to the present embodiment is characterized by including an offload server 1 in addition to the conventional environment-adaptive software configuration.
- the offload server 1 is an offload server that offloads a specific process of an application to an accelerator. Further, the offload server 1 is communicably connected to each device located in the three layers of the cloud layer 2, the network layer 3, and the device layer 4.
- a data center 30 is arranged in the cloud layer 2
- a network edge 20 is arranged in the network layer 3
- a gateway 10 is arranged in the device layer 4.
- In the environment-adaptive software system, efficiency is realized by appropriately performing function placement and processing offload in each of the device layer, the network layer, and the cloud layer. The main aims are more efficient function placement, in which functions are placed and processed at appropriate locations across the three layers, and greater efficiency through offloading function processing such as image analysis to heterogeneous hardware such as GPUs and FPGAs (Field Programmable Gate Arrays).
- In the cloud layer, the number of servers equipped with heterogeneous HW (hardware) such as GPUs and FPGAs (hereinafter referred to as "hetero devices") is increasing.
- FPGA is also used in Bing search of Microsoft (registered trademark). In this way, by utilizing heterodevices, for example,
- The offload server 1 performs offload processing in the background while users use services in the environment-adapted software system.
- FIG. 3 is a functional block diagram showing a configuration example of the offload server 1 according to the embodiment of the present invention.
- The offload server 1 is a device that automatically offloads specific processing of an application to an accelerator. As shown in FIG. 3, the offload server 1 includes a control unit 11, an input/output unit 12, a storage unit 13, and a verification machine 14 (accelerator verification device).
- The input/output unit 12 consists of a communication interface for transmitting and receiving information to and from each device, and an input/output interface for transmitting and receiving information to and from input devices such as a touch panel or keyboard and output devices such as a monitor.
- the storage unit 13 is composed of a hard disk, a flash memory, a RAM (Random Access Memory), and the like.
- The storage unit 13 stores a test case database (test case DB) 131 and a program (offload program) for executing each function of the control unit 11, and temporarily stores information necessary for the processing of the control unit 11 (for example, the intermediate language file 132).
- Performance test items are stored in the test case DB 131.
- the test case DB 131 stores information for performing a test such as measuring the performance of an application to be accelerated. For example, in the case of a deep learning application for image analysis processing, it is a sample image and a test item for executing it.
- the verification machine 14 includes a CPU (Central Processing Unit), a GPU, and an FPGA (accelerator) as a verification environment for environment-adaptive software.
- the control unit 11 is an automatic offloading function that controls the entire offload server 1.
- The control unit 11 is realized, for example, by a CPU (not shown) loading a program (offload program) stored in the storage unit 13 into RAM and executing it.
- The control unit 11 includes an application code designation unit (Specify application code) 111, an application code analysis unit (Analyze application code) 112, a data transfer designation unit 113, a parallel processing designation unit 114, a parallel processing pattern creation unit 115, a performance measurement unit 116, an executable file creation unit 117, a production environment placement unit (Deploy final binary files to production environment) 118, a performance measurement test extraction execution unit (Extract performance test cases and run automatically) 119, and a user providing unit (Provide price and performance to a user to judge) 120.
- the application code designation unit 111 specifies the input application code. Specifically, the application code designation unit 111 specifies a processing function (image analysis, etc.) of the service provided to the user.
- the application code analysis unit 112 analyzes the source code of the processing function and grasps the structure of the loop statement, the FFT library call, and the like.
- The data transfer designation unit 113 specifies that, among the variables that need to be transferred between the CPU and the GPU, variables that are not mutually referenced or updated between CPU processing and GPU processing, and for which only the result of GPU processing is returned to the CPU, are transferred collectively before the start and after the end of GPU processing.
- the variable that needs to be transferred between the CPU and the GPU is a variable defined by a plurality of files or a plurality of loops from the result of code analysis.
- The data transfer designation unit 113 designates batch data transfer before the start and after the end of GPU processing by using data copy of OpenACC.
- The data transfer designation unit 113 adds a directive indicating that no transfer is required when the variables to be processed by the GPU have already been collectively transferred to the GPU side.
- the data transfer designation unit 113 uses OpenACC data present to clearly indicate that transfer is not required for variables that are collectively transferred before the start of GPU processing and that do not need to be transferred at the timing of loop statement processing.
- The data transfer designation unit 113 instructs variable transfer by creating a temporary area on the GPU side (#pragma acc declare create) at the time of data transfer between the CPU and the GPU, storing the data in the temporary area, and then synchronizing the temporary area (#pragma acc update).
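As a rough illustration of the temporary-area approach, the sketch below generates the two directive lines that would be inserted into the C source for one variable. The function name `temp_area_directives` and the `direction` argument are hypothetical. The source names only `#pragma acc declare create` and `#pragma acc update`; the `device(...)` and `host(...)` clauses used here are standard OpenACC clauses of the update directive, chosen as an assumption about how the synchronization direction would be expressed.

```python
def temp_area_directives(var: str, direction: str) -> list[str]:
    """Return OpenACC directive lines for a temporary-area transfer.

    var:       name of the C variable (hypothetical input from code analysis)
    direction: "to_gpu" to push host data to the device copy,
               "to_cpu" to pull device data back to the host
    """
    declare = f"#pragma acc declare create({var})"
    if direction == "to_gpu":
        update = f"#pragma acc update device({var})"
    elif direction == "to_cpu":
        update = f"#pragma acc update host({var})"
    else:
        raise ValueError(f"unknown direction: {direction}")
    return [declare, update]
```

Calling `temp_area_directives("a", "to_gpu")` returns the declare line followed by the update line; where exactly each line is placed in the source is left to the caller in this sketch.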
- The data transfer designation unit 113 designates GPU processing for the loop statement by using at least one selected from the group consisting of the kernels directive, the parallel loop directive, and the parallel loop vector directive of OpenACC.
- the OpenACC kernels directive is used for single loops and tightly nested loops.
- the OpenACC parallel loop directive is used for non-tightly nested loops.
- OpenACC's parallel loop vector directive is used for loops that cannot be parallelized but can be vectorized.
- The parallel processing designation unit 114 identifies the loop statements (repetition statements) of the application and, for each repetition statement, designates processing on the GPU with an OpenACC directive and compiles it.
- The parallel processing designation unit 114 includes an offload range extraction unit (Extract offloadable area) 114a and an intermediate language file output unit (Output intermediate file) 114b.
- the offload range extraction unit 114a identifies processing that can be GPU offloaded, such as a loop statement, and extracts an intermediate language according to the offload processing.
- the intermediate language file output unit 114b outputs the extracted intermediate language file 132.
- Intermediate language extraction is not a one-time process; it is repeated so that executions can be tried and optimized in the search for an appropriate offload area.
- The parallel processing pattern creation unit 115 excludes loop statements (repetition statements) in which a compile error occurs from the offload target, and creates parallel processing patterns that specify whether or not to perform parallel processing for the repetition statements in which no compile error occurs.
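The pattern creation step can be sketched as filtering out the failing loops and then assigning each remaining loop a 0 or 1. The names `offloadable_loops` and `all_patterns` are hypothetical, and exhaustive enumeration is used here only to make the pattern space visible; for realistic loop counts the number of patterns grows as 2^n, which is why the source searches it with a GA instead.

```python
from itertools import product


def offloadable_loops(loop_ids, compile_error_ids):
    """Drop loops whose directive-annotated build failed to compile."""
    return [loop for loop in loop_ids if loop not in compile_error_ids]


def all_patterns(candidates):
    """Yield every parallel processing pattern as {loop_id: 0 or 1}.

    1 means the loop is offloaded to the GPU, 0 means it stays on the CPU
    (the 1/0 assignment is an assumption for this sketch).
    """
    for bits in product((0, 1), repeat=len(candidates)):
        yield dict(zip(candidates, bits))
```

With three loops of which one fails to compile, two candidates remain and `all_patterns` yields four patterns.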
- the performance measurement unit 116 compiles the application of the parallel processing pattern, arranges it on the verification machine 14, and executes the performance measurement processing when it is offloaded to the GPU.
- the performance measurement unit 116 includes a binary file arrangement unit (Deploy binary files) 116a.
- the binary file placement unit 116a deploys (places) an executable file derived from an intermediate language on a verification machine 14 equipped with a GPU.
- the performance measurement unit 116 executes the arranged binary file, measures the performance when offloaded, and returns the performance measurement result to the offload range extraction unit 114a.
- The offload range extraction unit 114a extracts another parallel processing pattern, and the intermediate language file output unit 114b tries the performance measurement based on the extracted intermediate language (see reference numeral e in FIG. 3 below).
- The executable file creation unit 117 selects the parallel processing pattern with the highest processing performance from a plurality of parallel processing patterns based on the performance measurement results repeated a predetermined number of times, and compiles that pattern to create an executable file.
- the production environment placement unit 118 places the created executable file in the production environment for users (“placement of the final binary file in the production environment”).
- the production environment placement unit 118 determines a pattern that specifies the final offload area, and deploys it in the production environment for users.
- After the executable file is placed, the performance measurement test extraction execution unit 119 extracts performance test items from the test case DB 131 and automatically executes the extracted performance tests in order to show the performance to the user.
- the user providing unit 120 presents information such as price and performance to the user based on the performance test result (“Providing information such as price and performance to the user”).
- the test case DB 131 stores data for automatically performing a test for measuring the performance of the application.
- the user providing unit 120 presents to the user the price of the entire system determined by the result of executing the test data of the test case DB 131 and the unit price of each resource (virtual machine, FPGA instance, GPU instance, etc.) used in the system.
- The user decides whether to start paid use of the service based on the presented information such as price and performance.
- the offload server 1 can use an evolutionary computation method such as GA for optimizing offload.
- the configuration of the offload server 1 when GA is used is as follows. That is, the parallel processing designation unit 114 sets the number of loop statements (repeated statements) that do not cause a compilation error as the gene length based on the genetic algorithm.
- The parallel processing pattern creation unit 115 maps whether or not accelerator processing is performed onto the gene pattern, assigning one of 1 or 0 to loops for which accelerator processing is performed and the other value to loops for which it is not.
- The parallel processing pattern creation unit 115 prepares gene patterns for a specified number of individuals, in which each gene value is randomly set to 1 or 0. For each individual, the performance measurement unit 116 compiles the application code in which the parallel processing designation statements for the GPU are specified according to that individual, and places it on the verification machine 14. The performance measurement unit 116 then executes the performance measurement process on the verification machine 14.
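A minimal sketch of the gene encoding and initial population follows. The function names are hypothetical, the gene length equals the number of loops that compile without error, and the convention that 1 means "offload to GPU" is an assumption (the source allows either assignment).

```python
import random


def random_population(gene_length, size, rng):
    """Create `size` individuals, each a list of random 0/1 gene values."""
    return [[rng.randint(0, 1) for _ in range(gene_length)] for _ in range(size)]


def loops_to_offload(gene, loops):
    """Decode a gene: return the loops whose gene value is 1 (GPU)."""
    if len(gene) != len(loops):
        raise ValueError("gene length must equal the number of candidate loops")
    return [loop for bit, loop in zip(gene, loops) if bit == 1]
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible, which can help when comparing searches on the verification machine.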
- When a gene with the same parallel processing pattern as a previous one arises in an intermediate generation, the performance measurement unit 116 does not compile the application code corresponding to that pattern or measure its performance again, and instead uses the same performance measurement value as before. Further, for application code in which a compilation error occurs and application code whose performance measurement does not finish within a predetermined time, the performance measurement unit 116 treats the case as a timeout and sets the performance measurement value to a predetermined (long) time.
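The reuse of earlier measurements and the timeout treatment can be sketched as a small cache around a measurement callable. Everything named here is hypothetical: `measure` stands in for "compile, deploy to the verification machine, and time the run", `CompileError` is an exception that stand-in would raise, and the sketch compares the returned time against the limit rather than actually interrupting a long run.

```python
class CompileError(Exception):
    """Raised by the hypothetical measure() callable when compilation fails."""


class MeasurementCache:
    """Memoize measured times per gene and apply a fixed timeout penalty."""

    def __init__(self, measure, timeout_sec, penalty_sec):
        self._measure = measure          # callable: tuple of 0/1 -> seconds
        self._timeout_sec = timeout_sec  # limit beyond which a run counts as timed out
        self._penalty_sec = penalty_sec  # long time recorded for failures
        self._results = {}

    def evaluate(self, gene):
        key = tuple(gene)
        if key in self._results:
            # Same pattern seen before: reuse the earlier value.
            return self._results[key]
        try:
            elapsed = self._measure(key)
        except CompileError:
            elapsed = self._penalty_sec
        if elapsed > self._timeout_sec:
            elapsed = self._penalty_sec
        self._results[key] = elapsed
        return elapsed
```

Because failed and timed-out patterns receive a long time, they end up with low fitness and are unlikely to be selected for the next generation.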
- Executable file creation unit 117 measures the performance of all individuals and evaluates them so that the shorter the processing time, the higher the degree of suitability.
- the executable file creation unit 117 selects a high-performance individual from all the individuals, performs crossover and mutation processing on the selected individual, and creates a next-generation individual.
- Selection methods include, for example, roulette selection, in which individuals are chosen stochastically in proportion to their fitness.
- the executable file creation unit 117 selects the highest-performance parallel processing pattern as a solution after the processing of the specified number of generations is completed.
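The selection, crossover, and mutation steps described above can be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: fitness is taken as the reciprocal of the measured time (the source says only that shorter times give higher fitness), crossover is single-point, genes are assumed to have at least two positions, and all function names are hypothetical.

```python
import random


def fitness(elapsed_sec):
    """Shorter processing time gives higher fitness (assumed: 1 / time)."""
    return 1.0 / elapsed_sec


def roulette_select(population, times, rng):
    """Pick one individual with probability proportional to its fitness."""
    weights = [fitness(t) for t in times]
    return rng.choices(population, weights=weights, k=1)[0]


def crossover(parent_a, parent_b, rng):
    """Single-point crossover; requires gene length >= 2."""
    point = rng.randint(1, len(parent_a) - 1)
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b


def mutate(gene, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [1 - bit if rng.random() < rate else bit for bit in gene]
```

A next generation could be built by repeatedly selecting two parents with `roulette_select`, applying `crossover`, and passing each child through `mutate` until the population size is reached.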
- the offload server 1 of this embodiment is an example applied to GPU automatic offload of user application logic as an elemental technology of environment adaptation software.
- FIG. 4 is a diagram showing an automatic offload process using the GA of the offload server 1. As shown in FIG. 4, the offload server 1 is applied to the elemental technology of environment-adapted software.
- the offload server 1 has a control unit (automatic offload function unit) 11, a test case DB 131, an intermediate language file 132, and a verification machine 14.
- the offload server 1 acquires the application code 130 used by the user.
- the offload server 1 automatically offloads the function processing to the accelerator of the device 152 having the CPU-GPU and the device 153 having the CPU-FPGA.
- In step S11, the application code designation unit 111 (see FIG. 3) specifies a processing function (image analysis, etc.) of the service provided to the user. Specifically, the application code designation unit 111 specifies the input application code.
- <Step S12: Analyze application code>
- the application code analysis unit 112 analyzes the source code of the processing function and grasps the structure of the loop statement, the FFT library call, and the like.
- <Step S13: Extract offloadable area>
- the parallel processing designation unit 114 specifies a loop statement (repetition statement) of the application, specifies GPU processing with OpenACC for each repetition statement, and compiles it.
- the offload range extraction unit 114a identifies a process that can be offloaded to the GPU, such as a loop statement, and extracts an intermediate language according to the offload process.
- <Step S14: Output intermediate file>
- the intermediate language file output unit 114b (see FIG. 3) outputs the intermediate language file 132.
- Intermediate language extraction is not a one-time process; it is repeated so that executions can be tried and optimized in the search for an appropriate offload area.
- <Step S15: Compile error>
- The parallel processing pattern creation unit 115 excludes loop statements in which a compile error occurs from the offload target, and creates parallel processing patterns that specify whether or not to perform parallel processing for the repetition statements in which no compile error occurs.
- <Step S21: Deploy binary files>
- the binary file placement unit 116a (see FIG. 3) deploys the executable file derived from the intermediate language on the verification machine 14 equipped with the GPU.
- the binary file placement unit 116a starts the placed file, executes the assumed test case, and measures the performance when offloaded.
- <Step S22: Measure performances>
- the performance measuring unit 116 executes the arranged file and measures the performance when offloaded. In order to make the offload area more appropriate, this performance measurement result is returned to the offload range extraction unit 114a, and the offload range extraction unit 114a extracts another pattern. Then, the intermediate language file output unit 114b tries to measure the performance based on the extracted intermediate language (see reference numeral e in FIG. 4). The performance measurement unit 116 repeats the performance measurement in the verification environment and finally determines the code pattern to be deployed.
- control unit 11 compiles each repeating statement by designating the GPU processing with OpenACC.
- <Step S23: Deploy final binary files to production environment>
- the production environment placement unit 118 determines a pattern in which the final offload area is specified and deploys it in the production environment for users.
- <Step S24: Extract performance test cases and run automatically>
- the performance measurement test extraction execution unit 119 extracts performance test items from the test case DB 131 and automatically executes the extracted performance tests in order to show the performance to the user after arranging the execution file.
- <Step S25: Provide price and performance to a user to judge>
- the user providing unit 120 presents information such as price and performance to the user based on the performance test result.
- the user decides whether to start paid use of the service based on the presented information such as price and performance.
- steps S11 to S25 are performed in the background of the user's service use, and are assumed to be performed, for example, during the first day of temporary use.
- when the control unit (automatic offload function unit) 11 of the offload server 1 is applied as an elemental technology of environment adaptive software, it extracts the area to be offloaded from the source code of the application used by the user in order to offload function processing, and outputs an intermediate language (steps S11 to S15). The control unit 11 deploys and executes an executable file derived from the intermediate language on the verification machine 14 to verify the offload effect (steps S21 to S22). After repeating the verification and determining an appropriate offload area, the control unit 11 deploys the executable file in the production environment actually provided to the user and provides it as a service (steps S23 to S25).
- the GPU automatic offload is a process that repeats steps S12 to S22 of FIG. 4 for the GPU to finally obtain the offload code to be deployed in step S23.
- GPU is a device that generally does not guarantee latency, but is suitable for increasing throughput by parallel processing.
- IoT data encryption processing
- image processing for camera image analysis
- machine learning processing for mass sensor data analysis and the like, and these are often repetitive processes. Therefore, the aim is to speed up the application by automatically offloading its loop statements to the GPU.
- an appropriate offload area is automatically extracted from a general-purpose program that was not written with parallelization in mind. To do this, the parallelizable for statements are checked first, and then performance verification trials are repeated in the verification environment, using GA over the group of parallelizable for statements, to search for an appropriate area.
- FIG. 5 is a diagram showing a search image of the control unit (automatic offload function unit) 11 by Simple GA.
- FIG. 5 shows a search image of processing and gene sequence mapping of a for statement.
- GA is one of the combinatorial optimization methods that imitates the evolutionary process of living organisms.
- the flow of GA is initialization -> evaluation -> selection -> crossover -> mutation -> end judgment.
- here, Simple GA, a simplified form of GA, is used.
- Simple GA is a simplified GA in which genes take only the values 1 and 0, selection is roulette selection, crossover is one-point crossover, and mutation inverts the value of one gene.
- <Initialization> After checking whether all for statements in the application code can be parallelized, the parallelizable for statements are mapped to the gene sequence. A value of 1 is set when GPU processing is performed, and 0 when GPU processing is not performed. A designated number of individuals M is prepared, and 1 or 0 is randomly assigned to each for statement.
- the control unit (automatic offload function unit) 11 acquires the application code 130 (see FIG. 3) used by the user and, as shown in FIG. 5, checks from the code patterns 141 of the application code 130 whether or not each for statement can be parallelized. As shown in FIG. 5, when five for statements are found from the code pattern 141 (see the reference numeral f in FIG.
- one digit is assigned to each for statement; here, five digits are used, one for each of the five for statements.
- 1 or 0 is assigned to each digit. For example, 0 means the for statement is processed by the CPU and 1 means it is sent to the GPU. At this stage, however, 1 or 0 is assigned randomly.
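The initialization step can be sketched as follows. This is a minimal illustration rather than the implementation described in this document: the function name init_population, the fixed seed, and the use of Python are assumptions made for the example; only the 1/0 encoding and the random assignment come from the text.

```python
import random


def init_population(num_loops, num_individuals, rng):
    """Create num_individuals genes with one random bit per parallelizable for statement.

    A bit of 1 means "offload this for statement to the GPU", 0 means "keep it on the CPU".
    """
    return [[rng.randint(0, 1) for _ in range(num_loops)]
            for _ in range(num_individuals)]


rng = random.Random(0)  # fixed seed (an assumption) so the sketch is repeatable
population = init_population(num_loops=5, num_individuals=4, rng=rng)
for gene in population:
    print("".join(str(bit) for bit in gene))
```

Each printed line is one individual, in the same five-digit form as the patterns such as 10010 mentioned in the text.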
- <Evaluation> In the evaluation, deployment and performance measurement are performed (see reference numeral g in FIG. 5). That is, the performance measurement unit 116 (see FIG. 3) compiles the code corresponding to the gene, deploys it on the verification machine 14, and executes it. The performance measurement unit 116 measures the benchmark performance. The goodness of fit of a gene is set higher for a pattern (parallel processing pattern) with better performance.
- <Selection> High-performance code patterns are selected based on the goodness of fit (see reference numeral h in FIG. 5).
- the performance measurement unit 116 selects the specified number of individuals with a high goodness of fit, based on the goodness of fit.
- roulette selection according to the goodness of fit and elite selection of the gene with the highest goodness of fit are performed.
- in FIG. 5, the search image shows the number of circles in the selected code patterns 142 reduced to three.
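A sketch of the selection step, assuming the goodness of fit of each individual is already known. The function name select and the sample values are hypothetical; random.choices from the Python standard library is used here to stand in for roulette selection because it draws items with probability proportional to their weights.

```python
import random


def select(population, fitness, num_selected, rng):
    """Keep the fittest individual (elite), then fill the rest by roulette selection."""
    elite_index = max(range(len(population)), key=lambda i: fitness[i])
    selected = [population[elite_index]]
    # Roulette selection: probability of being drawn is proportional to fitness.
    selected += rng.choices(population, weights=fitness, k=num_selected - 1)
    return selected


population = [[1, 0, 0, 1, 0], [0, 1, 0, 0, 1], [0, 0, 1, 0, 1], [1, 1, 0, 0, 0]]
fitness = [1.0, 0.1, 10.0, 0.5]  # illustrative values only
selected = select(population, fitness, num_selected=3, rng=random.Random(1))
```

The elite individual always survives, while the remaining slots favor, but do not guarantee, high-fitness patterns.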
- <Crossover> In crossover, at a constant crossover rate Pc, some genes are exchanged between selected individuals at a certain point to create offspring individuals.
- <Mutation> Each value of an individual's gene is changed from 0 to 1 or from 1 to 0 at a constant mutation rate Pm. Mutation is introduced to avoid local solutions. A mode in which mutation is not performed may also be used in order to reduce the amount of calculation.
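One-point crossover and bit-flip mutation can be sketched as below. The function names and the way the crossover point is drawn are assumptions for illustration; Pc and Pm correspond to the crossover rate and mutation rate named in the text.

```python
import random


def one_point_crossover(parent_a, parent_b, pc, rng):
    """With probability pc, swap the tails of two genes at one random point."""
    if len(parent_a) < 2 or rng.random() >= pc:
        return parent_a[:], parent_b[:]
    point = rng.randint(1, len(parent_a) - 1)
    return (parent_a[:point] + parent_b[point:],
            parent_b[:point] + parent_a[point:])


def mutate(gene, pm, rng):
    """Flip each bit independently with probability pm."""
    return [1 - bit if rng.random() < pm else bit for bit in gene]


rng = random.Random(2)
child_a, child_b = one_point_crossover([1, 0, 0, 1, 0], [0, 1, 0, 0, 1], pc=1.0, rng=rng)
unchanged = mutate([1, 0, 0, 1, 0], pm=0.0, rng=rng)
flipped = mutate([1, 0, 0, 1, 0], pm=1.0, rng=rng)
```

Setting pm to 0 gives the mode without mutation that the text mentions as a way to reduce the amount of calculation.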
- with OpenACC, a compiler extracts bytecode for the GPU from code marked with the directive #pragma acc kernels, which enables GPU offload. By placing a for statement under this #pragma, it is possible to determine whether or not that for statement runs on the GPU.
- a code pattern having a gene length corresponding to the number of for statements has been obtained.
- parallel processing patterns 10010, 01001, 00101, ... are randomly assigned.
- an error may occur even though the for statement can be offloaded. This happens when for statements are nested (GPU processing is possible if either level is specified). In this case, the for statement that caused the error may be left as is.
- the code is deployed on the verification machine 14 and benchmarked; for example, an image processing application is benchmarked with that image processing.
- the goodness of fit is based on the reciprocal of the processing time: for example, 1 for a pattern that takes 10 seconds, 0.1 for one that takes 100 seconds, and 10 for one that takes 1 second.
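The three figures quoted above are consistent with a reciprocal scaled so that 10 seconds maps to 1. The sketch below reproduces them under that assumption; the function name and the scale parameter are hypothetical and are not stated in the text.

```python
import math


def fitness_from_time(seconds, scale=10.0):
    """Goodness of fit proportional to the reciprocal of the processing time.

    scale=10.0 is an assumption chosen so that a 10-second run scores 1.
    """
    return scale / seconds


scores = {t: fitness_from_time(t) for t in (1.0, 10.0, 100.0)}
```

A shorter processing time therefore always yields a higher goodness of fit.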
- patterns with a high goodness of fit are selected, for example 3 to 5 out of 10, and recombined to create new code patterns.
- a code pattern identical to an earlier one may be produced while creating new patterns. In that case the same benchmark need not be run again, and the earlier data is reused.
- the code pattern and its processing time are stored in the storage unit 13.
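Reusing earlier measurements amounts to a lookup keyed by the code pattern. A minimal sketch follows, where a dictionary stands in for the storage unit 13 and run_benchmark is a hypothetical callback representing compilation, deployment, and measurement on the verification machine.

```python
def measure_with_cache(pattern, cache, run_benchmark):
    """Return the processing time for a pattern, measuring it only the first time."""
    if pattern not in cache:
        cache[pattern] = run_benchmark(pattern)
    return cache[pattern]


calls = []


def fake_benchmark(pattern):
    # Stand-in for a real measurement; records each call and returns a made-up time.
    calls.append(pattern)
    return 10.0 - pattern.count("1")


cache = {}
first = measure_with_cache("10010", cache, fake_benchmark)
second = measure_with_cache("10010", cache, fake_benchmark)
```

The second call returns the stored value without running the benchmark again.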
- the search image of the control unit (automatic offload function unit) 11 by Simple GA has been described above. Next, a batch processing method for data transfer will be described.
- based on the results, variables defined in multiple files for which GPU processing and CPU processing are not nested, that is, for which CPU processing and GPU processing are separated, are transferred collectively.
- a temporary area is created (#pragma acc declare create), the data is stored in the temporary area, and then the temporary area is synchronized (#pragma acc update) to instruct the transfer.
- Comparative examples are a normal CPU program (see FIG. 6), simple GPU utilization (see FIG. 7), and nesting batch (Non-Patent Document 2) (see FIG. 8).
- in the following description, <1> to <4> at the beginning of the loop statements in the figure are added for convenience of explanation (the same applies to other figures and their explanations).
- the reference numeral k in FIG. 6 is the setting of the variables c and d in the loop <3>
- the reference numeral l in FIG. 6 is the setting of the variables e and f in the loop <4>.
- the normal CPU program shown in FIG. 6 is executed by the CPU (does not use the GPU).
- FIG. 7 is a diagram showing the loop statements when the normal CPU program shown in FIG. 6 is run with simple GPU utilization and data is transferred from the CPU to the GPU.
- the types of data transfer include data transfer from the CPU to the GPU and data transfer from the GPU to the CPU.
- data transfer from the CPU to the GPU will be taken as an example.
- the processing part that can be processed in parallel such as the for statement by the PGI compiler is specified by the OpenACC directive #pragma acc kernels.
- as shown in the broken line frame including the reference numeral n in FIG. 7, c and d are transferred at this timing by #pragma acc kernels.
- FIG. 8 is a diagram showing a loop statement in the case of data transfer from the CPU to the GPU and from the GPU to the CPU by nesting batch (Non-Patent Document 2).
- the data transfer instruction line from the CPU to the GPU, here #pragma acc data copyin (a, b) with the copyin clause for the variables a and b, is inserted at the position indicated by the symbol o in the figure.
- the data transfer instruction line from the GPU to the CPU is #pragma acc data copyout (a, b) with the copyout clause for the variables a and b.
- FIG. 9 is a diagram showing a loop statement by batch transfer at the time of data transfer of the CPU-GPU of the present embodiment.
- FIG. 9 corresponds to the nesting batch of FIG. 8 of the comparative example.
- the data transfer instruction line from the CPU to the GPU, here #pragma acc data copyin (a, b, c, d) with the copyin clause for the variables a, b, c, and d, is inserted.
- variables that have been transferred collectively using the above #pragma acc data copyin (a, b, c, d) and do not need to be transferred at that timing are specified, at the timing shown in the two-dot chain line frame including the symbol p in the figure, using the data present statement #pragma acc data present (c, d), which states explicitly that the GPU already holds the variables.
- variables that can be transferred in a batch are transferred in a batch, and variables that have already been transferred and do not need to be transferred again are specified using data present, which reduces transfers and further improves offload efficiency.
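As a rough illustration of how the two instruction lines relate, the sketch below builds the batched copyin line and the matching present line as strings. The function name and the split into outer and inner variable lists are assumptions for the example; deciding which variables qualify depends on the reference analysis described in the text and is not modeled here.

```python
def batch_transfer_directives(outer_vars, inner_vars):
    """Build one batched copyin line and a present line for already-transferred variables."""
    batched = list(dict.fromkeys(outer_vars + inner_vars))  # keep order, drop duplicates
    copyin = "#pragma acc data copyin(" + ", ".join(batched) + ")"
    present = "#pragma acc data present(" + ", ".join(inner_vars) + ")"
    return copyin, present


copyin_line, present_line = batch_transfer_directives(["a", "b"], ["c", "d"])
print(copyin_line)
print(present_line)
```

With a and b as the outer variables and c and d as the inner ones, the two lines match the directives quoted in the text.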
- the compiler may decide on its own to transfer data. Automatic transfer by the compiler, as distinct from a transfer instructed through OpenACC, refers to a case where a transfer between the CPU and GPU is not actually needed but is performed anyway, depending on the compiler.
- FIG. 10 is a diagram showing a loop statement by batch transfer at the time of data transfer of the CPU-GPU of the present embodiment.
- FIG. 10 corresponds to the nesting batch and the variable specification that does not require transfer in FIG.
- the OpenACC declare create statement #pragma acc declare create that creates a temporary area at the time of data transfer of the CPU-GPU is specified at the position indicated by the reference numeral s in FIG.
- when the CPU-GPU data is transferred, a temporary area is created (#pragma acc declare create), and the data is stored in the temporary area.
- transfer is instructed by specifying the OpenACC update directive #pragma acc update, which synchronizes the temporary area, at the position indicated by the symbol t in FIG.
- the number of loops is investigated by using a profiling tool as a preliminary step of a full-scale offload processing search. Since the profiling tool can be used to investigate the number of executions of each line, it is possible to sort programs in advance, for example, targeting a program having a loop of 50 million times or more as an offload processing search.
- a specific description will be given (partially overlaps with the content described in FIG. 4).
- the application that searches the offload processing unit is analyzed, and the loop statements such as for, do, and while are grasped.
- sample processing is executed, a profiling tool is used to investigate the number of loops in each loop statement, and whether or not to perform a full-scale offload processing unit search is determined by whether there is a loop whose count is at or above a certain value.
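The screening decision can be sketched as a threshold test over per-loop execution counts. The dictionary of counts is assumed to have been gathered beforehand with a profiling tool; the function name and sample data are hypothetical, and the default threshold uses the 50 million figure given as an example in the text.

```python
def should_search_offload(loop_counts, threshold=50_000_000):
    """Return True if any loop statement ran at least `threshold` times."""
    return any(count >= threshold for count in loop_counts.values())


# Hypothetical per-loop execution counts keyed by source line.
heavy_app = {"main.c:42": 120_000_000, "main.c:77": 3_000}
light_app = {"util.c:10": 5_000, "util.c:31": 250_000}

print(should_search_offload(heavy_app))
print(should_search_offload(light_app))
```

Only applications that pass this test proceed to the GA search, which avoids spending verification time on programs with little repeated work.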
- the GA processing is started (see Fig. 4 above).
- in the initialization step, after checking whether or not all the loop statements of the application code can be parallelized, the parallelizable loop statements are mapped to the gene sequence as 1 when GPU processing is performed and 0 when not. A specified number of individuals is prepared, and 1 or 0 is randomly assigned to each value of the gene.
- the code corresponding to the gene is compiled, deployed on the verification machine, executed, and the benchmark performance is measured. The goodness of fit is set higher for genes whose patterns perform better.
- a parallel processing instruction line (for example, see reference numeral j in FIG. 6) and a data transfer instruction line (for example, see reference numeral l in FIG. 6, reference numeral m in FIG. 7, and reference numeral o in the figure) are inserted into the code corresponding to the gene.
- genes with high goodness of fit are selected for a specified number of individuals based on the goodness of fit.
- roulette selection and elite selection of the highest goodness-of-fit gene are performed according to the goodness of fit.
- in the crossover step, at a constant crossover rate Pc, some genes are exchanged between selected individuals at a certain point to create offspring individuals.
- each value of an individual's gene is changed from 0 to 1 or from 1 to 0 at a constant mutation rate Pm.
- the process is terminated after repeating the specified number of generations, and the gene with the highest goodness of fit is used as the solution. The highest-performance code pattern, which corresponds to the gene with the highest goodness of fit, is redeployed to the production environment and provided to users.
- the implementation of the offload server 1 will be described below. This implementation is for confirming the effectiveness of this embodiment.
- An implementation that automatically offloads a C/C++ application using a general-purpose PGI compiler will be described. Since the purpose of this implementation is to confirm the effectiveness of GPU automatic offload, the target application is a C/C++ language application, and the GPU processing itself uses a conventional PGI compiler for explanation.
- the C/C++ language is the most popular in the development of OSS (Open Source Software) and proprietary software, and many applications are being developed in the C/C++ language.
- a general-purpose application of OSS such as encryption processing and image processing is used.
- the GPU processing is performed by the PGI compiler.
- the PGI compiler is a compiler for C/C++/Fortran that interprets OpenACC.
- the parallelizable processing unit such as the for statement is specified by the OpenACC directive #pragma acc kernels (parallel processing specification statement).
- the bytecode for the GPU is extracted, and the GPU can be offloaded by executing the bytecode.
- an error is issued when the data in the for statement is interdependent and cannot be processed in parallel, or when multiple different levels of nested for statements are specified.
- directives such as #pragma acc data copyin/copyout/copy enable explicit data transfer instructions.
- the benchmark is executed and the loop counts of the for statements identified by the above parsing are obtained.
- GNU coverage (gcov) or the like is used to obtain the loop counts.
- as profiling tools, "GNU profiler (gprof)" and "GNU coverage (gcov)" are known. Either can be used because both report the number of executions of each line. The search can be restricted, for example, to applications with a loop count of 10 million or more, and this threshold can be changed.
- Compile errors are difficult to deal with automatically, and even if they are dealt with, they often have no effect.
- an error from an external routine call may be avoided with #pragma acc routine, but many external calls are to libraries, and even if GPU processing includes them, the call becomes a bottleneck and performance does not improve. Since the for statements are tried one at a time, no compile error arises from nesting at this stage.
- a is the gene length. A gene value of 1 corresponds to the presence of a parallel processing directive and 0 to its absence, and the application code is mapped to a gene of length a.
- for all individuals, after the benchmark performance is measured, the goodness of fit of each gene sequence is set according to the benchmark processing time. The individuals to be kept are selected according to the set goodness of fit. The selected individuals undergo GA processing, namely crossover, mutation, and copying as is, to create the next-generation population.
- Directive insertion, compilation, performance measurement, fitness setting, selection, crossover, and mutation processing are performed for next-generation individuals.
- the same measured value as before is used without compiling or measuring the performance of the individual.
- the C/C++ code with directives corresponding to the highest performance gene sequence is used as the solution.
- the number of individuals, the number of generations, the crossover rate, the mutation rate, the goodness of fit setting, and the selection method are GA parameters and are specified separately.
- FIGS. 11A-B are flowcharts for explaining the operation outline of the above-described implementation, and FIGS. 11A and 11B are connected by a coupler. The following processing is performed using the OpenACC compiler for C/C++.
- in step S101, the application code analysis unit 112 (see FIG. 3) analyzes the code of the C/C++ application.
- in step S102, the parallel processing designation unit 114 (see FIG. 3) identifies the loop statements and reference relationships of the C/C++ application.
- in step S103, the parallel processing designation unit 114 checks the GPU processing possibility of each loop statement (#pragma acc kernels).
- the control unit (automatic offload function unit) 11 repeats the processing of steps S105 to S116 for the number of loop statements between the loop start end of step S104 and the loop end of step S117.
- the control unit (automatic offload function unit) 11 repeats the processing of steps S106 to S107 for the number of loop statements between the loop start end of step S105 and the loop end of step S108.
- the parallel processing specification unit 114 compiles each loop statement by specifying GPU processing (#pragma acc kernels) by OpenACC.
- the parallel processing specification unit 114 checks the GPU processing possibility with the following directive (#pragma acc parallel loop).
- the control unit (automatic offload function unit) 11 repeats the processing of steps S110 to S111 for the number of loop statements between the loop start end of step S109 and the loop end of step S112.
- the parallel processing specification unit 114 compiles each loop statement by designating GPU processing (#pragma acc parallel loop) with OpenACC.
- the parallel processing specification unit 114 checks the GPU processing possibility with the following directive (#pragma acc parallel loop vector).
- the control unit (automatic offload function unit) 11 repeats the processing of steps S114 to S115 for the number of loop statements between the loop start end of step S113 and the loop end of step S116.
- the parallel processing specification unit 114 compiles each loop statement by designating GPU processing (#pragma acc parallel loop vector) with OpenACC.
- the parallel processing designation unit 114 removes the GPU processing instruction clause from the loop statement when an error occurs.
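The order in which the three directives are tried can be sketched as below. The callback named compiles is a hypothetical stand-in for a trial compilation with an OpenACC compiler, and the loop identifiers are made up; the flowchart processes all loops per directive, whereas this sketch iterates per loop, which yields the same assignment when the trials are independent.

```python
DIRECTIVES = (
    "#pragma acc kernels",
    "#pragma acc parallel loop",
    "#pragma acc parallel loop vector",
)


def choose_directive(loop_id, compiles):
    """Return the first directive that compiles for this loop, or None to leave it on the CPU."""
    for directive in DIRECTIVES:
        if compiles(loop_id, directive):
            return directive
    return None  # every trial failed: remove the GPU processing directive


def fake_compiles(loop_id, directive):
    # Stand-in for a real trial compilation; the accepted sets are invented.
    accepted = {
        "loop1": {"#pragma acc kernels"},
        "loop2": {"#pragma acc parallel loop"},
        "loop3": set(),
    }
    return directive in accepted[loop_id]


plan = {loop: choose_directive(loop, fake_compiles) for loop in ("loop1", "loop2", "loop3")}
```

Loops that end up with None are excluded from the gene, which is why the gene length is counted only after these trials.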
- in step S118, the parallel processing designation unit 114 counts the number of for statements that do not generate a compilation error and sets it as the gene length.
- the parallel processing designation unit 114 prepares a gene sequence of a designated number of individuals. Here, 0 and 1 are randomly assigned and created.
- the parallel processing designation unit 114 maps the C/C++ application code to the gene and prepares the designated population pattern. Depending on the prepared gene sequence, if the gene value is 1, a directive that specifies parallel processing is inserted into the C/C++ code (see, for example, the #pragma directive in FIG. 5).
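Mapping a gene onto source code can be sketched as inserting a directive line above each loop whose gene value is 1. The function name, the representation of the source as a list of lines, and the sample code are assumptions for illustration; the sketch always uses #pragma acc kernels and ignores the other directives.

```python
def insert_directives(source_lines, loop_line_numbers, gene, directive="#pragma acc kernels"):
    """Insert `directive` above every loop whose gene value is 1.

    loop_line_numbers[i] is the 0-based index of the i-th parallelizable loop,
    and gene[i] is its 1/0 value.
    """
    targets = {line_no for line_no, bit in zip(loop_line_numbers, gene) if bit == 1}
    annotated = []
    for index, line in enumerate(source_lines):
        if index in targets:
            indent = line[:len(line) - len(line.lstrip())]
            annotated.append(indent + directive)
        annotated.append(line)
    return annotated


source = [
    "for (i = 0; i < n; i++) a[i] = b[i] * 2;",
    "total = 0;",
    "for (i = 0; i < n; i++) c[i] = a[i] + 1;",
]
annotated = insert_directives(source, loop_line_numbers=[0, 2], gene=[1, 0])
print("\n".join(annotated))
```

With the gene 10, only the first loop receives the directive and the second stays on the CPU.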
- the control unit (automatic offload function unit) 11 repeats the processing of steps S121 to S128 for a specified number of generations between the loop start end of step S120 and the loop end of step S129. Further, in the repetition of the designated number of generations, the designated number of individuals is further repeated for the processing of steps S122 to S125 between the loop start end of step S121 and the loop end of step S126. That is, in the repetition of the specified number of generations, the repetition of the specified number of individuals is processed in a nested state.
- in step S122, the data transfer designation unit 113 specifies data transfers using explicit instruction lines (#pragma acc data copy / copyin / copyout / present and #pragma acc declare create, #pragma acc update) based on the variable reference relationships.
- in step S123, the parallel processing pattern creation unit 115 (see FIG. 2) compiles the C/C++ code, with directives specified according to the gene pattern, using the PGI compiler. That is, the parallel processing pattern creation unit 115 compiles the created C/C++ code with the PGI compiler on the verification machine 14 equipped with the GPU.
- a compile error may occur when multiple nested for statements are specified in parallel. In this case, the processing time at the time of performance measurement is treated in the same manner as when the time-out occurs.
- in step S124, the performance measurement unit 116 (see FIG. 2) deploys the executable file on the verification machine 14 equipped with the CPU-GPU.
- in step S125, the performance measurement unit 116 executes the deployed binary file and measures the benchmark performance when offloaded.
- in step S127, the executable file creation unit 117 (see FIG. 2) evaluates individuals so that a shorter processing time gives a higher goodness of fit, and selects the individuals with higher performance.
- in step S128, the executable file creation unit 117 performs crossover and mutation processing on the selected individuals to create next-generation individuals.
- the executable file creation unit 117 performs compilation, performance measurement, fitness setting, selection, crossover, and mutation processing on the next-generation individual. That is, for all individuals, after measuring the benchmark performance, the goodness of fit of each gene sequence is set according to the benchmark processing time. The individual to be left is selected according to the set goodness of fit.
- the executable file creation unit 117 applies GA processing, namely crossover, mutation, and copying as is, to the selected individuals to create the next-generation population.
- in step S130, after the GA processing for the specified number of generations is completed, the executable file creation unit 117 takes the C/C++ code corresponding to the highest-performance gene sequence (the highest-performance parallel processing pattern) as the solution.
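The nested loop over generations and individuals can be put together in one compact sketch. Everything here is illustrative: run_simple_ga and synthetic_time are hypothetical names, the cost model replaces real compilation and benchmarking, and compile errors and timeouts are left out. The fitness uses the (-1/2) power of the processing time mentioned among the GA settings.

```python
import random


def run_simple_ga(gene_length, measure_time, num_individuals, num_generations,
                  pc=0.9, pm=0.05, seed=0):
    """Search 1/0 offload patterns and return (best_pattern, best_time) among measured ones."""
    rng = random.Random(seed)
    cache = {}  # pattern -> measured processing time, so repeats are not re-measured

    def fitness(gene):
        key = tuple(gene)
        if key not in cache:
            cache[key] = measure_time(key)
        return cache[key] ** -0.5

    population = [[rng.randint(0, 1) for _ in range(gene_length)]
                  for _ in range(num_individuals)]
    for _ in range(num_generations):
        scores = [fitness(gene) for gene in population]
        elite = population[scores.index(max(scores))]
        next_generation = [elite[:]]  # elite preservation: no crossover or mutation
        while len(next_generation) < num_individuals:
            a, b = rng.choices(population, weights=scores, k=2)  # roulette selection
            child = a[:]
            if gene_length > 1 and rng.random() < pc:
                point = rng.randint(1, gene_length - 1)  # one-point crossover
                child = a[:point] + b[point:]
            child = [1 - bit if rng.random() < pm else bit for bit in child]
            next_generation.append(child)
        population = next_generation
    best = min(cache, key=cache.get)
    return list(best), cache[best]


def synthetic_time(gene):
    # Made-up cost model: loops 0 and 2 speed up on the GPU, loop 1 slows down.
    return 10.0 - 2.0 * gene[0] + 4.0 * gene[1] - 3.0 * gene[2]


best_gene, best_time = run_simple_ga(3, synthetic_time, num_individuals=3, num_generations=3)
```

The result is the best pattern that was actually measured, which need not be the global optimum for such a small number of individuals and generations.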
- the above-mentioned number of individuals, number of generations, crossover rate, mutation rate, goodness of fit setting, and selection method are parameters of GA.
- the GA parameter may be set as follows, for example.
- the parameters and conditions of Simple GA to be executed can be set as follows, for example.
  - Gene length: number of loop statements that can be parallelized
  - Number of individuals M: gene length or less
  - Number of generations T: gene length or less
  - Goodness of fit: (processing time)^(-1/2)
- with this setting, the shorter the benchmark processing time, the higher the goodness of fit. Further, by setting the goodness of fit to the (-1/2) power of the processing time, it is possible to prevent the goodness of fit of a specific individual with a short processing time from becoming too high and narrowing the search range.
- when a measurement times out, the goodness of fit is calculated assuming that the processing time is a long time such as 1000 seconds. This timeout time may be changed according to the performance measurement characteristics.
- Selection: roulette selection. However, the gene with the highest goodness of fit in a generation is neither crossed nor mutated, and is preserved into the next generation (elite preservation).
- Crossover rate Pc: 0.9; mutation rate Pm: 0.05
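These settings can be collected in one place, together with the goodness-of-fit formula. The dictionary layout and function name are assumptions; the numeric values are the example values given above, and a timed-out run (or, as described for compile errors, a failed build) is scored as if it had taken the timeout time.

```python
import math

GA_PARAMS = {
    "crossover_rate": 0.9,      # Pc
    "mutation_rate": 0.05,      # Pm
    "timeout_seconds": 1000.0,  # processing time assumed for a timed-out measurement
}


def goodness_of_fit(processing_time, timed_out=False, params=GA_PARAMS):
    """Goodness of fit as the (-1/2) power of the processing time in seconds."""
    seconds = params["timeout_seconds"] if timed_out else processing_time
    return seconds ** -0.5


fast = goodness_of_fit(1.0)
slow = goodness_of_fit(100.0)
failed = goodness_of_fit(0.0, timed_out=True)
```

A run that is 100 times faster scores only 10 times higher, which is the damping effect the text describes; with a plain reciprocal the ratio would be 100.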
- gcov, gprof, etc. are used to identify in advance an application with many loops and taking a long time to execute, and perform an offload trial. This makes it possible to find applications that can be speeded up efficiently.
- the set of directives is expanded in order to increase the number of applicable applications.
- as directives for specifying GPU processing, the parallel loop directive and the parallel loop vector directive are added to the kernels directive.
- kernels are used for single loops and tightly nested loops.
- parallel loops are used for loops including non-tightly nested loops.
- the parallel loop vector is used for loops that cannot be parallelized but can be vectorized.
- a tightly nested loop is a nested loop in which, for example, when two loops that increment i and j are nested, the processing using i and j is performed in the inner loop and not in the outer loop.
- in Non-Patent Document 2, simple loops are targeted, and loop statements that cause an error with kernels, such as non-tightly nested loops and loops that cannot be parallelized, are excluded, so the scope of application is narrow.
- kernels are used for single and tightly nested loops
- parallel loops are used for non-tightly nested loops.
- use a parallel loop vector for loops that cannot be parallelized but can be vectorized.
- with the parallel directive, there is a concern that the results are less reliable than with kernels.
- a sample test is performed on the final offload program, the difference between its results and the CPU results is checked, the result is shown to the user, and the user confirms it.
- since the hardware differs between the CPU and GPU, there are differences in the number of significant digits and in rounding error, so the difference from the CPU results needs to be checked even when only kernels is used.
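A result-difference check of this kind can be sketched as an element-wise comparison with a tolerance. The function name, the sample values, and the tolerance values are all assumptions; an appropriate tolerance depends on the application and is not specified in the text.

```python
import math


def results_match(cpu_values, gpu_values, rel_tol=1e-6, abs_tol=1e-9):
    """Compare CPU and GPU outputs element by element, allowing small rounding differences."""
    return len(cpu_values) == len(gpu_values) and all(
        math.isclose(c, g, rel_tol=rel_tol, abs_tol=abs_tol)
        for c, g in zip(cpu_values, gpu_values)
    )


cpu_out = [1.0, 2.5, 1000.0]
gpu_close = [1.0000001, 2.5, 1000.0001]  # differs only by rounding-sized amounts
gpu_wrong = [1.0, 2.6, 1000.0]           # a real discrepancy

print(results_match(cpu_out, gpu_close))
print(results_match(cpu_out, gpu_wrong))
```

The outcome of such a check is what would be shown to the user for confirmation.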
- the offload server 1 is realized by, for example, a computer 900 having a configuration as shown in FIG.
- FIG. 12 is a hardware configuration diagram showing an example of a computer 900 that realizes the function of the offload server 1.
- the computer 900 has a CPU 910, a RAM 920, a ROM 930, an HDD 940, a communication interface (I / F: Interface) 950, an input / output interface (I / F) 960, and a media interface (I / F) 970.
- the CPU 910 operates based on the program stored in the ROM 930 or the HDD 940, and controls each part.
- the ROM 930 stores a boot program executed by the CPU 910 when the computer 900 is started, a program that depends on the hardware of the computer 900, and the like.
- the HDD 940 stores a program executed by the CPU 910, data used by such a program, and the like.
- the communication interface 950 receives data from another device via the communication network 80 and sends it to the CPU 910, and transmits the data generated by the CPU 910 to the other device via the communication network 80.
- the CPU 910 controls an output device such as a display or a printer and an input device such as a keyboard or a mouse via the input / output interface 960.
- the CPU 910 acquires data from the input device via the input / output interface 960. Further, the CPU 910 outputs the generated data to the output device via the input / output interface 960.
- the media interface 970 reads the program or data stored in the recording medium 980 and provides the program or data to the CPU 910 via the RAM 920.
- the CPU 910 loads the program from the recording medium 980 onto the RAM 920 via the media interface 970, and executes the loaded program.
- the recording medium 980 is, for example, an optical recording medium such as a DVD (Digital Versatile Disc) or PD (Phase change rewritable Disk), a magneto-optical recording medium such as an MO (Magneto Optical disk), a tape medium, a magnetic recording medium, or a semiconductor memory.
- the CPU 910 of the computer 900 realizes the functions of each part of the offload server 1 by executing the program loaded on the RAM 920. Further, the HDD 940 stores the data in each part of the offload server 1. The CPU 910 of the computer 900 reads and executes these programs from the recording medium 980, but as another example, these programs may be acquired from another device via the communication network 80.
- the offload server 1 includes: the application code analysis unit 112 that analyzes the source code of the application; the parallel processing designation unit 114 that, based on the result of the code analysis, specifies GPU processing for each loop statement with the kernels directive, the parallel loop directive, or the parallel loop vector directive of OpenACC and compiles it; the parallel processing pattern creation unit 115 that excludes loop statements causing a compile error from the offload targets and creates parallel processing patterns that specify whether or not to process in parallel the loop statements that do not cause a compile error; the performance measurement unit 116 that compiles the application of each parallel processing pattern, places it on the accelerator verification device 14, and executes the performance measurement processing when offloaded to the GPU; and the executable file creation unit 117 that selects the parallel processing pattern with the highest processing performance from the plurality of parallel processing patterns and compiles it to create an executable file.
- the OpenACC parallel directive is targeted. Then, the offload target range can be expanded by instructing the non-tightly nested loop, which causes an error in the kernels directive, with the parallel directive. This can be applied to more applications.
- the kernels directive is used for the single loop and the tightly nested loop.
- OpenACC standard kernels directive can be applied to the single loop and tightly nested loop.
- the parallel loop directive is used for non-tightly nested loops.
- the offload target range can be expanded beyond that of Non-Patent Documents 1 and 2, and the number of applicable applications can be increased.
- the parallel loop vector directive is used for loops that cannot be parallelized but can be vectorized.
- by applying the parallel loop vector directive to loops that cannot be parallelized but can be vectorized, the offload target range can be expanded beyond that of Non-Patent Documents 1 and 2, and the number of applicable applications can be increased.
- the parallel processing designation unit 114 sets the number of loop statements that do not cause a compilation error as the gene length based on the genetic algorithm, and the parallel processing pattern creation unit 115 maps whether or not GPU processing is performed to the gene pattern, using one of 1 and 0 when GPU processing is performed and the other when it is not.
- the performance measurement unit 116 compiles, for each individual, the application code in which the parallel processing specification statements for the GPU are specified, places it on the accelerator verification device, and executes the performance measurement processing on the accelerator verification device.
- the executable file creation unit 117 measures the performance of each individual, evaluates individuals so that a shorter processing time gives a higher goodness of fit, and selects individuals whose goodness of fit is higher than a predetermined value as high-performance individuals. The selected individuals are then subjected to crossover and mutation processing to create next-generation individuals, and after the processing for the specified number of generations is completed, the highest-performance parallel processing pattern is selected as the solution.
- the present invention is an offload program for making a computer function as the above offload server.
- each function of the above-mentioned offload server 1 can be realized by using a general computer.
- each component of each of the illustrated devices is a functional concept and does not necessarily have to be physically configured as shown in the figure. That is, the specific form of distribution and integration of each device is not limited to the one shown in the figure, and all or part of each device can be functionally or physically distributed or integrated in arbitrary units according to various loads and usage conditions.
- each of the above configurations, functions, processing units, processing means, and the like may be realized in hardware by designing some or all of them as, for example, an integrated circuit. Each of the above configurations, functions, and the like may also be realized in software, with a processor interpreting and executing a program that realizes each function. Information such as programs, tables, and files that realize each function can be held in a memory, a hard disk, a recording device such as an SSD (Solid State Drive), or a recording medium such as an IC (Integrated Circuit) card, an SD (Secure Digital) card, or an optical disk.
- a genetic algorithm (GA) method is used in order to be able to find a solution to the combinatorial optimization problem within a limited optimization period.
- an OpenACC compiler for C/C++ is used, but any compiler that can offload processing to the GPU may be used.
- for example, Java lambda (registered trademark) GPU processing with the IBM Java 9 SDK (registered trademark) may be used.
- the parallel processing specification statement depends on these development environments. For example, in Java (registered trademark), parallel processing can be described in lambda format from Java 8 onward. IBM (registered trademark) provides a JIT compiler that offloads parallel processing descriptions in lambda format to the GPU. In Java, the same offload is possible by using these and tuning with the GA whether or not each loop is written in lambda format.
- the for statement is illustrated as the repeating statement (loop statement), but while statements and do-while statements are also included.
- however, the for statement, which specifies the loop continuation condition and the like, is more suitable.
- 1 Offload server; 11 Control unit; 12 Input/output unit; 13 Storage unit; 14 Verification machine (accelerator verification device); 15 OpenIoT resource; 111 Application code specification unit; 112 Application code analysis unit; 113 Data transfer specification unit; 114 Parallel processing specification unit; 114a Offload range extraction unit; 114b Intermediate language file output unit; 115 Parallel processing pattern creation unit; 116 Performance measurement unit; 116a Binary file placement unit; 117 Executable file creation unit; 118 Production environment placement unit; 119 Performance measurement test extraction execution unit; 120 User provision unit; 130 Application code; 131 Test case DB; 132 Intermediate language file; 151 Various devices; 152 Device with CPU-GPU; 153 Device with CPU-FPGA; 154 Device with CPU
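
The directive rules and the GA-based search described in the items above can be illustrated with a minimal Python sketch. It is not the implementation of the offload server. The loop class names, the `toy_time` cost model (which stands in for compiling each pattern and measuring it on the verification machine), the use of the median fitness as the "predetermined value", one-point crossover, and all numeric parameters are assumptions chosen only for illustration.

```python
import random

# Directive chosen for each loop class, following the rules listed above.
# The class names are hypothetical labels, not identifiers from the source.
DIRECTIVES = {
    "single": "#pragma acc kernels",
    "tightly_nested": "#pragma acc kernels",
    "non_tightly_nested": "#pragma acc parallel loop",
    "vectorizable_only": "#pragma acc parallel loop vector",
}


def annotate(loop_classes, genes):
    """Map a gene pattern to per-loop directives (None means stay on the CPU)."""
    return [DIRECTIVES[c] if g == 1 else None
            for c, g in zip(loop_classes, genes)]


def toy_time(genes, cpu=(5.0, 8.0, 3.0, 6.0), gpu=(1.0, 2.0, 4.0, 1.5),
             transfer=0.5):
    """Hypothetical cost model standing in for a real measurement run."""
    run = sum(gpu[i] if g else cpu[i] for i, g in enumerate(genes))
    return run + transfer * sum(genes)


def search(n_loops, measure, pop_size=8, generations=10,
           crossover_rate=0.9, mutation_rate=0.05, seed=0):
    """Return the fastest 0/1 pattern found after a fixed number of generations."""
    rng = random.Random(seed)
    # Gene length equals the number of candidate loops; 1 = GPU, 0 = CPU.
    pop = [[rng.randint(0, 1) for _ in range(n_loops)] for _ in range(pop_size)]
    best = min(pop, key=measure)
    for _ in range(generations):
        # A shorter processing time gives a higher fitness.
        scored = sorted(pop, key=measure)
        fitness = [1.0 / measure(ind) for ind in scored]
        threshold = fitness[len(fitness) // 2]  # assumed cut-off: median fitness
        parents = [ind for ind, f in zip(scored, fitness) if f >= threshold]
        children = []
        while len(children) < pop_size:
            a, b = rng.choice(parents), rng.choice(parents)
            child = a[:]
            if n_loops > 1 and rng.random() < crossover_rate:
                cut = rng.randint(1, n_loops - 1)  # one-point crossover
                child = a[:cut] + b[cut:]
            # Mutation flips each gene with a small probability.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        pop = children
        best = min(pop + [best], key=measure)  # remember the best pattern seen
    return best


loops = ["single", "tightly_nested", "non_tightly_nested", "vectorizable_only"]
best = search(len(loops), toy_time)
print(best, annotate(loops, best))
```

In a real deployment the `measure` argument would wrap compilation and timed execution of the annotated code rather than a formula, so each fitness evaluation is expensive; that is why the search is bounded by a fixed number of generations.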
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Software Systems (AREA)
- Devices For Executing Special Programs (AREA)
Priority Applications (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/JP2020/004201 WO2021156954A1 (ja) | 2020-02-04 | 2020-02-04 | Offload server, offload control method, and offload program |
| JP2021575142A JP7363930B2 (ja) | 2020-02-04 | 2020-02-04 | Offload server, offload control method, and offload program |
| US17/797,190 US12033235B2 (en) | 2020-02-04 | 2020-02-04 | Offload server, offload control method, and offload program |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/JP2020/004201 WO2021156954A1 (ja) | 2020-02-04 | 2020-02-04 | Offload server, offload control method, and offload program |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2021156954A1 (ja) | 2021-08-12 |
Family
ID=77200425
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/JP2020/004201 Ceased WO2021156954A1 (ja) | 2020-02-04 | 2020-02-04 | Offload server, offload control method, and offload program |
Country Status (3)
| Country | Link |
|---|---|
| US (1) | US12033235B2 |
| JP (1) | JP7363930B2 |
| WO (1) | WO2021156954A1 |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2021156954A1 (ja) * | 2020-02-04 | 2021-08-12 | Nippon Telegraph and Telephone Corporation | Offload server, offload control method, and offload program |
Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2019216127A1 (ja) * | 2018-05-09 | 2019-11-14 | Nippon Telegraph and Telephone Corporation | Offload server and offload program |
Family Cites Families (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US10372497B1 (en) | 2017-09-05 | 2019-08-06 | Parallels International Gmbh | Offloading GPU computations for computers and virtual machines |
| US11593138B2 (en) | 2019-05-20 | 2023-02-28 | Microsoft Technology Licensing, Llc | Server offload card with SoC and FPGA |
| US11204711B2 (en) | 2019-10-31 | 2021-12-21 | EMC IP Holding Company LLC | Method and system for optimizing a host computing device power down through offload capabilities |
| WO2021156954A1 (ja) * | 2020-02-04 | 2021-08-12 | Nippon Telegraph and Telephone Corporation | Offload server, offload control method, and offload program |
2020
- 2020-02-04 WO PCT/JP2020/004201 patent/WO2021156954A1/ja not_active Ceased
- 2020-02-04 JP JP2021575142A patent/JP7363930B2/ja active Active
- 2020-02-04 US US17/797,190 patent/US12033235B2/en active Active
Patent Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2019216127A1 (ja) * | 2018-05-09 | 2019-11-14 | Nippon Telegraph and Telephone Corporation | Offload server and offload program |
Non-Patent Citations (2)
| Title |
|---|
| "#pragma acc kernels loop Versus #pragma acc parallel loop", NVIDIA, 28 May 2015 (2015-05-28), pages 1 - 4, XP055844731, Retrieved from the Internet <URL:https://www.pgroup.com/userforum/viewtopic.php?t=4759> [retrieved on 20200310] * |
| YAMATO YOJI ET AL.: "Parallel processing area extraction and data transfer number reduction for automatic GPU offloading of IoT applications", IEICE TECHNICAL REPORT, vol. 118, no. KBSE2018-37, 10 November 2018 (2018-11-10), pages 53 - 58 * |
Also Published As
| Publication number | Publication date |
|---|---|
| JP7363930B2 (ja) | 2023-10-18 |
| JPWO2021156954A1 | 2021-08-12 |
| US12033235B2 (en) | 2024-07-09 |
| US20230066594A1 (en) | 2023-03-02 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| JP7063289B2 (ja) | | Method and program for optimal software placement by an offload server |
| JP6927424B2 (ja) | | Offload server and offload program |
| JP7632712B2 (ja) | | Offload server and offload control method |
| JP6992911B2 (ja) | | Offload server and offload program |
| JP7363931B2 (ja) | | Offload server, offload control method, and offload program |
| US12050894B2 (en) | | Offload server, offload control method, and offload program |
| JP7363930B2 (ja) | | Offload server, offload control method, and offload program |
| JP7184180B2 (ja) | | Offload server and offload program |
| JP7521597B2 (ja) | | Offload server, offload control method, and offload program |
| JP7662037B2 (ja) | | Offload server, offload control method, and offload program |
| WO2022102071A1 (ja) | | Offload server, offload control method, and offload program |
| JP7716632B2 (ja) | | Offload server, offload control method, and offload program |
| JP7806893B2 (ja) | | Offload server, offload control method, and offload program |
| WO2024147197A1 (ja) | | Offload server, offload control method, and offload program |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 20917356; Country of ref document: EP; Kind code of ref document: A1 |
| | ENP | Entry into the national phase | Ref document number: 2021575142; Country of ref document: JP; Kind code of ref document: A |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 20917356; Country of ref document: EP; Kind code of ref document: A1 |