WO2023144926A1

WO2023144926A1 - Offload server, offload control method, and offload program

Info

Publication number: WO2023144926A1
Application number: PCT/JP2022/002880
Authority: WO
Inventors: 庸次山登
Original assignee: 日本電信電話株式会社
Priority date: 2022-01-26
Filing date: 2022-01-26
Publication date: 2023-08-03
Also published as: JPWO2023144926A1

Abstract

This offload server (1) comprises: a performance measurement unit (118) that compiles an application program of a parallel processing pattern, places it in an accelerator verification device, and executes processing for measuring performance when offloaded to an accelerator; and a placement setting unit (170) that, when placing the converted application program on one of a cloud server, a carrier edge server, and a user edge server on a network in accordance with a cost or response time condition designated by a user, calculates and sets the placement location of the application program on the basis of a linear programming formula in which device and link costs, a computational resource upper limit, and a bandwidth upper limit are used as constraint conditions and the cost of the computational resources or the response time is used as the objective function.

Description

Offload server, offload control method and offload program

The present invention relates to an offload server, an offload control method, and an offload program that automatically offload functional processing to an accelerator such as a GPU (Graphics Processing Unit) or FPGA (Field Programmable Gate Array) and place the converted application program (hereinafter referred to as an application as appropriate) in an appropriate location.

The use of heterogeneous computational resources other than the CPU (Central Processing Unit) is increasing. For example, image processing has begun to be performed on servers with enhanced GPUs (accelerators), and signal processing has begun to be accelerated with FPGAs (accelerators). An FPGA is a programmable gate array whose configuration can be set by a designer or the like after manufacturing, and is a type of PLD (Programmable Logic Device). Amazon Web Services (AWS) (registered trademark) provides GPU instances and FPGA instances, and these resources can be used on demand. Microsoft (registered trademark) uses FPGAs to make search more efficient.

It is expected that various applications will be created using service coordination technology and the like, and the performance of those applications can be expected to improve by making use of more advanced hardware. To do so, however, programming and configuration matched to the target hardware are required. For example, a great deal of technical knowledge, such as CUDA (Compute Unified Device Architecture) and OpenCL (Open Computing Language), is required, and the hurdle is high. OpenCL is an open API (Application Programming Interface) that can handle all computational resources (not limited to CPUs and GPUs) in a unified manner without being tied to specific hardware.

The following are required so that GPUs and FPGAs can be easily used in user applications. That is, when deploying general-purpose applications such as image processing and cryptographic processing to be operated in an environment, it is desired that the platform analyzes the application logic and automatically offloads the processing to the GPU and FPGA.

CUDA, a development environment for GPGPU (General Purpose GPU), which uses the computing power of GPUs for purposes other than image processing, is under active development. OpenCL has also emerged as a standard for handling heterogeneous hardware such as GPUs, FPGAs, and many-core CPUs in a unified manner.

With CUDA and OpenCL, programming is done by extending the C language. However, it is necessary to describe memory copy, release, etc. between a device such as a GPU and a CPU, which makes the description highly difficult. Actually, there are not many engineers who can master CUDA and OpenCL.

In order to perform GPGPU easily, there is a technique in which a portion to be processed in parallel, such as a loop statement, is specified on a directive basis, and a compiler converts it into device-oriented code according to the directive. Technical specifications include OpenACC (Open Accelerator) and the like, and compilers include PGI Compiler (registered trademark) and the like. For example, in an example using OpenACC, the user specifies parallel processing in code written in C/C++/Fortran using OpenACC directives. The PGI compiler checks the parallelism of the code, generates executable binaries for GPU and CPU, and converts them into executable modules. The IBM JDK (registered trademark) supports a function of offloading parallel processing specification according to the lambda format of Java (registered trademark) to the GPU. By using these techniques, the programmer does not need to be aware of data allocation to the GPU memory.
In this way, techniques such as OpenCL, CUDA, and OpenACC enable offload processing to GPUs and FPGAs.

However, even if offload processing itself can be performed, there are many problems with appropriate offloading. For example, there is a compiler with an automatic parallelization function, such as the Intel compiler (registered trademark). When performing automatic parallelization, parallel processing parts such as for statements (repeated statements) in the program are extracted. However, when GPUs are used to operate in parallel, performance is often poor due to overhead due to data exchange between the CPU and GPU memory. When speeding up by using a GPU, it is necessary for a skilled person to search for an appropriate parallel processing part by tuning with OpenCL or CUDA, or with a PGI compiler or the like.
For this reason, it is difficult for unskilled users to improve the performance of applications using GPUs, and doing so takes a lot of time.

Regarding placement, there is research on optimizing the embedding positions of VNs (Virtual Networks) in a group of servers on a network so as to make optimal use of network resources (see Non-Patent Document 1). In Non-Patent Document 1, the optimal placement of VNs is determined in consideration of communication traffic. However, that work targets single-resource virtual networks with the goal of reducing carrier equipment costs and overall response time; conditions such as the processing times of different applications and the cost and response time requirements of individual users have not been taken into account.

Non-Patent Document 2 can be cited as an approach to automating the trial-and-error process for parallel processing.
Non-Patent Document 2 proposes environment-adaptive software, in which code written once is automatically converted and resources are configured so that it can use the GPUs, FPGAs, many-core CPUs, and the like present in the deployment destination environment, and the application is operated with high performance. In addition, Non-Patent Document 2 proposes a system for automatically offloading loop statements of application code to the GPU as an element of environment-adaptive software, and evaluates the performance improvement.

Non-Patent Document 3 proposes a system for automatically offloading loop statements of application code to FPGA as an element of environment adaptive software, and evaluates performance improvement.
Non-Patent Document 4 evaluates a method of optimizing the amount of resources (such as the number of virtual machine cores) for executing an application after automatic conversion for a GPU or the like as an element of environment-adaptive software.

Non-Patent Documents 1 to 4 focus on shortening the processing time during automatic offloading.
However, none of them proposes operating the converted application while satisfying user requirements (price, response time) when processing is offloaded to heterogeneous devices such as GPUs and FPGAs.

The present invention has been made in view of this point, and its object is to optimally place an application that has been automatically converted for an offload device such as a GPU or FPGA so as to satisfy the user's cost or response time requirements.

In order to solve the above-described problems, the present invention provides an offload server that offloads specific processing of an application program to an accelerator, comprising: an application code analysis unit that analyzes the source code of the application program; a data transfer designation unit that analyzes the reference relationships of variables used in loop statements of the application program and, for data that may be transferred outside the loop, specifies the data transfer using an explicit specification line that explicitly specifies data transfer outside the loop; a parallel processing designation unit that identifies loop statements of the application program and compiles each identified loop statement with a parallel processing designation statement for the accelerator; a parallel processing pattern creation unit that excludes loop statements that cause a compilation error from offloading and creates parallel processing patterns specifying whether or not to perform parallel processing for loop statements that do not cause a compilation error; a performance measurement unit that compiles the application program of each parallel processing pattern, places it in an accelerator verification device, and executes performance measurement processing when offloaded to the accelerator; and a placement setting unit that, when placing the converted application program on any of a cloud server, a carrier edge server, and a user edge server on the network according to cost or response time conditions specified by the user, calculates and sets the placement location of the application program based on a linear programming formula in which device and link costs, the computational resource upper limit, and the bandwidth upper limit are constraints and the computational resource cost or the response time is the objective function.

According to the present invention, when an application is automatically converted so that it can be placed on an offload device such as a GPU or FPGA, the converted application can be optimally placed to meet the user's cost or response time requirements.

The drawings are listed below in order.
A functional block diagram showing a configuration example of an offload server according to the first embodiment of the present invention.
A diagram showing automatic offload processing using the offload server according to the first embodiment.
A diagram showing a search image of the control unit (automatic offload function unit) by Simple GA of the offload server according to the first embodiment.
A diagram showing an example of a normal CPU program of a comparative example.
A diagram showing an example of a loop statement when data is transferred from a CPU to a GPU using a simple CPU program of a comparative example.
A diagram showing an example of a loop statement when data is transferred from the CPU to the GPU with nest integration by the offload server according to the first embodiment.
A diagram showing an example of a loop statement when data is transferred from the CPU to the GPU with transfer integration by the offload server according to the first embodiment.
A diagram showing an example of a loop statement when data is transferred from the CPU to the GPU with transfer integration by the offload server according to the first embodiment and a temporary area is used.
Two flow charts explaining an overview of the operation of the implementation of the offload server according to the first embodiment.
A flow chart illustrating the setting of the resource ratio and resource amount added after the GPU offload attempt of the offload server according to the first embodiment, and the placement of a new application.
A diagram illustrating an example topology of computation nodes of the offload server according to the first embodiment.
A graph showing changes in the average response time and the number of deployed applications for the offload server according to the first embodiment.
A functional block diagram showing a configuration example of an offload server according to the second embodiment of the present invention.
A flow chart explaining an overview of the operation of the implementation of the offload server according to the second embodiment.
A flow chart showing performance measurement processing of the performance measurement unit of the offload server according to the second embodiment.
A diagram showing a search image of the PLD processing pattern creation unit of the offload server according to the second embodiment.
A diagram illustrating the flow from the C code to the search for the final OpenCL solution in the offload server according to the second embodiment.
A hardware configuration diagram showing an example of a computer that implements the functions of the offload server according to each embodiment of the present invention.

Hereinafter, an offload server in a mode for carrying out the present invention (hereinafter referred to as "this embodiment") will be described with reference to the drawings.
(Basic concept of automatic offloading of the present invention)
In order to embody the concept of environment-adaptive software, the present inventor has so far proposed methods for automatic GPU offloading of program loop statements, automatic FPGA offloading, and optimization of the execution resources of converted applications (see Non-Patent Documents 2, 3, and 4). The basic concept of the present invention will be described based on the examination of the elemental technologies of these Non-Patent Documents 2, 3, and 4 as well.

<Appropriate resource ratio between CPU and offload device>
First, optimization of the resource ratio between the CPU and the offload device after program conversion for offloading to the device is completed will be described.
A method such as Non-Patent Document 2 can automatically offload a normal program to an offload device such as a GPU or FPGA.
Currently, multi-core CPUs and many-core CPUs can flexibly allocate a percentage of all cores through virtualization using virtual machines and containers. In recent years, GPUs have been virtualized in the same way as CPUs, and operations such as allocating a percentage of all cores of GPUs are becoming possible. Regarding FPGA, resource usage is often represented by a set number of Look Up Tables and Flip Flops, and unused gates can be used for other purposes.

In this way, it is possible to operate using a portion of all resources for the CPU, GPU, and FPGA, and optimizing the resources of the CPU and offload devices according to the application is important for improving cost performance.

In addition, the application can be converted into code for CPU and GPU processing using the method described in Non-Patent Document 2. However, even if the code itself is appropriate, it will not perform well if the amounts of CPU and GPU resources are not properly balanced. For example, if a certain process takes 1000 seconds of CPU processing time and 1 second of GPU processing time, the CPU remains the bottleneck even if the offloadable processing is sped up to some extent by the GPU.

Furthermore, Non-Patent Document 5 (K. Shirahata, H. Sato and S. Matsuoka, "Hybrid Map Task Scheduling for GPU-Based Heterogeneous Clusters," IEEE Second International Conference on Cloud Computing Technology and Science (CloudCom), pp. 733-740, Dec. 2010) improves overall performance by distributing Map tasks so that the CPU and GPU execution times are equal when tasks are processed with the MapReduce (registered trademark) framework using both the CPU and the GPU.

The inventor came up with the idea of determining the resource ratio between the CPU and the offload device as follows. That is, to keep processing on any one device from becoming a bottleneck, and with reference to the above non-patent documents and the like, the resource ratio between the CPU and the offload device (hereinafter referred to as "resource ratio") is determined from the test case processing times so that the processing times of the CPU and the offload device are of the same order.

In addition, the inventor adopts a method of gradually increasing the speed based on the performance measurement results in the verification environment during automatic offloading, like the method of Non-Patent Document 2. The reason is that the performance varies greatly depending on not only the code structure but also the actual processing details such as hardware specifications, data size, loop count, and the like. Also, performance is difficult to predict statically and requires dynamic measurements. Therefore, at the time of code conversion, since there are already performance measurement results in the verification environment, resource ratios are determined using those results.

When measuring performance, a test case is specified and measured. For example, if the test case processing time in the verification environment is 10 seconds for CPU processing and 5 seconds for GPU processing, doubling the resources on the CPU side is considered to make the processing times about the same, so the resource ratio is 2:1. In addition, a user's request to speed up a particular process by offloading is reflected by preparing a test case that includes that process and speeding up that test case using the method of Non-Patent Document 2 or the like.
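The arithmetic in this example can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment; the function name `resource_ratio` and the use of a reduced fraction to express the ratio are assumptions made for the sketch.

```python
from fractions import Fraction


def resource_ratio(cpu_seconds, device_seconds):
    """Return (cpu_share, device_share) so both sides take about the same time.

    The side that is slower on the test case gets proportionally more
    resources. The name and the fraction-based rounding are hypothetical.
    """
    if cpu_seconds <= 0 or device_seconds <= 0:
        raise ValueError("processing times must be positive")
    ratio = Fraction(cpu_seconds) / Fraction(device_seconds)
    # Keep the ratio to small whole numbers of instances.
    ratio = ratio.limit_denominator(100)
    return ratio.numerator, ratio.denominator


# Test case from the text: CPU processing 10 s, GPU processing 5 s.
print(resource_ratio(10, 5))  # (2, 1)
```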

<Determining and automatically verifying the amount of CPU and offload device resources>
Next, determination and automatic verification of the amount of resources of the CPU and the offload device (hereinafter referred to as "resource amount") will be described.
Once the resource ratio has been determined as described in <Appropriate resource ratio between CPU and offload device>, the application is placed in the commercial environment.
When deploying in a commercial environment, the resource amount is determined so as to satisfy the cost request specified by the user while keeping the resource ratio as far as possible. For example, assume that one CPU VM costs 1000 yen/month, one GPU instance costs 4000 yen/month, and the resource ratio is 2:1. Assume also that the user's budget is within 10,000 yen per month. In this case, the 2:1 resource ratio fits within the user's budget of 10,000 yen per month (2 × 1000 + 4000 = 6000 yen), so "2" is secured for the CPU and "1" for the GPU as the resource amount, and the application is placed in the commercial environment. If, on the other hand, the user's budget is within 5000 yen per month, the appropriate resource ratio of 2:1 cannot be maintained. In this case, "1" is secured for the CPU and "1" for the GPU as the resource amount.
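The budget example above can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `set_resource_amounts` is hypothetical, prices are per instance per month, and the fallback when the ratio cannot be kept is taken to be one instance each, as in the 5000 yen case.

```python
def set_resource_amounts(ratio, unit_prices, budget):
    """Return (cpu_instances, device_instances) for a monthly budget in yen.

    ratio and unit_prices are (cpu, device) tuples. All names here are
    hypothetical and only mirror the worked example in the text.
    """
    cpu_ratio, dev_ratio = ratio
    cpu_price, dev_price = unit_prices
    # Cost of one "bundle" that exactly matches the resource ratio.
    bundle_cost = cpu_ratio * cpu_price + dev_ratio * dev_price
    multiples = budget // bundle_cost
    if multiples >= 1:
        # Largest amount that keeps the ratio and still fits the budget.
        return cpu_ratio * multiples, dev_ratio * multiples
    # The ratio cannot be kept within the budget: fall back to the
    # minimum of one instance each (this sketch does not re-check the
    # budget for the fallback).
    return 1, 1


prices = (1000, 4000)  # yen per month for one CPU VM and one GPU instance
print(set_resource_amounts((2, 1), prices, 10000))  # (2, 1)
print(set_resource_amounts((2, 1), prices, 5000))   # (1, 1)
```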

After securing resources and deploying the program in the commercial environment, automatic verification is performed to confirm that it works before the user uses it. Automatic verification runs performance verification test cases and regression test cases. The performance verification test cases are the assumed test cases specified by the user; they are executed with an automatic test execution tool such as Jenkins (registered trademark), and the processing time, throughput, and the like are measured. For the regression test cases, information on the software installed in the system, such as middleware and the OS, is acquired, and the corresponding regression tests are executed using Jenkins or the like. Performing this automatic verification with little test case preparation is studied in Non-Patent Document 6 (Y. Yamato, "Automatic verification technology of software patches for user virtual environments on IaaS cloud," Journal of Cloud Computing, Springer, 2015, 4:4, DOI: 10.1186/s13677-015-0028-6, Feb. 2015) and elsewhere, and the technology of Non-Patent Document 6 is used.

In the performance verification test cases, it is checked whether the calculation results remain correct when offloading is performed, by examining the difference from the calculation results without offloading. For example, with the PGI compiler, which handles GPU processing, the difference between the calculation results with and without the GPU can be confirmed using the PGI_compare (registered trademark) and acc_compare (registered trademark) API (Application Programming Interface) of the PCAST (registered trademark) function.
Note that even if parallel processing or the like is correctly offloaded, there are cases where the calculation results do not match completely, such as when the GPU and CPU have different rounding errors. Therefore, for example, confirmation according to the IEEE 754 specification is performed, and whether the difference is acceptable is presented to the user for confirmation by the user.
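A difference check of this kind can be sketched as follows. This sketch does not use PCAST; it only illustrates comparing CPU and GPU results within a tolerance and reporting the positions that exceed it, so that they can be presented to the user. The function name and the default tolerance are assumptions.

```python
import math


def find_result_differences(cpu_values, gpu_values, rel_tol=1e-6, abs_tol=0.0):
    """Return the indices where offloaded results differ beyond the tolerance.

    The tolerance values are hypothetical; in practice the acceptable
    difference is something the user confirms.
    """
    if len(cpu_values) != len(gpu_values):
        raise ValueError("result lengths differ")
    return [
        i
        for i, (c, g) in enumerate(zip(cpu_values, gpu_values))
        if not math.isclose(c, g, rel_tol=rel_tol, abs_tol=abs_tol)
    ]


cpu = [1.0, 2.0, 3.0]
gpu = [1.0000000001, 2.5, 3.0]  # small rounding difference, then a real one
print(find_result_differences(cpu, gpu))  # [1]
```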

As a result of automatic verification, the user is presented with information on the processing time and throughput of performance verification test cases, differences in calculation results, and regression test execution results. The user is also presented with the secured resources (the number of VMs, specifications, etc.) and their prices, and the user refers to this information to determine the start of operation.

<resource, resource ratio, test case processing time>
The resource, resource ratio, and test case processing time in this embodiment will be described.
- Resources: CPUs, GPUs, FPGAs, and the like are now provided as instances of virtual resources.
Resources include the number of CPU cores, clock, memory amount, and disk size; the number of GPU cores, clock, and memory amount; and the FPGA gate size (in units of LE (registered trademark) for Intel (registered trademark) and LC (registered trademark) for Xilinx (registered trademark)). Cloud service providers package these and provide them in forms such as small-size virtual machines or GPU instances. In the case of virtualization, the number of instances used can be regarded as the amount of resources used.

- Resource ratio: The resource ratio is the ratio of the numbers of CPU, GPU, and FPGA instances. If the numbers of instances are 1, 2, and 3, respectively, the resource ratio is 1:2:3.

- Test case processing time: This embodiment searches for an offload pattern that speeds up a test case specified by the user. A test case is, for example, a number of transactions such as TPC-C (registered trademark) in the case of a DB (database), or execution of Fourier transform processing on sample data in the case of FFT. The processing time is the execution time when the sample processing is executed. For example, the processing time of process A is 10 seconds before offloading and 2 seconds after offloading; each of these times is obtained.

<Finding loop statements>
At present, it is difficult for a compiler to determine that a given loop statement is suitable for GPU parallel processing. It is also difficult to predict how much performance and power consumption will result from offloading to the GPU without actually measuring it. Therefore, the instruction to offload a given loop statement to the GPU is given manually, and measurement is repeated by trial and error.
The present invention automatically finds appropriate loop statements to offload to the GPU using a genetic algorithm (GA), which is an evolutionary computation technique. That is, for a group of parallelizable loop statements, 1 is set for GPU execution and 0 is set for CPU execution to generate a gene, and an appropriate pattern is searched for by repeated measurement in a verification environment.
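The gene encoding described above can be sketched as follows. The loop identifiers and function names are hypothetical; only the convention of 1 for GPU execution and 0 for CPU execution comes from the text.

```python
def encode_gene(loop_ids, gpu_loops):
    """Map each parallelizable loop to 1 (GPU execution) or 0 (CPU execution)."""
    return [1 if loop_id in gpu_loops else 0 for loop_id in loop_ids]


def decode_gene(loop_ids, gene):
    """Turn a gene back into a per-loop execution target."""
    return {
        loop_id: ("GPU" if bit else "CPU")
        for loop_id, bit in zip(loop_ids, gene)
    }


loops = ["loop1", "loop2", "loop3"]       # hypothetical loop identifiers
gene = encode_gene(loops, {"loop2"})
print(gene)                                # [0, 1, 0]
print(decode_gene(loops, gene))
```

Each such gene corresponds to one candidate offload pattern whose performance would then be measured in the verification environment.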

(First embodiment)
Next, the offload server 1 and the like in the mode for carrying out the present invention (hereinafter referred to as "this embodiment") will be described.

[GPU automatic offload of loop statement]
FIG. 1 is a functional block diagram showing a configuration example of an offload server 1 according to the first embodiment of the present invention.
The offload server 1 is a device that automatically offloads specific processing of an application to an accelerator.
As shown in FIG. 1, the offload server 1 includes a control unit 11, an input/output unit 12, a storage unit 13, and a verification machine 14 (accelerator verification device).

The input/output unit 12 consists of a communication interface for transmitting and receiving information to and from each device, and an input/output interface for exchanging information with input devices such as a touch panel or keyboard and output devices such as a monitor.

The storage unit 13 is composed of a hard disk, flash memory, RAM (Random Access Memory), and the like. It stores a program (offload program) for executing each function of the control unit 11, and temporarily stores information necessary for the processing of the control unit 11 (for example, an intermediate language file (Intermediate file) 133).

The storage unit 13 includes a test case DB (Test case database) 131, an equipment resource DB 132, and an intermediate language file (Intermediate file) 133.

The test case DB 131 stores test item data corresponding to the software to be verified. The test item data is, for example, transaction test data such as TPC-C in the case of a database system such as MySQL.

The equipment resource DB 132 holds information prepared in advance on the resources, such as servers, held by the business operator and their prices, together with information on how much of them is in use. For example, it records that there are 10 servers that can accommodate 3 GPU instances, that one GPU instance costs 5000 yen per month, and how many of those servers are currently in use. This information is used to determine the amount of resources to be secured when the user specifies operating conditions (conditions such as cost and performance). The user operating conditions are the cost condition specified by the user at the time of the offload request (for example, a budget within 10,000 yen per month) and the performance condition (for example, a transaction throughput such as TPC-C at or above a specified level, or sample Fourier transform processing in one thread within a specified number of seconds).

The intermediate language file 133 temporarily stores information necessary for the processing of the control unit 11 in the form of a programming language interposed between the high-level language and the machine language.

The verification machine 14 is equipped with a CPU, GPU, and FPGA as a verification environment for environment-adaptive software.

The control unit 11 is an automatic offload function unit (Automatic Offloading function) that controls the offload server 1 as a whole. The control unit 11 is implemented, for example, by a CPU (Central Processing Unit) (not shown) loading an application program (offload program) stored in the storage unit 13 into RAM and executing it.

The control unit 11 includes an application code designation unit (Specify application code) 111, an application code analysis unit (Analyze application code) 112, a data transfer designation unit 113, a parallel processing designation unit 114, a resource ratio determination unit 115, a resource amount setting unit 116, a placement setting unit 170, a parallel processing pattern creation unit 117, a performance measurement unit 118, an executable file creation unit 119, a production environment placement unit (Deploy final binary files to production environment) 120, a performance measurement test extraction and execution unit (Extract performance test cases and run automatically) 121, and a user provision unit (Provide price and performance to a user to judge) 122.

<Application code designation unit 111>
The application code designation unit 111 designates an input application code. Specifically, the application code designation unit 111 passes the application code described in the received file to the application code analysis unit 112 .

<Application code analysis unit 112>
The application code analysis unit 112 analyzes the source code of the processing function and grasps structures such as loop statements and FFT library calls.

<Data transfer designation unit 113>
The data transfer designation unit 113 analyzes the reference relationships of variables used in loop statements of the application program and, for data that may be transferred outside the loop, specifies the data transfer using explicit specification lines that explicitly specify data transfer outside the loop (#pragma acc kernels, #pragma acc data copyin(a, b), #pragma acc data copyout(a, b), #pragma acc parallel loop, #pragma acc parallel loop vector, etc.).
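How such an explicit specification line might be derived from variable reference relationships can be sketched as follows. The rule used here is an assumption made for illustration: variables set on the CPU and read in the offloaded loop go into copyin, and variables written in the loop and used on the CPU afterwards go into copyout. The function name and the set-based inputs are hypothetical; only the directive syntax follows OpenACC.

```python
def build_data_directive(read_in_loop, written_in_loop,
                         set_on_cpu_before, used_on_cpu_after):
    """Build an OpenACC data directive line from variable reference sets.

    All four arguments are sets of variable names. The classification
    rule is an illustrative assumption, not the embodiment's analysis.
    """
    copyin = sorted(read_in_loop & set_on_cpu_before)
    copyout = sorted(written_in_loop & used_on_cpu_after)
    clauses = []
    if copyin:
        clauses.append("copyin(" + ", ".join(copyin) + ")")
    if copyout:
        clauses.append("copyout(" + ", ".join(copyout) + ")")
    if not clauses:
        return None  # nothing needs an explicit transfer
    return "#pragma acc data " + " ".join(clauses)


line = build_data_directive(
    read_in_loop={"a", "b"},
    written_in_loop={"c"},
    set_on_cpu_before={"a", "b"},
    used_on_cpu_after={"c"},
)
print(line)  # #pragma acc data copyin(a, b) copyout(c)
```

Placing the resulting line before the outermost loop moves the transfer out of the loop body, so the data is not copied on every iteration.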

<Parallel processing designation unit 114>
The parallel processing designation unit 114 specifies loop statements (repetition statements) of the application program, and compiles each loop statement by designating a parallel processing designation statement in the accelerator.
The parallel processing designation unit 114 includes an offload range extraction unit (Extract offloadable area) 114a and an intermediate language file output unit (Output intermediate file) 114b.

The offload range extraction unit 114a identifies processing that can be offloaded to the GPU/FPGA, such as loop statements and FFT, and extracts an intermediate language corresponding to the offload processing.
The intermediate language file output unit 114b outputs the extracted intermediate language file 133. Intermediate language extraction is not a one-time process; it is repeated so that execution can be tried and optimized in the search for a suitable offload region.

<Resource ratio determination unit 115>
The resource ratio determination unit 115 determines the resource ratio from the ratio of the processing times of the CPU and the offload device (the test case CPU processing time and offload device processing time) based on the performance measurement results (described later). Specifically, the resource ratio determination unit 115 determines the resource ratio so that the processing times of the CPU and the offload device are of the same order. Further, when the difference between the processing times of the CPU and the offload device is equal to or greater than a predetermined threshold value, the resource ratio determination unit 115 sets the resource ratio to a predetermined upper limit value.
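The threshold rule can be sketched as follows. The function name, the parameter names, and the sample threshold and upper limit values are assumptions; the text only states that the ratio is set to a predetermined upper limit when the processing time difference reaches a predetermined threshold.

```python
def determine_resource_ratio(cpu_seconds, device_seconds,
                             threshold_seconds, upper_limit):
    """Return (cpu_share, device_share) for the CPU and the offload device."""
    if cpu_seconds <= 0 or device_seconds <= 0:
        raise ValueError("processing times must be positive")
    cpu_is_slower = cpu_seconds >= device_seconds
    if abs(cpu_seconds - device_seconds) >= threshold_seconds:
        # The gap is too large to balance: pin the ratio at the upper limit.
        return (upper_limit, 1) if cpu_is_slower else (1, upper_limit)
    # Otherwise give the slower side proportionally more resources.
    fastest = min(cpu_seconds, device_seconds)
    return (cpu_seconds / fastest, device_seconds / fastest)


# Hypothetical settings: threshold 100 s, upper limit 10.
print(determine_resource_ratio(1000, 1, threshold_seconds=100, upper_limit=10))  # (10, 1)
print(determine_resource_ratio(10, 5, threshold_seconds=100, upper_limit=10))    # (2.0, 1.0)
```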

<Resource amount setting unit 116>
Based on the determined resource ratio, the resource amount setting unit 116 sets the resource amounts of the CPU and the offload device so as to satisfy a predetermined cost condition (described later). Specifically, the resource amount setting unit 116 maintains the determined resource ratio and sets the maximum resource amount that satisfies the predetermined cost condition. In addition, if the predetermined cost condition is not satisfied even when the minimum resource amount that maintains the determined resource ratio is set, the resource amount setting unit 116 breaks the resource ratio and sets the resource amounts of the CPU and the offload device to smaller values (for example, the minimum) that satisfy the cost condition.

<Placement setting unit 170>
When placing the converted application on any of the cloud server, the carrier edge server, and the user edge server on the network according to the cost or response time conditions specified by the user, the placement setting unit 170 calculates and sets the placement location of the application based on a linear programming formula in which device and link costs, the computational resource upper limit, and the bandwidth upper limit are constraints and the cost of the computational resources or the response time is the objective function. Specifically, the placement setting unit 170 calculates and sets the placement location of the new application (the placement location of the APL) by linear programming, based on the server and link specification information in the equipment resource DB 132 and the placement information of existing applications. The linear programming method uses, for example, the objective function and constraint conditions of the linear programming formulas shown in [Equation 1] and [Equation 2] below. These formulas are stored in the equipment resource DB 132, read out from it by the placement setting unit 170, and expanded in the memory used by the placement setting unit 170.
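The shape of this placement decision can be illustrated with a toy example. The sketch below is not the linear programming formulation of [Equation 1] and [Equation 2]; it swaps in an exhaustive search over three candidate sites for a single application, which is enough to show how the constraints and the choice of objective interact. All class names, field names, and numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Site:
    name: str
    free_resources: int         # compute units left under the upper limit
    free_bandwidth_mbps: float  # link bandwidth left under the upper limit
    cost_per_unit: float        # monthly price of one compute unit
    link_cost: float            # monthly price of the link to the site
    network_delay_s: float      # delay between the user and the site


@dataclass
class AppRequest:
    resources: int
    bandwidth_mbps: float
    processing_time_s: float
    objective: str = "cost"     # "cost" or "response_time"
    max_cost: Optional[float] = None
    max_response_time_s: Optional[float] = None


def place(app, sites):
    """Return the name of the best feasible site, or None if none fits."""
    best = None
    for site in sites:
        cost = app.resources * site.cost_per_unit + site.link_cost
        response = app.processing_time_s + site.network_delay_s
        # Constraints: resource and bandwidth upper limits, user conditions.
        if app.resources > site.free_resources:
            continue
        if app.bandwidth_mbps > site.free_bandwidth_mbps:
            continue
        if app.max_cost is not None and cost > app.max_cost:
            continue
        if (app.max_response_time_s is not None
                and response > app.max_response_time_s):
            continue
        score = cost if app.objective == "cost" else response
        if best is None or score < best[0]:
            best = (score, site.name)
    return None if best is None else best[1]


sites = [
    Site("cloud", 16, 1000.0, 1000.0, 100.0, 0.1),
    Site("carrier_edge", 8, 100.0, 1500.0, 50.0, 0.03),
    Site("user_edge", 4, 50.0, 2000.0, 10.0, 0.005),
]
cheap = AppRequest(2, 10.0, 1.0, objective="cost", max_response_time_s=1.05)
fast = AppRequest(2, 10.0, 1.0, objective="response_time", max_cost=5000.0)
print(place(cheap, sites), place(fast, sites))
```

With these hypothetical numbers, the cost-minimizing request lands on the carrier edge because the cloud misses the response time condition, while the response-time-minimizing request lands on the user edge. Placing many applications at once, with shared capacity, is where a linear programming solver would take over from this exhaustive search.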

<Parallel processing pattern creation unit 117>
The parallel processing pattern creation unit 117 creates a parallel processing pattern that excludes loop statements (repetition statements) causing compilation errors from the offload targets and specifies, for the loop statements that do not cause compilation errors, whether or not to execute them in parallel.

<Performance measurement unit 118>
The performance measurement unit 118 compiles the application program of the parallel processing pattern, places it on the verification machine 14, and executes the processing for measuring performance when offloaded to the accelerator.
The performance measurement unit 118 includes a binary file placement unit (Deploy binary files) 118a. The binary file placement unit 118a deploys (places) an executable file derived from the intermediate language on the verification machine 14 equipped with a GPU or FPGA.

The performance measurement unit 118 executes the placed binary file, measures the performance when offloading, and returns the performance measurement result to the offload range extraction unit 114a. In this case, the offload range extraction unit 114a extracts another parallel processing pattern, and the intermediate language file output unit 114b attempts performance measurement based on the extracted intermediate language (see symbol a in FIG. 2, described later).

<Executable File Creation Unit 119>
Based on the performance measurement results repeated a predetermined number of times, the executable file creation unit 119 selects a plurality of parallel processing patterns with high processing performance from among the parallel processing patterns, and creates different parallel processing patterns by applying crossover and mutation processing to the selected patterns. The executable file creation unit 119 then measures performance anew and, after performing the performance measurement the specified number of times, selects the parallel processing pattern with the highest processing performance based on the performance measurement results, compiles that pattern, and creates an executable file.

<Production environment placement unit 120>
The production environment placement unit 120 places the created executable file in the production environment for the user (“place final binary file in production environment”). The production environment placement unit 120 determines a pattern specifying the final offload area, and deploys it in the production environment for users.

<Performance measurement test extraction execution unit 121>
After arranging the execution files, the performance measurement test extraction execution unit 121 extracts performance test items from the test case DB 131 and executes the performance test (“arrangement of the final binary file to the production environment”).
After arranging the executable file, the performance measurement test extraction execution unit 121 extracts performance test items from the test case DB 131 and automatically executes the extracted performance test in order to show the performance to the user.

<User provision unit 122>
The user provision unit 122 presents information such as price and performance to the user based on the performance test results (“Provision of information such as price/performance to the user”). The test case DB 131 stores performance test items. The user provision unit 122 presents data such as price and performance to the user together with the results of the performance tests corresponding to the test items stored in the test case DB 131. Based on the presented information such as price and performance, the user decides whether to start paid use of the service. Here, the technique of Non-Patent Document 7 (Y. Yamato, M. Muroi, K. Tanaka and M. Uchimura, “Development of Template Management Technology for Easy Deployment of Virtual Resources on OpenStack,” Journal of Cloud Computing, Springer, 2014, 3:7, DOI: 10.1186/s13677-014-0007-3, 12 pages, June 2014) and the technique of Non-Patent Document 6 may be used.

[Application of genetic algorithm]
The offload server 1 can use GA (Genetic Algorithms) for offload optimization. The configuration of the offload server 1 when using GA is as follows.
That is, the parallel processing specifying unit 114 sets the gene length to the number of loop statements (repeated statements) that do not cause compilation errors based on the genetic algorithm. The parallel processing pattern creation unit 117 maps the availability of accelerator processing to the gene pattern by assigning either 1 or 0 when accelerator processing is to be performed, and the other 0 or 1 when not performing accelerator processing.

The parallel processing pattern creation unit 117 prepares a gene pattern for a specified number of individuals in which each value of the gene is randomly created to be 1 or 0. The performance measurement unit 118 compiles the application code specifying the parallel processing specifying statement in the accelerator according to each individual, and places it in the verification machine 14 . The performance measurement unit 118 executes performance measurement processing in the verification machine 14 .
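To make the mapping between a gene pattern and the application code concrete, the sketch below inserts a parallel processing designation statement above each loop whose gene value is 1. It is only an illustration: the function name, the representation of the source as a list of lines, and the choice of 1 for accelerator processing are assumptions.

```python
def apply_gene(source_lines, loop_line_numbers, gene, directive="#pragma acc kernels"):
    """Return a copy of the source with the directive above each selected loop.

    loop_line_numbers -- index of the line that starts each candidate loop
    gene              -- string of "0"/"1", one digit per candidate loop
    """
    selected = {n for n, bit in zip(loop_line_numbers, gene) if bit == "1"}
    annotated = []
    for number, line in enumerate(source_lines):
        if number in selected:
            # Keep the loop's indentation so the generated code stays readable.
            indent = line[: len(line) - len(line.lstrip())]
            annotated.append(indent + directive)
        annotated.append(line)
    return annotated
```

Each individual yields one annotated source file, which is then compiled and measured on the verification machine.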

Here, if a gene with the same parallel processing pattern as an earlier one appears in an intermediate generation, the performance measurement unit 118 does not compile the application code corresponding to that parallel processing pattern or measure its performance again, and instead uses the same performance measurement value as before.
In addition, the performance measurement unit 118 sets the performance measurement value to a predetermined time (long time) as a time-out for an application code that causes a compile error and an application code whose performance measurement does not end within a predetermined time.
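The reuse of earlier measurements and the timeout handling can be sketched as a small cache. The class name, the timeout value, and the convention that the measurement hook returns None on a compile error or an unfinished run are assumptions made for this illustration.

```python
TIMEOUT_SECONDS = 1000.0  # assumed "long time" used as the penalty value


class MeasurementCache:
    """Remember the processing time of every parallel processing pattern."""

    def __init__(self, measure):
        # measure(pattern) -> seconds, or None on compile error / no completion
        self._measure = measure
        self._results = {}
        self.runs = 0  # number of real compile-and-measure runs

    def processing_time(self, pattern):
        if pattern not in self._results:
            self.runs += 1
            seconds = self._measure(pattern)
            if seconds is None or seconds > TIMEOUT_SECONDS:
                seconds = TIMEOUT_SECONDS  # treat as a timeout
            self._results[pattern] = seconds
        return self._results[pattern]
```

Because a failed pattern receives a long processing time rather than being dropped, it stays in the population but is unlikely to be selected.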

The executable file creation unit 119 measures the performance of all individuals and evaluates them so that individuals with shorter processing times have higher fitness. The executable file creation unit 119 selects, as high-performance individuals, those whose fitness is higher than a predetermined value (for example, the top n% of all individuals, or the top m individuals, where n and m are natural numbers), and applies crossover and mutation to the selected individuals to create the individuals of the next generation. After the specified number of generations has been processed, the executable file creation unit 119 selects the parallel processing pattern with the highest performance as the solution.

The automatic offload operation of the offload server 1 configured as described above will be described below.
[Auto Offload Operation]
FIG. 2 is a diagram showing automatic offload processing using the offload server 1.
As shown in FIG. 2, the offload server 1 is applied to elemental technology of environment adaptive software. The offload server 1 has a control unit (automatic offload function unit) 11 , a test case DB 131 , an equipment resource DB 132 , an intermediate language file 133 and a verification machine 14 .
The offload server 1 acquires an application code 125 used by the user.

The user is, for example, a person who has made a contract to use various devices (Device 151, device 152 having CPU-GPU, device 153 having CPU-FPGA, device 154 having CPU).
The offload server 1 automatically offloads functional processing to the accelerators of the device 152 with CPU-GPU and the device 153 with CPU-FPGA.

The operation of each part will be described below with reference to the step numbers in FIG. 2.
<Step S11: Specify application code>
In step S11, the application code designation unit 111 (see FIG. 1) passes the application code written in the received file to the application code analysis unit 112.

<Step S12: Analyze application code>
In step S12, the application code analysis unit 112 (see FIG. 1) analyzes the source code of the processing function and grasps structures such as loop statements and FFT library calls.

<Step S13: Extract offloadable area>
In step S13, the parallel processing designation unit 114 (see FIG. 1) identifies loop statements (repetition statements) of the application, and compiles each repetition statement by designating a parallel processing designation statement in the accelerator. Specifically, the offload range extraction unit 114a (see FIG. 1) identifies processing that can be offloaded to the GPU/FPGA, such as loop statements and FFT, and extracts an intermediate language corresponding to the offload processing.

<Step S14: Output intermediate file>
In step S14, the intermediate language file output unit 114b (see FIG. 1) outputs the intermediate language file 133. Intermediate language extraction is not a one-time process; it is repeated to try and optimize executions in the search for a suitable offload area.

<Step S15: Compile error>
In step S15, the parallel processing pattern creation unit 117 (see FIG. 1) creates a parallel processing pattern that excludes loop statements causing compilation errors from the offload targets and specifies, for the loop statements that do not cause compilation errors, whether or not to process them in parallel.

<Step S21: Deploy binary files>
In step S21, the binary file placement unit 118a (see FIG. 1) deploys the execution file derived from the intermediate language to the verification machine 14 equipped with GPU/FPGA.

<Step S22: Measure performance>
In step S22, the performance measurement unit 118 (see FIG. 1) executes the placed file and measures the performance when offloading.
In order to make the area to be offloaded more appropriate, this performance measurement result is returned to the offload range extraction unit 114a, and the offload range extraction unit 114a extracts another pattern. Then, the intermediate language file output unit 114b attempts performance measurement based on the extracted intermediate language (see symbol a in FIG. 2).

As indicated by symbol a in FIG. 2, the control unit 11 repeatedly executes steps S12 to S22. The automatic offload function of the control unit 11 is summarized as follows. The parallel processing specification unit 114 identifies the loop statements (repetition statements) of the application program, specifies a parallel processing specification statement for the GPU for each loop statement, and compiles it. The parallel processing pattern creation unit 117 then creates parallel processing patterns that exclude loop statements causing compilation errors from the offload targets and specify, for the loop statements that do not cause compilation errors, whether or not to process them in parallel. The binary file placement unit 118a compiles the application program of each parallel processing pattern and places it on the verification machine 14, and the performance measurement unit 118 executes the performance measurement processing on the verification machine 14. Based on the performance measurement results repeated a predetermined number of times, the executable file creation unit 119 selects the pattern with the highest processing performance from the parallel processing patterns, compiles the selected pattern, and creates an executable file.

<Step S23: Resource amount setting according to user operating conditions>
In step S23, the control unit 11 sets the resource amounts according to the user operating conditions. That is, the resource ratio determination unit 115 of the control unit 11 determines the resource ratio between the CPU and the offload device. Based on the determined resource ratio, the resource amount setting unit 116 then refers to the information in the equipment resource DB 132 and sets the resource amounts of the CPU and the offload device so as to satisfy the user operating conditions (see FIG. 10, described later).

<Step S24: Deploy final binary files to production environment>
In step S24, the production environment placement unit 120 determines a pattern specifying the final offload area and deploys it to the production environment for the user.

<Step S25: Extract performance test cases and run automatically>
In step S25, the performance measurement test extraction execution unit 121 extracts performance test items from the test case DB 131 and automatically executes the extracted performance test in order to show the performance to the user after the execution file is arranged.

<Step S26: Provide price and performance to a user to judge>
In step S26, the user provision unit 122 presents information such as price and performance to the user based on the performance test results. Based on the presented information such as price and performance, the user decides whether to start paid use of the service.

The above steps S11 to S26 are performed, for example, in the background when the user uses the service, and are assumed to be performed, for example, during the first day of provisional use.

As described above, when applied to the elemental technology of environment-adaptive software, the control unit (automatic offload function unit) 11 of the offload server 1 extracts the area to be offloaded from the source code of the application program used by the user and outputs the intermediate language in order to offload functional processing (steps S11 to S15). The control unit 11 places the executable file derived from the intermediate language on the verification machine 14, executes it, and verifies the offload effect (steps S21 to S22). After repeating verification and determining an appropriate offload area, the control unit 11 deploys the executable file in the production environment that is actually provided to the user and provides it as a service (steps S23 to S26).

[GPU automatic offload using GA]
GPU automatic offloading is a process for repeating steps S12 to S22 in FIG. 2 for the GPU and finally obtaining the offload code to be deployed in step S23.

GPUs generally do not guarantee latency, but they are devices suitable for increasing throughput through parallel processing. Encryption processing, image processing for camera video analysis, and machine learning processing for analyzing a large amount of sensor data are typical examples, and these are often repetitive processes. Therefore, we aim to increase the speed by automatically offloading repeated statements of the application to the GPU.

However, as described in the prior art, appropriate parallel processing is required for speeding up. In particular, when a GPU is used, performance often cannot be obtained unless the data size and the number of loop iterations are large, because of memory transfer between the CPU and the GPU. Also, depending on the timing of memory data transfer and other factors, a combination of individual loop statements (repetition statements) that can each be sped up by parallel processing may not be the fastest. For example, if there are 10 for statements (repetition statements) and the 1st, 5th, and 10th can each be made faster than on the CPU, the combination of the 1st, 5th, and 10th is not necessarily the fastest.

In order to specify the appropriate parallel area, there is an attempt to use the PGI compiler to optimize the parallelization of for statements through trial and error. However, trial and error requires a lot of operations, and when the service is provided, there is a problem that it delays the user's start of use and increases the cost.

Therefore, in this embodiment, an appropriate offload area is automatically extracted from a general-purpose program that is not intended for parallelization. For this reason, the parallelizable for statement is checked first, and then the performance verification trial is repeated in the verification environment using the GA for the parallelizable for statement group to search for an appropriate area. After narrowing down to parallelizable for statements, by retaining and recombining parallel processing patterns that can be accelerated in the form of genes, patterns that can be efficiently accelerated from a huge number of possible parallel processing patterns can be explored.

[Search image of control unit (automatic offload function unit) 11 by Simple GA]
FIG. 3 is a diagram showing a search image of the control unit (automatic offload function unit) 11 by Simple GA. FIG. 3 shows a search image of processing and gene sequence mapping of the for statement.
GA is one of combinatorial optimization methods that imitate the evolutionary process of organisms. The flow chart of GA consists of initialization→evaluation→selection→crossover→mutation→end determination.
In the present embodiment, Simple GA, a GA with simplified processing, is used. Simple GA is a simplified GA in which genes take only the values 1 and 0 and which uses roulette selection, one-point crossover, and a mutation that inverts the value of one gene.

<initialization>
In the initialization, after checking whether each for statement in the application code can be parallelized, the parallelizable for statements are mapped to the gene array. A gene value is set to 1 when GPU processing is performed and to 0 when it is not. For the genes, a specified number M of individuals is prepared, and 1 or 0 is randomly assigned to each for statement.
Specifically, the control unit (automatic offload function unit) 11 (see FIG. 1) acquires the application code 130 (see FIG. 2) used by the user and, as shown in FIG. 3, checks whether the for statements can be parallelized from the code patterns 141 of the application code 130. As shown in FIG. 3, when five for statements are found from the code patterns 141 (see symbol b in FIG. 3), one digit of 1 or 0 is randomly assigned to each for statement, here five digits for the five for statements. For example, a digit is set to 0 when the statement is processed by the CPU and to 1 when it is sent to the GPU. At this stage, however, 1 or 0 is assigned randomly.
The code corresponding to the gene length is 5 digits, and a code with a gene length of 5 digits has 2⁵ = 32 patterns, for example 10001, 10010, .... In FIG. 3, circle marks (○) in the code patterns 141 are shown as code images.
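The size of the search space and the random initialization can be sketched as follows. The function names and the use of bit strings to represent genes are assumptions made for illustration.

```python
import itertools
import random


def all_patterns(gene_length):
    """Every possible parallel processing pattern for the given gene length."""
    return ["".join(bits) for bits in itertools.product("01", repeat=gene_length)]


def random_population(gene_length, individuals, rng):
    """M individuals whose digits are each randomly set to 1 or 0."""
    return ["".join(rng.choice("01") for _ in range(gene_length))
            for _ in range(individuals)]
```

For five for statements, all_patterns(5) has 32 entries. The number of patterns doubles with every additional loop, which is why the embodiment searches with a GA instead of measuring every pattern.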

<evaluation>
In the evaluation, deployment (arrangement) and performance measurement (Deploy & performance measurement) are performed (see symbol c in FIG. 3). That is, the performance measurement unit 118 (see FIG. 1) compiles the code corresponding to the gene, deploys it to the verification machine 14, and executes it. The performance measurement unit 118 performs benchmark performance measurement. The goodness of fit of the gene of the pattern with good performance (parallel processing pattern) is increased.

<select>
In the selection, high performance code patterns are selected based on goodness of fit (see symbol d in FIG. 3). Based on the fitness, the performance measurement unit 118 (see FIG. 1) selects genes with high fitness in a specified number of individuals. In this embodiment, roulette selection according to goodness of fit and elite selection of genes with the highest goodness of fit are performed.
FIG. 3 shows a search image in which the number of circles (o) in the selected code patterns 142 is reduced to three.
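A minimal sketch of the selection step, assuming the fitness of each pattern is already known: the best individual is always kept (elite selection), and the remaining parents are drawn with probability proportional to fitness (roulette selection). The function name and the dictionary of fitness values are illustrative assumptions.

```python
import random


def select_parents(population, fitness, count, rng):
    """Return `count` parents: the elite first, then roulette-selected ones."""
    elite = max(population, key=lambda p: fitness[p])
    weights = [fitness[p] for p in population]
    # random.Random.choices draws with replacement, proportionally to weights.
    return [elite] + rng.choices(population, weights=weights, k=count - 1)
```

Keeping the elite guarantees that the best pattern found so far is not lost when the roulette happens to miss it.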

<Crossover>
In crossover, at a constant crossover rate Pc, some genes are exchanged between selected individuals at one point to create offspring individuals.
Roulette-selected patterns (parallel processing patterns) and genes of other patterns are crossed. The position of the one-point crossover is arbitrary. For example, crossover is performed at the third digit of the five-digit code.
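One-point crossover on bit-string genes can be sketched as below. The function names are assumptions, and the crossover point is interpreted here as the number of leading digits kept from the first parent.

```python
import random


def one_point_crossover(parent_a, parent_b, point):
    """Swap the tails of two genes after the first `point` digits."""
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b


def maybe_crossover(parent_a, parent_b, pc, rng):
    """Apply one-point crossover with probability pc at a random position."""
    if rng.random() < pc:
        point = rng.randint(1, len(parent_a) - 1)
        return one_point_crossover(parent_a, parent_b, point)
    return parent_a, parent_b
```

For example, crossing 10010 and 01001 after the third digit gives 10001 and 01010.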

<mutation>
Mutation changes each value of an individual's gene from 0 to 1 or 1 to 0 at a constant mutation rate Pm.
Also, in order to avoid local minima, mutations are introduced. It should be noted that a mode in which no mutation is performed is also possible in order to reduce the amount of calculation.
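The mutation step can be sketched as an independent flip of each digit with probability Pm. The function name is an assumption; passing pm = 0 corresponds to the mode without mutation.

```python
import random


def mutate(gene, pm, rng):
    """Flip each digit of the gene (0 to 1 or 1 to 0) with probability pm."""
    return "".join(
        ("1" if bit == "0" else "0") if rng.random() < pm else bit
        for bit in gene
    )
```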

<end judgment>
As shown in FIG. 3, next-generation code patterns are generated after crossover and mutation (see symbol e in FIG. 3).
In the end determination, the processing ends after being repeated for the designated number of generations T, and the gene with the highest fitness is taken as the solution.
For example, performance is measured and the three fastest patterns, 10010, 01001, and 00101, are selected. In the next generation, these three are recombined by the GA; for example, the first and second are crossed to create a new pattern (parallel processing pattern) 11011. At this time, a mutation such as changing a 0 to a 1 is arbitrarily inserted into the recombined pattern. The above is repeated to find the fastest pattern. A designated number of generations (for example, 20 generations) is set, and the pattern remaining in the final generation is taken as the final solution.
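Putting the steps together, the whole search can be sketched as one loop that stops after a fixed number of generations. This is an illustrative sketch only: the function name, the default rates, the use of 1/time as fitness, and the measure callback (standing in for compiling, deploying, and benchmarking on the verification machine 14) are assumptions.

```python
import random


def simple_ga(measure, gene_length, individuals, generations, pc=0.9, pm=0.05, seed=0):
    """Search for a fast parallel processing pattern; measure(pattern) -> seconds."""
    rng = random.Random(seed)
    times = {}  # pattern -> measured time, so repeated patterns are not re-measured

    def fitness(pattern):
        if pattern not in times:
            times[pattern] = measure(pattern)
        return 1.0 / times[pattern]  # shorter time, higher fitness

    population = ["".join(rng.choice("01") for _ in range(gene_length))
                  for _ in range(individuals)]
    for _ in range(generations):  # end determination: fixed generation count
        weights = [fitness(p) for p in population]
        next_generation = [max(population, key=fitness)]  # elite selection
        while len(next_generation) < individuals:
            a, b = rng.choices(population, weights=weights, k=2)  # roulette
            if rng.random() < pc:  # one-point crossover
                point = rng.randint(1, gene_length - 1)
                a = a[:point] + b[point:]
            a = "".join(("1" if bit == "0" else "0") if rng.random() < pm else bit
                        for bit in a)  # mutation
            next_generation.append(a)
        population = next_generation
    return max(population, key=fitness)
```

Because the elite is copied unchanged into every generation, the best measured time never gets worse as the generations proceed.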

<deploy (deployment)>
The parallel processing pattern with the highest processing performance, corresponding to the gene with the highest fitness, is deployed anew to the production environment and provided to the user.

<Supplementary explanation>
A case where there are a considerable number of for statements (loop statements; repetition statements) that cannot be offloaded to the GPU will be described. For example, even if there are 200 for statements, only about 30 can be offloaded to the GPU. Here, GA is performed on these 30 items by excluding those that cause errors.

OpenACC compilers allow a region to be specified with the directive #pragma acc kernels, extract bytecode for the GPU from that region, and execute it to perform GPU offloading. By writing a for statement inside this #pragma region, it is possible to determine whether the for statement runs on the GPU.

For example, when C/C++ is used, the C/C++ code is analyzed to find the for statements. When a for statement is found, it is annotated using the OpenACC parallel processing directives #pragma acc kernels, #pragma acc parallel loop, and #pragma acc parallel loop vector. Specifically, the for statements are placed one at a time under #pragma acc kernels, #pragma acc parallel loop, and #pragma acc parallel loop vector and compiled. If an error occurs, that for statement cannot be processed by the GPU in the first place and is therefore excluded.

In this way, the remaining for statements are found. The number of for statements that caused no error is then taken as the length (gene length). If there are 5 error-free for statements, the gene length is 5, and if there are 10 error-free for statements, the gene length is 10. Note that parallel processing is not possible when there is a data dependence such that the result of the previous iteration is used in the next one.
The above is the preparation stage. Next, GA processing is performed.
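The preparation stage can be sketched as a filter over the candidate loops. The compiles callback is a hypothetical hook that would wrap a real compiler invocation, and keeping a loop when at least one of the three directives compiles is an assumption about how the three trials are combined.

```python
DIRECTIVES = (
    "#pragma acc kernels",
    "#pragma acc parallel loop",
    "#pragma acc parallel loop vector",
)


def find_offloadable_loops(loop_ids, compiles):
    """Map each loop that compiles to the first directive that worked.

    compiles(loop_id, directive) -> bool is a hypothetical hook that annotates
    only that loop with the directive and reports whether compilation succeeds.
    """
    usable = {}
    for loop_id in loop_ids:
        for directive in DIRECTIVES:
            if compiles(loop_id, directive):
                usable[loop_id] = directive
                break  # loops for which every directive fails are excluded
    return usable
```

The gene length is then len(usable), one digit per loop that survived the check.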

A code pattern whose gene length corresponds to the number of for statements is obtained. At the beginning, parallel processing patterns such as 10010, 01001, 00101, ... are assigned randomly. GA processing is performed and the code is compiled. At that time, an error may occur even for a for statement that can be offloaded. This happens when the for statements are nested (the GPU can process them if only one of them is specified). In this case, the for statement that caused the error may be left in place. Specifically, one method is to treat the case as a timeout by assigning it a long processing time.

The code is deployed on the verification machine 14 and benchmarked; for example, in the case of image processing, it is benchmarked with that image processing. The shorter the processing time, the higher the fitness is evaluated. For example, the fitness is the -1/2 power of the processing time: 1 if the processing takes 1 second, 0.1 if it takes 100 seconds, and 10 if it takes 0.01 seconds.
Individuals with high fitness are selected, for example 3 to 5 out of 10, and recombined to create new code patterns. During this process, a pattern identical to an earlier one may be created. In that case, the same benchmark need not be run again, so the same data as before is used. In this embodiment, the code patterns and their processing times are stored in the storage unit 13.
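The -1/2 power evaluation can be written directly; the function name is an assumption.

```python
def fitness_from_time(processing_time_seconds):
    """Fitness as the -1/2 power of the processing time (shorter is better)."""
    return processing_time_seconds ** -0.5
```

Taking the square root compresses the range: a pattern that is 100 times faster receives only 10 times the fitness, so very fast patterns dominate the selection less strongly.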
The search image of the control unit (automatic offload function unit) 11 by Simple GA has been described above. Next, a batch processing technique for data transfer will be described.

[Batch processing method for data transfer]
<Basic concept>
In order to reduce CPU-GPU transfers, in addition to transferring the variables of nested loops at as high a level as possible, the present invention consolidates the transfer timing of many variables and reduces the transfers that the compiler performs automatically.
In order to reduce the number of transfers, variables are transferred collectively not only per nest but also whenever their transfer timings to the GPU can be grouped. For example, unless a variable is one whose GPU processing result is modified by the CPU and then processed again by the GPU, variables defined on the CPU and used in multiple loop statements can be transferred to the GPU collectively before GPU processing starts and returned to the CPU after all GPU processing has finished.

Because the reference relationships between loops and variables are grasped during code analysis, variables that are defined in multiple files and for which GPU processing and CPU processing are not nested, so that CPU processing and GPU processing can be separated, are specified for collective transfer using the OpenACC data copy statement.
Variables that are transferred collectively before the start of GPU processing and do not need to be transferred at the timing of loop statement processing are marked with data present to state explicitly that no transfer is needed.
When transferring data between the CPU and GPU, a temporary area is created (#pragma acc declare create), data is stored in the temporary area, and then the temporary area is synchronized (#pragma acc update) to instruct the transfer.

<Comparative example>
First, a comparative example will be described.
Comparative examples are a normal CPU program (see FIG. 4), simple GPU use (see FIG. 5), and nest integration (Non-Patent Document 2) (see FIG. 6). Note that <1> to <4>, etc. at the beginning of loop statements in the following descriptions and figures are added for convenience of explanation (the same applies to other figures and their explanations).
In the normal CPU program shown in FIG. 4, the <1> loop [for(i=0; i<10; i++)] { } on the CPU program side contains the <2> loop [for(j=0; j<20; j++)] { }. Symbol f in FIG. 4 indicates the setting of variables a and b in the <2> loop.
These are followed by the <3> loop [for(k=0; k<30; k++)] { } and the <4> loop [for(l=0; l<40; l++)] { }. Symbol g in FIG. 4 indicates the setting of variables c and d in the <3> loop, and symbol h in FIG. 4 indicates the setting of variables e and f in the <4> loop.
The normal CPU program shown in FIG. 4 is executed by the CPU (not using the GPU).

FIG. 5 is a diagram showing a loop statement when the normal CPU program shown in FIG. 4 uses a simple GPU to transfer data from the CPU to the GPU. Data transfer types include data transfer from the CPU to the GPU and data transfer from the GPU to the CPU. Data transfer from the CPU to the GPU will be taken as an example below.
In the simple GPU utilization loop statements shown in FIG. 5, the <1> loop [for(i=0; i<10; i++)] { } on the CPU program side contains the <2> loop [for(j=0; j<20; j++)] { }.
Furthermore, as indicated by symbol i in FIG. 5, the portion that can be processed in parallel, such as a for statement, is specified for the PGI compiler by the OpenACC directive #pragma acc kernels (parallel processing specifying statement) above the <1> loop [for(i=0; i<10; i++)] { }.
Data is transferred from the CPU to the GPU by #pragma acc kernels, as shown in the dashed frame surrounding symbol i in FIG. 5. Here, since a and b are transferred at this timing, they are transferred 10 times.

Also, as indicated by symbol j in FIG. 5, the portion that can be processed in parallel, such as a for statement, is specified for the PGI compiler by the OpenACC directive #pragma acc kernels above the <3> loop [for(k=0; k<30; k++)] { }.
As shown in the dashed frame surrounding symbol j in FIG. 5, c and d are transferred at this timing by #pragma acc kernels.

Here, #pragma acc kernels is not specified above the <4> loop [for(l=0; l<40; l++)] { }. This loop is not processed by the GPU because GPU processing would be inefficient.

FIG. 6 is a diagram showing a loop statement when data is transferred from the CPU to the GPU and from the GPU to the CPU by nest integration (Non-Patent Document 2).
In the loop statements shown in FIG. 6, a data transfer instruction line from the CPU to the GPU, here #pragma acc data copyin(a, b) with the copyin clause for variables a and b, is inserted at the position indicated by symbol k in FIG. 6. In this specification, parentheses ( ) are attached to copyin(a, b) for notational reasons. The same notation is used for copyout(a, b) and data copyin(a, b, c, d) described later.
The above #pragma acc data copyin(a, b) is specified at the highest-level loop that does not include the setting or definition of variable a (here, above the <1> loop [for(i=0; i<10; i++)] { }).
Since a and b are transferred at the timing shown in the frame enclosed by the dashed line including the symbol k in FIG. 6, one transfer occurs.

In the loop statements shown in FIG. 6, a data transfer instruction line from the GPU to the CPU, here #pragma acc data copyout(a, b) with the copyout clause for variables a and b, is inserted at the position indicated by symbol l in FIG. 6.
The above #pragma acc data copyout(a, b) is specified below the <1> loop [for(i=0; i<10; i++)] { }.

In this way, in data transfer from the CPU to the GPU, the transfer is instructed explicitly by inserting #pragma acc data copyin(a, b) with the copyin clause for variable a at the above-mentioned position. As a result, data can be transferred collectively at as high a loop level as possible, which avoids the inefficient transfer in which data is transferred in every iteration, as in the simple GPU utilization loop statements shown in FIG. 5.
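The difference between the two comparative examples can be expressed as a simple count. This is a deliberately simplified model, and the function name and parameters are assumptions: a transfer that stays inside enclosing loops is repeated once per iteration, whereas a transfer hoisted above the outermost loop happens once.

```python
def cpu_to_gpu_transfers(enclosing_iterations, hoisted):
    """Number of times a variable set is sent from the CPU to the GPU.

    enclosing_iterations -- iteration counts of the loops around the transfer
    hoisted              -- True if a data copyin above the top loop is used
    """
    if hoisted:
        return 1
    count = 1
    for iterations in enclosing_iterations:
        count *= iterations
    return count
```

For variables a and b inside the <1> loop with 10 iterations, the model gives 10 transfers for simple GPU use and 1 transfer with nest integration, matching the counts stated for FIG. 5 and FIG. 6.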

<Embodiment>
Next, this embodiment will be described.
《Variables that do not need to be transferred are specified using data present》
In this embodiment, variables that are defined in multiple files and for which GPU processing and CPU processing are not nested, so that CPU processing and GPU processing can be separated, are specified for collective transfer using the OpenACC data copy statement. In addition, variables that have been transferred collectively and do not need to be transferred at a given timing are specified explicitly using data present.

FIG. 7 is a diagram showing a loop statement by transfer integration at the time of data transfer between the CPU and GPU of this embodiment. FIG. 7 corresponds to the nest integration in FIG. 6 of the comparative example.
In the loop statements shown in FIG. 7, a data transfer instruction line from the CPU to the GPU, here #pragma acc data copyin(a, b, c, d) with the copyin clause for variables a, b, c, and d, is inserted at the position indicated by symbol m in FIG. 7.
The above #pragma acc data copyin(a, b, c, d) is specified at the highest-level loop that does not include the setting or definition of variable a (here, above the <1> loop [for(i=0; i<10; i++)] { }).

In this way, variables that are defined in multiple files and for which GPU processing and CPU processing are not nested, so that CPU processing and GPU processing can be separated, are specified for collective transfer using the OpenACC data copy statement #pragma acc data copyin(a, b, c, d).
Since a, b, c, and d are transferred at the timing indicated by the dashed-dotted frame surrounding the symbol m in FIG. 7, one transfer occurs.

Variables that have been transferred collectively using the above #pragma acc data copyin(a, b, c, d) and do not need to be transferred at that timing are then specified, at the timing indicated by the two-dot chain line frame surrounding symbol n in FIG. 7, using the data present statement #pragma acc data present(a, b), which states explicitly that the GPU already holds the variables.

Likewise, at the timing indicated by the two-dot chain line frame surrounding symbol o in FIG. 7, such variables are specified using the data present statement #pragma acc data present(c, d), which states explicitly that the GPU already holds the variables.
At the timing when the <1> and <3> loops have been processed by the GPU and GPU processing is complete, a data transfer instruction line from the GPU to the CPU, here #pragma acc data copyout(a, b, c, d) with the copyout clause for variables a, b, c, and d, is inserted at position p in FIG. 7, where the <3> loop ends.

By specifying batch transfer, variables that can be transferred together are transferred in one batch, and variables that have already been transferred and need no further transfer are marked using data present. This reduces the number of transfers and further improves the efficiency of the offloading method. However, depending on the compiler, a transfer may still be performed on the compiler's own judgment even when OpenACC directives instruct otherwise. Such automatic transfer is a phenomenon in which a transfer between the CPU and the GPU that is actually unnecessary is performed anyway by the compiler, contrary to the OpenACC instructions.

<<Temporary storage of data>>
FIG. 8 is a diagram showing a loop statement by transfer integration at the time of data transfer between the CPU and GPU of this embodiment. FIG. 8 builds on the nest integration and the explicit marking of variables that need no transfer described above.
In the loop statement shown in FIG. 8, the OpenACC declare create statement #pragma acc declare create, which creates a temporary area for CPU-GPU data transfer, is specified at the position indicated by symbol q in FIG. 8. As a result, a temporary area is created when data is transferred between the CPU and GPU, and the data is stored in that temporary area.

Also, at the position indicated by symbol r in FIG. 8, the OpenACC update statement #pragma acc update, which synchronizes the temporary area, is specified to instruct the transfer.

In this way, unnecessary CPU-GPU transfers are blocked by creating a temporary area, initializing parameters in the temporary area, and using it for CPU-GPU transfers. The OpenACC instructions can reduce transfers that unintentionally degrade performance.

[GPU offload processing]
The batch processing technique for data transfer described above makes it possible to extract loop statements suitable for offloading and avoid inefficient data transfer.
However, some programs are not suitable for GPU offloading even when the batch processing method of data transfer is used. Effective GPU offloading requires that the processing to be offloaded has a large number of loop iterations.

Therefore, in this embodiment, the number of loop iterations is investigated using a profiling tool as a preliminary step before the full-scale search for offload processing. A profiling tool makes it possible to investigate how many times each line is executed. This allows programs to be screened in advance, for example by making only programs with loops of 50 million iterations or more the target of the offload processing search. A specific description is given below (partially overlapping the content described for FIG. 2).

In this embodiment, first, the application code analysis unit 112 (FIG. 1) analyzes the application and identifies loop statements such as for, do, and while. Next, sample processing is executed, the number of iterations of each loop statement is investigated using the profiling tool, and whether to carry out the full-scale search is determined based on whether there is a loop whose iteration count exceeds a certain value.

When it is decided to carry out the full-scale search, the GA processing is entered (see FIG. 2). In the initialization step, all loop statements of the application code are checked for whether they can be parallelized, and the parallelizable loop statements are then mapped to a gene array, with 1 meaning that GPU processing is performed and 0 meaning that it is not. Genes are prepared for the specified number of individuals, and 1 and 0 are randomly assigned to each value of each gene.

Here, in the code corresponding to the gene, an explicit instruction for data transfer (#pragma acc data copyin/copyout/copy) is added from the variable data reference relationship within the loop statement specified to be processed by the GPU.

In the evaluation step, the code corresponding to the gene is compiled, deployed to the verification machine, and executed, and benchmark performance is measured. The fitness of genes whose patterns perform well is then raised. As described above, the code corresponding to the gene has a parallel processing instruction line (for example, reference symbol f in FIG. 4) and data transfer instruction lines (for example, reference symbol h in FIG. 4 and reference symbol i in FIG. 5) inserted.

In the selection step, genes with high fitness are selected for the specified number of individuals based on the fitness. In this embodiment, roulette selection according to goodness of fit and elite selection of genes with the highest goodness of fit are performed. In the crossover step, at a constant crossover rate Pc, some genes are exchanged between the selected individuals at one point to create offspring individuals. In the mutation step, each value of an individual's gene is changed from 0 to 1 or 1 to 0 at a constant mutation rate Pm.

When the mutation step is completed and the specified number of next-generation genes has been created, explicit data transfer instructions are added in the same way as in the initialization step, and the evaluation, selection, crossover, and mutation steps are repeated.

Finally, in the termination determination step, the processing is terminated after the specified number of generations has been repeated, and the gene with the highest fitness is taken as the solution. The code pattern with the highest performance, which corresponds to the gene with the highest fitness, is redeployed to the production environment and provided to the user.
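The initialization, evaluation, selection, crossover, mutation, and termination steps described above can be sketched as follows. This is only an illustrative sketch, not the implementation of the embodiment: Python is used for brevity, and the callback `measure` is a hypothetical stand-in for "insert directives, compile, deploy to the verification machine, and run the benchmark", returning a processing time in seconds. The fitness (processing time)^(-1/2) and the rates Pc = 0.9 and Pm = 0.05 are the example values given in this description.

```python
import random

def fitness(seconds):
    # Fitness setting used in this description: (processing time)^(-1/2).
    return seconds ** -0.5

def next_generation(population, fits, pc, pm, rng):
    # Elite selection: the fittest individual is carried over unchanged.
    best = max(range(len(population)), key=lambda i: fits[i])
    children = [population[best][:]]
    while len(children) < len(population):
        # Roulette selection: probability proportional to fitness.
        a = rng.choices(population, weights=fits, k=1)[0][:]
        b = rng.choices(population, weights=fits, k=1)[0][:]
        if len(a) > 1 and rng.random() < pc:
            cut = rng.randrange(1, len(a))  # one-point crossover
            a, b = a[:cut] + b[cut:], b[:cut] + a[cut:]
        for child in (a, b):
            # Mutation: flip each gene value with probability pm.
            child = [bit ^ 1 if rng.random() < pm else bit for bit in child]
            if len(children) < len(population):
                children.append(child)
    return children

def search(gene_length, individuals, generations, measure,
           pc=0.9, pm=0.05, seed=0):
    # `measure` is a hypothetical callback: gene -> processing time in seconds.
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(gene_length)]
                  for _ in range(individuals)]
    cache = {}  # a gene pattern seen before reuses its earlier measurement
    for _ in range(generations):
        for gene in population:
            key = tuple(gene)
            if key not in cache:
                cache[key] = measure(gene)
        fits = [fitness(cache[tuple(g)]) for g in population]
        population = next_generation(population, fits, pc, pm, rng)
    best = min(cache, key=cache.get)  # shortest measured processing time
    return list(best), cache[best]
```

The cache reflects the rule that a gene pattern identical to an earlier one is neither recompiled nor remeasured. The solution is taken from the measured patterns only, so the final unmeasured population produced after the last generation is discarded.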

The implementation of the offload server 1 will be described below. This implementation is for confirming the effectiveness of this embodiment.
[implementation]
An implementation of automatic offloading of C/C++ applications using a general-purpose PGI compiler is described.
Since the purpose of this implementation is to confirm the validity of automatic GPU offloading, the target application is a C/C++ language application, and the GPU processing itself is explained using a conventional PGI compiler.

The C/C++ language boasts top popularity in the development of OSS (Open Source Software) and proprietary software, and many applications are being developed in the C/C++ language. In order to check the offloading of applications used by general users, OSS general-purpose applications such as encryption processing and image processing are used.

GPU processing is performed by the PGI compiler. The PGI compiler is a C/C++/Fortran compiler that understands OpenACC. In this embodiment, a parallel-capable processing unit such as a for statement is specified by an OpenACC directive #pragma acc kernels (parallel processing specifying statement). This enables GPU offloading by extracting bytecodes for GPUs and executing them. In addition, an error is generated when the data in the for statement is dependent on each other and cannot be processed in parallel, or when multiple layers of nested for statements are specified. In addition, directives such as #pragma acc data copyin/copyout/copy can be used to explicitly instruct data transfer.

In accordance with the specification made by the above #pragma acc kernels (parallel processing specification statement), an OpenACC data transfer directive such as #pragma acc data copyout(a[...]) is inserted at the above-mentioned position to give an explicit data transfer instruction.

<Overview of implementation behavior>
An overview of the operation of the implementation is described below.
The implementation performs the following processing.
Before starting the processing of the flow shown in FIGS. 9A and 9B below, prepare a C/C++ application to be accelerated and a benchmark tool for measuring its performance.

In the implementation, when there is a request to use a C/C++ application, the code of the C/C++ application is first analyzed to find for statements and to understand the program structure, such as the variable data used in the for statements. The LLVM/Clang syntax analysis library is used for syntax analysis.

In the implementation, to get a sense of whether the application is likely to benefit from GPU offloading, a benchmark is first run and the number of iterations of each for statement identified in the above syntax analysis is determined. GNU coverage gcov or the like is used to determine the number of iterations. "GNU Profiler (gprof)" and "GNU Coverage (gcov)" are known as profiling tools. Either can be used because both can examine the execution count of each line. For example, only applications with loop counts of 10 million or more may be targeted, and this threshold can be changed.
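The screening described above can be illustrated with a short sketch. It assumes the plain-text `.gcov` annotation layout, in which each row has the form `count: line number: source`, with `-` for non-executable lines and `#####` for lines never executed. The function names and the simple pattern for detecting a for statement are hypothetical, and the count reported on the line of the for statement is treated only as an approximation of the loop count.

```python
import re

# Hypothetical, simplified detection of a line that starts a for statement.
FOR_LINE = re.compile(r"^\s*for\s*\(")

def loop_counts(gcov_text):
    """Return {line_number: execution_count} for lines starting a for statement."""
    counts = {}
    for row in gcov_text.splitlines():
        parts = row.split(":", 2)  # count : line number : source text
        if len(parts) < 3:
            continue
        count = parts[0].strip().rstrip("*")
        if not count.isdigit():    # "-" or "#####": nothing was executed
            continue
        if FOR_LINE.match(parts[2]):
            counts[int(parts[1].strip())] = int(count)
    return counts

def worth_full_search(gcov_text, threshold=10_000_000):
    """True if any for statement reaches the loop-count threshold."""
    return any(c >= threshold for c in loop_counts(gcov_text).values())
```

The default threshold of 10 million is the example value given above; as stated there, it is meant to be adjustable.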

General-purpose CPU applications are not implemented with parallelization in mind. For this reason, for statements that cannot be processed by the GPU must first be eliminated. To do so, insertion of the GPU processing directives #pragma acc kernels, #pragma acc parallel loop, and #pragma acc parallel loop vector is tried for each for statement, and it is determined whether an error occurs during compilation. There are several types of compilation error. Errors occur, for example, when an external routine is called in a for statement, when different levels of a nested for statement are specified redundantly, when there is processing that exits the for statement midway with break or the like, or when the data in the for statement has data dependencies. The types of compilation error vary by application and other cases exist, but any for statement that causes a compilation error is excluded from processing and no #pragma directive is inserted for it.

It is difficult to deal with compilation errors automatically, and dealing with them is often ineffective. In the case of external routine calls, the error can sometimes be avoided with #pragma acc routine, but many external calls are library calls, and even if they are included in GPU processing, the call becomes a bottleneck and performance does not improve. Since each for statement is tried one at a time, no compilation error related to nesting occurs at this stage. When a loop is exited midway by break or the like, the number of loop iterations must be fixed for parallel processing, which requires modifying the program. When there is data dependence, parallel processing cannot be performed in the first place.

Here, if the number of loop statements that cause no error even when parallel processing is specified is a, then a is the gene length. A gene value of 1 corresponds to the presence of a parallel processing directive and 0 to its absence, and the application code is mapped to a gene of length a.

Next, as initial values, gene sequences are prepared for the specified number of individuals. Each gene value is created by randomly assigning 0 or 1, as described above. According to the prepared gene sequence, the directives #pragma acc kernels, #pragma acc parallel loop, and #pragma acc parallel loop vector, which specify GPU processing, are inserted into the C/C++ code where the gene value is 1. The reason single loops and the like are not specified with parallel is that, for the same processing, kernels gives better performance with the PGI compiler. At this stage, the part of the code corresponding to a given gene that is to be processed by the GPU is determined.

The C/C++ code with the parallel processing and data transfer directives inserted is compiled with the PGI compiler on a machine equipped with a GPU. The compiled executable is deployed, and its performance and power usage are measured with the benchmark tool.

After benchmark performance has been measured for all individuals, the fitness of each gene sequence is set according to the benchmark processing time and power consumption. The individuals to be kept are selected according to the fitness that has been set. The selected individuals undergo GA processing such as crossover, mutation, and copying as is, to create the population of the next generation.

Directive insertion, compilation, performance measurement, fitness setting, selection, crossover, and mutation processing are performed on the next-generation individuals. Here, in the GA processing, if the gene with the same pattern as before occurs, the individual is not compiled and the performance measurement is not performed, and the same measured value as before is used.

After completing GA processing for the specified number of generations, the solution is the C/C++ code with directives that corresponds to the gene sequence with the highest performance.

Among these, the number of individuals, the number of generations, the crossover rate, the mutation rate, the fitness setting, and the selection method are parameters of the GA and are specified separately. By automating the above processing, the proposed technology makes it possible to automate GPU offloading, which conventionally required the time and skills of a specialized engineer.

FIGS. 9A-B are flow charts outlining the operation of the implementation described above, and FIGS. 9A and 9B are connected by a connector.
The following processing is performed using the OpenACC compiler for C/C++.

<Code Analysis>
In step S101, the application code analysis unit 112 (see FIG. 1) performs code analysis of the C/C++ application.

<Specify loop statement>
In step S102, the parallel processing designation unit 114 (see FIG. 1) identifies loop statements and reference relationships of the C/C++ application.

<Possibility of parallel processing of loop statements>
In step S103, the parallel processing designation unit 114 checks the GPU processability of each loop statement (#pragma acc kernels).

<Repeat loop statement>
The control unit (automatic offload function unit) 11 repeats the processing of steps S105 to S116 for the number of loop statements, between the loop start of step S104 and the loop end of step S117.

<Repetition of number of loops (Part 1)>
The control unit (automatic offload function unit) 11 repeats the processing of steps S106 and S107 for the number of loop statements, between the loop start of step S105 and the loop end of step S108.
In step S106, the parallel processing designation unit 114 specifies GPU processing (#pragma acc kernels) with OpenACC for each loop statement and compiles it.
In step S107, when an error occurs, the parallel processing designation unit 114 checks GPU processability with the next directive (#pragma acc parallel loop).

<Repetition of number of loops (Part 2)>
The control unit (automatic offload function unit) 11 repeats the processing of steps S110 and S111 for the number of loop statements, between the loop start of step S109 and the loop end of step S112.
In step S110, the parallel processing designation unit 114 specifies GPU processing (#pragma acc parallel loop) with OpenACC for each loop statement and compiles it.
In step S111, when an error occurs, the parallel processing designation unit 114 checks GPU processability with the next directive (#pragma acc parallel loop vector).

<Repetition of number of loops (Part 3)>
The control unit (automatic offload function unit) 11 repeats the processing of steps S114 and S115 for the number of loop statements, between the loop start of step S113 and the loop end of step S116.
In step S114, the parallel processing designation unit 114 specifies GPU processing (#pragma acc parallel loop vector) with OpenACC for each loop statement and compiles it.
In step S115, when an error occurs, the parallel processing designation unit 114 removes the GPU processing directive from the loop statement.

<count the number of for statements>
In step S118, the parallel processing designating unit 114 counts the number of loop statements (here, for statements) in which no compilation error occurs, and sets the number as the gene length.
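As a non-authoritative illustration of steps S103 to S118, the order in which the three directives are tried can be sketched as follows. The callback `compiles` is a hypothetical stand-in for invoking the OpenACC compiler on one loop statement with one directive and reporting whether compilation succeeded. The flow chart performs three passes over all loop statements, whereas this sketch tries the directives per loop, which leads to the same choice for each loop.

```python
# Directives in the order in which the flow chart tries them.
DIRECTIVES = (
    "#pragma acc kernels",
    "#pragma acc parallel loop",
    "#pragma acc parallel loop vector",
)

def choose_directive(loop_id, compiles):
    """Return the first directive that compiles for this loop, or None."""
    for directive in DIRECTIVES:
        # `compiles` is hypothetical: (loop_id, directive) -> bool
        if compiles(loop_id, directive):
            return directive
    return None  # every trial failed: the loop gets no directive

def offloadable_loops(loop_ids, compiles):
    """Map each loop that compiled without error to its chosen directive."""
    chosen = {}
    for loop_id in loop_ids:
        directive = choose_directive(loop_id, compiles)
        if directive is not None:
            chosen[loop_id] = directive
    return chosen  # len(chosen) corresponds to the gene length
```

Loops for which all three trials fail are simply left out, so the size of the returned mapping is the gene length set in step S118.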

<Specified population pattern preparation>
Next, as an initial value, the parallel processing designation unit 114 prepares gene sequences for the designated number of individuals. Here, 0 and 1 are randomly assigned and created.
In step S119, the parallel processing designating unit 114 maps the C/C++ application code to genes and prepares a designated population pattern.
Depending on the prepared gene sequence, a directive specifying parallel processing is inserted into the C/C++ code when the value of the gene is 1 (see, for example, the #pragma directive in FIG. 3).
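The mapping from a gene sequence to directive-annotated code can be sketched as below. This is a simplified illustration under stated assumptions: the source is handled as a list of text lines, `loop_lines` holds the 0-based line index of each candidate loop in gene order, and `directives` holds the directive already chosen for each candidate loop. All three names are hypothetical.

```python
def apply_gene(source_lines, loop_lines, directives, gene):
    """Insert a directive above each candidate loop whose gene value is 1.

    source_lines: C/C++ source as a list of lines
    loop_lines:   0-based line index of each candidate loop, in gene order
    directives:   directive text for each candidate loop, in gene order
    gene:         list of 0/1 values, one per candidate loop
    """
    if not (len(loop_lines) == len(directives) == len(gene)):
        raise ValueError("gene length must match the number of candidate loops")
    inserts = {line: d
               for line, d, bit in zip(loop_lines, directives, gene)
               if bit == 1}
    out = []
    for index, text in enumerate(source_lines):
        if index in inserts:
            # Reuse the loop's indentation for the inserted directive.
            indent = text[:len(text) - len(text.lstrip())]
            out.append(indent + inserts[index])
        out.append(text)
    return out
```

A gene consisting only of 0 values returns the source unchanged, which corresponds to running everything on the CPU.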

The control unit (automatic offload function unit) 11 repeats the processing of steps S121 to S130 for the specified number of generations, between the loop start of step S120 and the loop end of step S131 in FIG. 9B.
Within this repetition over the specified number of generations, the processing of steps S122 to S125 is repeated for the specified number of individuals, between the loop start of step S121 and the loop end of step S126. That is, the repetition over the specified number of individuals is nested inside the repetition over the specified number of generations.

<Data transfer specification>
In step S122, the data transfer designation unit 113 specifies data transfer using explicit instruction lines (#pragma acc data copy/copyin/copyout/present, #pragma acc declare create, and #pragma acc update) based on the variable reference relationships.

<compile>
In step S123, the parallel processing pattern creating unit 117 (see FIG. 1) compiles the C/C++ code specified by the directive according to the gene pattern using the PGI compiler. That is, the parallel processing pattern creation unit 117 compiles the created C/C++ code with the PGI compiler on the verification machine 14 having a GPU.
Here, a compilation error may occur when multiple nested for statements are specified in parallel. This case is handled in the same way as when the processing time times out during performance measurement.

In step S124, the performance measurement unit 118 (see FIG. 1) deploys the execution file to the verification machine 14 equipped with the CPU-GPU.
In step S125, the performance measurement unit 118 executes the arranged binary file and measures the benchmark performance when offloading.

Here, in intermediate generations, genes with the same pattern as an earlier one are not measured again, and the earlier value is used. In other words, when a gene with the same pattern as before arises during GA processing, that individual is neither compiled nor measured, and the previously measured value is reused.

At step S127, the performance measurement unit 118 (see FIG. 1) measures the processing time.

At step S128, the performance measurement unit 118 sets an evaluation value based on the measured processing time.

In step S129, the executable file creation unit 119 (see FIG. 1) evaluates individuals so that those with shorter processing times have higher fitness, and selects individuals with high performance. The executable file creation unit 119 selects, from among the measured patterns, a pattern with short processing time and low power consumption as the solution.

In step S130, the executable file creation unit 119 performs crossover and mutation processing on the selected individuals to create next-generation individuals. The executable file creation unit 119 performs compilation, performance measurement, fitness setting, selection, crossover, and mutation processing for the next-generation individuals.
That is, after benchmark performance has been measured for all individuals, the fitness of each gene sequence is set according to the benchmark processing time. The individuals to be kept are selected according to the fitness that has been set. The executable file creation unit 119 applies GA processing such as crossover, mutation, and copying as is to the selected individuals to create the population of the next generation.

In step S132, the executable file creation unit 119 takes the C/C++ code corresponding to the highest performance gene sequence (highest performance parallel processing pattern) as a solution after the GA processing for the designated number of generations is completed.

<GA parameters>
The number of individuals, number of generations, crossover rate, mutation rate, fitness setting, and selection method are parameters of the GA. The parameters and conditions of the Simple GA to be executed can be set as follows, for example.
Gene length: Number of loop statements that can be parallelized
Number of individuals M: Less than gene length
Number of generations T: Less than gene length
Fitness: (Processing time)^(-1/2)

With this setting, the shorter the benchmark processing time, the higher the fitness. In addition, by setting the fitness to the (-1/2) power of the processing time, it is possible to prevent the search range from narrowing because the fitness of a particular individual with a short processing time becomes too high. If the performance measurement does not finish within a certain period, it is timed out, and the fitness is calculated assuming that the processing time is 1000 seconds (a long time). This timeout period may be changed according to the performance measurement characteristics.
Selection: Roulette selection. However, elite preservation is also performed, in which the gene with the highest fitness in a generation is carried over to the next generation without crossover or mutation.
Crossover rate Pc: 0.9
Mutation rate Pm: 0.05
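The fitness setting and the timeout handling listed above can be written down as a small sketch. The function and constant names are hypothetical, and the 1000-second value is the example given in this description.

```python
TIMEOUT_PENALTY_SECONDS = 1000.0  # processing time assumed after a timeout

def fitness(seconds, timed_out=False):
    """Fitness = (processing time)^(-1/2); a timeout counts as 1000 seconds."""
    if timed_out:
        seconds = TIMEOUT_PENALTY_SECONDS
    if seconds <= 0:
        raise ValueError("processing time must be positive")
    return seconds ** -0.5

def fitness_ratio(fast_seconds, slow_seconds):
    """How many times fitter the faster individual is than the slower one."""
    return fitness(fast_seconds) / fitness(slow_seconds)
```

With this setting, an individual that is 100 times faster than another is only about 10 times fitter, whereas a fitness of 1/(processing time) would make it 100 times fitter. This is the damping that keeps one fast individual from dominating roulette selection and narrowing the search range.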

<Cost performance>
The cost performance of the automatic offload function is described.
Looking only at the hardware price of GPU boards such as NVIDIA Tesla, a machine with a GPU costs about twice as much as a normal CPU-only machine. In general, however, hardware and system development account for 1/3 or less of data center costs, operating costs such as electricity and the maintenance and operation system account for more than 1/3, and other costs such as service orders account for about 1/3. In this embodiment, the performance of time-consuming processing in applications such as encryption processing and image processing can be improved by a factor of two or more. Therefore, even if the price of the server hardware itself doubles, a sufficient cost benefit can be expected.

In this embodiment, gcov, gprof, etc. are used to identify in advance an application that has many loops and takes a long time to execute, and offloading is attempted. This allows you to find applications that can be efficiently accelerated.

<Time to start using the actual service>
The time until the actual service can start being used is described below.
Assuming that one cycle from compilation to performance measurement takes about 3 minutes, a GA with 20 individuals and 20 generations takes about 20 hours at maximum. Because compilation and measurement are skipped for gene patterns identical to earlier ones, the search is expected to finish in 8 hours or less. In practice, many cloud, hosting, and network services take about half a day before they can be used. In this embodiment, automatic offloading within half a day, for example, is possible. Therefore, if automatic offloading completes within half a day and trial use is possible from the outset, user satisfaction can be expected to be sufficiently high.

To search for the offload part in a shorter time, performance could be measured in parallel on as many verification machines as there are individuals. Adjusting the timeout period according to the application also shortens the search. For example, a measurement may be timed out if the offloaded processing takes twice as long as execution on the CPU. In addition, the larger the number of individuals and the number of generations, the higher the likelihood of finding a high-performance solution. However, when each parameter is maximized, compilation and performance benchmarking must be performed as many times as the number of individuals multiplied by the number of generations, so it takes longer before the actual service can be used. In this embodiment, the GA is performed with a small number of individuals and a small number of generations, but by setting the crossover rate Pc to a high value of 0.9 and searching a wide range, a solution with a certain level of performance is found quickly.
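The time estimates above follow from simple arithmetic, which the sketch below makes explicit. The function name and its parameters are hypothetical. The assumption that trials are spread evenly over the verification machines is also made only for illustration.

```python
import math

def search_hours(trials, minutes_per_trial, machines=1):
    """Wall-clock hours to compile and measure `trials` patterns.

    trials:            number of gene patterns actually compiled and measured
    minutes_per_trial: time from compilation to performance measurement
    machines:          verification machines measuring in parallel (assumed
                       to share the trials evenly)
    """
    rounds = math.ceil(trials / machines)
    return rounds * minutes_per_trial / 60.0

worst_case = search_hours(20 * 20, 3)             # every individual measured
with_reuse = search_hours(160, 3)                 # 160 distinct patterns
parallel = search_hours(20 * 20, 3, machines=20)  # one machine per individual
```

Here `worst_case` is 20.0 hours, matching the maximum stated above. The figure of 160 is not from the source; it is simply the number of 3-minute trials that fit into 8 hours, so finishing in 8 hours or less means that at most 160 of the 400 patterns are actually measured. With one verification machine per individual, the worst case drops to 1.0 hour.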

[Expansion of directives]
In this embodiment, the directives are expanded in order to increase the number of applicable applications. Specifically, as directives specifying GPU processing, in addition to kernels directives, parallel loop directives and parallel loop vector directives are expanded.
In the OpenACC standard, kernels is used for single loops and tightly nested loops, parallel loop is used for loops including non-tightly nested loops, and parallel loop vector is used for loops that cannot be parallelized but can be vectorized. Here, a tightly nested loop is a simple nested loop in which, for example, when two loops incrementing i and j are nested, processing using i and j is performed in the inner loop and not in the outer loop. In implementations such as the PGI compiler, there is also the difference that with kernels the compiler decides on parallelization, whereas with parallel the programmer decides.

Therefore, in this embodiment, kernels is used for single and tightly nested loops, and parallel loop is used for non-tightly nested loops. In addition, parallel loop vector is used for loops that cannot be parallelized but can be vectorized.
Here, there is a concern that using the parallel directive may make the results less reliable than with kernels. However, it is assumed that the final offload program is subjected to a sample test, that the difference between its result and the CPU result is checked, and that the result is shown to the user for confirmation. In the first place, since the CPU and GPU are different hardware, there are differences in the number of significant digits and in rounding error, so the difference from the CPU result needs to be checked even when only kernels is used.

[Resource ratio and resource amount settings, and new application allocation flow chart]
FIG. 10 is a flow chart illustrating setting the resource ratio and amount of resources added after a GPU offload attempt and placing a new application. The flow chart shown in FIG. 10 is executed after the GPU offload attempts shown in FIGS. 9A-B.

In step S51, the resource ratio determination unit 115 acquires user operating conditions, test case CPU processing time, and offload device processing time. The user operating conditions are specified by the user when the user specifies the code to be offloaded. The user operating conditions are used when the resource amount setting unit 116 refers to the information in the equipment resource DB 132 and determines the resource amount.

<Processing of Resource Ratio Determining Unit 115>
In step S52, the resource ratio determination unit 115 determines the ratio of the CPU and offload device processing times (test case CPU processing time and offload device processing time) as the resource ratio based on the performance measurement result.

With this automatic offload, the performance measurement results in the verification environment have already been obtained at the time of code conversion. Using this performance measurement result, the resource ratio determination unit 115 determines the resource ratio between the CPU and the offload device. Specifically, an appropriate resource ratio is determined with respect to the ratio of the processing time of the CPU and the offload device in the verification environment. For example, if the test case processing time in the verification environment is CPU processing: 10 seconds and GPU processing: 5 seconds, the resource ratio is CPU:GPU=2:1.

The resource ratio determination unit 115 determines the resource ratio so that the processing times of the CPU and the offload device are of the same order. Determining the resource ratio in this way aligns the processing times of the CPU and the offload device, so that the resource amount can be set appropriately even in a mixed environment in which CPUs coexist with accelerators such as GPUs, FPGAs, and many-core CPUs.

The resource ratio determination unit 115 sets the resource ratio to a predetermined upper limit when the difference between the processing times of the CPU and the offload device is equal to or greater than a predetermined threshold. That is, if the processing time between the CPU and the offload device in the verification environment has a difference of, for example, 10 times or more, increasing the resource ratio to 10 times or more leads to deterioration in cost performance. In this case, for example, a resource ratio such as 5:1 is set as the upper limit (the upper limit is a resource ratio of 5:1 of the processing time). By setting an upper limit on the resource ratio, it is possible to prevent a large increase in the number of VMs.
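The determination of the resource ratio, including the upper limit, can be sketched as follows. The function name is hypothetical. The source states only that a limit such as 5:1 applies when the processing times differ by 10 times or more, so clamping every ratio at the upper limit is an assumption of this sketch that also covers that case.

```python
def resource_ratio(cpu_seconds, device_seconds, cap=5):
    """Return (cpu_units, device_units) that roughly equalise processing time.

    The slower side receives proportionally more resources, rounded to an
    integer and clamped to `cap` (assumed upper limit, 5:1 in the example).
    """
    if cpu_seconds <= 0 or device_seconds <= 0:
        raise ValueError("processing times must be positive")
    cpu_is_slower = cpu_seconds >= device_seconds
    if cpu_is_slower:
        ratio = cpu_seconds / device_seconds
    else:
        ratio = device_seconds / cpu_seconds
    units = min(max(1, round(ratio)), cap)
    return (units, 1) if cpu_is_slower else (1, units)
```

For the test case in this description (CPU processing: 10 seconds, GPU processing: 5 seconds) the sketch returns a CPU:GPU ratio of 2:1, and a 20-fold difference in processing time is held at 5:1.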

<Processing of resource amount setting unit 116>
In step S53, the resource amount setting unit 116 sets the resource amount based on the user operating conditions and the appropriate resource ratio. That is, the resource amount setting unit 116 determines the resource amount while maintaining the resource ratio as much as possible so as to satisfy the cost condition specified by the user.

The resource amount setting unit 116 maintains the appropriate resource ratio and sets the maximum resource amount that satisfies the user operating conditions. As a specific example, assume that one CPU VM costs 1,000 yen/month, one GPU costs 4,000 yen/month, the resource ratio is 2:1, and the user's budget is within 10,000 yen per month. In this case, two CPU units and one GPU unit are secured and placed in the commercial environment.

If the user operating conditions cannot be satisfied even with the minimum resource amount that maintains the resource ratio, the resource amount setting unit 116 abandons the resource ratio and sets the resource amounts of the CPU and the offload device to the minimum so as to satisfy the cost condition. As a specific example, assume that one CPU VM costs 1,000 yen/month, one GPU costs 4,000 yen/month, the resource ratio is 2:1, and the user's budget is within 5,000 yen per month. In this case, the budget is insufficient to maintain the resource ratio, so the resource amounts of the CPU and the offload device are set smaller, that is, one CPU unit and one GPU unit are secured.
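The two cases above can be reproduced with a short sketch. The function name and signature are hypothetical, prices and budget are in yen per month, and the fallback of one unit each follows the example in this description. What should happen when even one unit of each exceeds the budget is not stated in the source and is not handled here.

```python
def resource_amount(ratio, cpu_price, device_price, budget):
    """Return (cpu_units, device_units) for a monthly budget.

    Keeps the resource ratio and takes as many whole multiples of it as the
    budget allows; if not even one multiple fits, falls back to one CPU unit
    and one offload device unit.
    """
    cpu_units, device_units = ratio
    set_price = cpu_units * cpu_price + device_units * device_price
    sets = budget // set_price  # whole multiples of the ratio within budget
    if sets >= 1:
        return cpu_units * sets, device_units * sets
    # Assumed fallback: the ratio cannot be kept, so use the minimum amounts.
    return 1, 1
```

With a 2:1 ratio, 1,000 yen per CPU VM, and 4,000 yen per GPU, one multiple of the ratio costs 6,000 yen. A budget of 10,000 yen therefore yields two CPU units and one GPU unit, and a budget of 5,000 yen falls back to one of each.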

After the processing of step S53 above is completed and the resources are secured and allocated in the commercial environment, the automatic verification described in FIG. 2 is executed in order to confirm the performance and cost before use by the user. As a result, resources can be reserved in a commercial environment, and performance and cost can be presented to the user after automatic verification.

<Summary of resource ratio and resource amount settings>
The performance measurements obtained while searching for the offload pattern are used to optimize the resource ratio. The implementation determines the resource ratio from the test case processing times so that the CPU and GPU processing times are of the same order. For example, if the test case processing time is 10 seconds for CPU processing and 5 seconds for GPU processing, doubling the resources on the CPU side is considered to make the processing times about the same, so the resource ratio is 2:1. Since the number of virtual machines and the like is an integer, the resource ratio calculated from the processing times is rounded to an integer ratio.

Once the resource ratio has been determined, the next step is to set the resource amount when deploying the application to the commercial environment. For resource amount determination, the implementation determines the number of VMs, etc. while keeping the resource ratio as much as possible so as to satisfy the cost request specified by the user at the time of the offload request. Specifically, the maximum number of VMs is selected while maintaining the resource ratio within the cost range.

For example, assume that one CPU VM costs 1,000 yen/month, one GPU costs 4,000 yen/month, and the resource ratio is 2:1. If the user's budget is within 10,000 yen per month, 2 CPUs and 1 GPU are secured. If the resource ratio cannot be maintained within the cost range, the resource amount is set so that the resource ratio is as close to the appropriate one as possible, starting from one CPU and one GPU. For example, if the budget is within 5,000 yen per month, the resource ratio cannot be maintained, but 1 CPU and 1 GPU are secured.
Once the amount of resources is set, the implementation uses, for example, the virtualization function of Xen Server to allocate CPU and GPU resources.

In step S54, the placement setting unit 170 calculates and sets the placement location of the new application (APL placement location) using a linear programming method, based on the server and link specification information and the existing application placement information in the equipment resource DB 132.

[Optimal Placement of Converted Applications]
When offloading a program written for a CPU to a device such as a GPU, the offload server 1 of the present embodiment optimizes the location where the application is placed so that it meets the user's requirements, such as cost, and operates with a short response time.

<Appropriate placement of applications>
In this embodiment, it is assumed that applications can be placed not only in the cloud but also at network edges and user edges. However, servers at the network edge and user edge are distributed and less concentrated than in the cloud. Therefore, the cost of computing resources there is relatively high compared to the cloud. In other words, although the price of hardware such as CPUs and GPUs is generally the same regardless of where it is located, a data center that operates a cloud can monitor its centralized servers and control their air conditioning collectively, so its operation costs are relatively low.
For example, a simple topology of computation nodes and links is shown in FIG. 11.

FIG. 11 is a diagram showing an example of the topology of computation nodes. In the topology of FIG. 11, data is sent from IoT devices that collect data in the user environment, as in an IoT system, to the user edge, the data is then sent to the cloud via the network edge, and the analysis results are viewed by company executives.

As shown in FIG. 11, the topology for arranging applications consists of three layers, the number of bases in the cloud layer (eg, data center) is "2" (n13, n14), and the carrier edge layer (eg, office) is "3", the user edge layer (eg, user environment) is "4" (n6-n9), and the input node is "5" (n1-n5).
Assuming applications such as IoT, IoT data (from IoT devices such as pollen sensors and body temperature sensors) is collected from the input nodes to the user edge. Depending on the characteristics of the application (response time requirements, etc.), analysis processing is performed at the user edge or carrier edge, or after the data has been uploaded to the cloud. The number of output nodes is "1" (n15), and the analysis results are viewed by company executives. For example, when the input nodes provide IoT data from pollen sensors, a person in charge at the Japan Meteorological Agency checks the statistics and analysis results at the output node.
The arrangement topology of three layers shown in FIG. 11 is an example, and may be, for example, five layers. Also, the number of user edges and carrier edges may actually be several tens to several hundred.

Computing nodes are divided into three types: CPU, GPU, and FPGA. Nodes equipped with GPUs and FPGAs are also equipped with CPUs, but virtualization technology (for example, NVIDIA vGPU) provides separate GPU instances and FPGA instances that also include CPU resources.

Applications are deployed in the cloud, carrier edge, and user edge, and the closer to the user environment, the lower the response time, but the higher the cost of computing resources. In this embodiment, an application converted for GPU or FPGA is arranged, and the user can issue two types of requests when arranging the application.
The first is a cost request, which specifies the permissible cost of computing resources for operating the application, for example, within 5000 yen per month. The second is a response time request, which specifies an allowable response time for operating an application, such as returning a response within 10 seconds. In the conventional equipment design, for example, locations for arranging servers that accommodate virtual networks are systematically designed in consideration of long-term trends such as traffic increases.

This embodiment has the following features (1) and (2). (1) Applications to be placed are not statically determined, but are automatically converted for GPUs and FPGAs, and patterns suitable for usage forms are extracted through actual measurements through GA and the like. Because of this, application code and performance can change dynamically.
(2) It is not enough to reduce only the carrier's equipment cost and overall response time, but it is necessary to meet individual user requirements for computational resource cost and response time. Application placement policies can also change dynamically.

In consideration of features (1) and (2) above, application placement in this embodiment takes the following form: when a user requests placement, the application is converted, and converted applications are placed one after another on the servers that are appropriate at that time. If converting the application does not improve the cost performance, the application before conversion should be placed. For example, when a GPU instance costs twice as much as a CPU instance and the conversion does not improve performance by more than a factor of two, it is better to place the application before conversion. Also, if the computational resources or bandwidth of a server have already been used up to the upper limit, the application may not be placeable on that server.
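The cost-performance check described above can be sketched as a single comparison. The function name `should_place_converted` and its parameters are hypothetical; the rule itself (the speedup must exceed the price ratio) is taken from the example in the preceding paragraph.

```python
def should_place_converted(cpu_price, accelerator_price, speedup):
    """Return True if the converted application improves cost performance.

    speedup is the measured performance gain of the converted application
    relative to CPU-only execution (for example 5.0 for "five times faster").
    """
    price_ratio = accelerator_price / cpu_price
    # Conversion pays off only if performance grows faster than price
    return speedup > price_ratio


# A GPU instance at twice the CPU price needs more than a 2x speedup
print(should_place_converted(1000, 2000, 2.0))
print(should_place_converted(1000, 2000, 5.0))
```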

<Linear programming formula for appropriate placement of applications>
In this embodiment, a linear programming formulation is used to calculate appropriate placement locations of applications. Specifically, the parameters of the linear programming formulas shown in [Formula 1] (formulas (1) to (4) below) and [Formula 2] (formulas (3) to (6) below) are used.

Here, device and link costs, computational resource upper limits, bandwidth upper limits, and the like depend on the servers and networks prepared by the business operator. Therefore, those parameter values are set in advance by the operator. The computational resource amount, bandwidth, data capacity, and processing time used by the application when offloaded are set automatically by the environment adaptation function from the measured values of the offload pattern that was finally selected in the tests in the verification environment before automatic conversion.
The objective function and constraints on the parameters of the linear programming formula change depending on whether the user request is a cost request for computational resources or a response time request.

• Parameters of the linear programming formula for a cost request: If the cost request requires placement within a specified amount per month, the parameters of the linear programming formula shown in [Formula 1] below are used.

The objective function is the minimization of the response time in formula (1). One constraint is that the computational resource cost in formula (2) stays within the amount specified by the user. In addition, the constraints of formulas (3) and (4), which require that the resource upper limits of the servers are not exceeded, are added.

• Parameters of the linear programming formula for a response time request: If the response time request requires placement such that the application responds within a specified number of seconds, the parameters of the linear programming formula shown in [Formula 2] below are used.

The objective function is the minimization of the computational resource cost in formula (5), which corresponds to formula (2). One constraint is that the response time in formula (6), which corresponds to formula (1), stays within the number of seconds specified by the user. In addition, the constraints of formulas (3) and (4) are added.

・Explanation of the parameters of the linear programming formulas: Formulas (1) and (6) calculate the response time Rk of application k. In formula (1), Rk is the objective function; in formula (6), Rk is a constraint with a user-specified upper limit.

Formulas (2) and (5) calculate the cost (price) Pk of operating application k. In formula (2), Pk is a constraint with a user-specified upper limit; in formula (5), Pk is the objective function.

Formulas (3) and (4) are constraints that set the upper limit of computational resources and communication bandwidth, are calculated including applications deployed by others, and prevent the resource upper limit from being exceeded due to the placement of applications by new users.

The linear programming formulas (1) to (4) and (3) to (6) are solved with a linear programming solver such as GLPK (GNU Linear Programming Kit) or CPLEX (IBM Decision Optimization) for different conditions such as the network topology, the type of converted application (its increase in cost and performance relative to the CPU, etc.), user requirements, and existing applications, so that an appropriate application placement can be calculated. By performing actual placement sequentially for a plurality of users after each placement calculation, a plurality of applications are placed based on each user's request.
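The swap between objective and constraint can be illustrated with a small sketch. It does not reproduce formulas (1) to (6) and does not call GLPK or CPLEX. It assumes that the price and response time of the application at each candidate site have already been computed, and it picks a single site by exhaustive search. The `Site` class, its fields, and `choose_site` are all hypothetical names introduced for this example.

```python
from dataclasses import dataclass


@dataclass
class Site:
    name: str
    price: int           # monthly price of running the application here (yen)
    response: float      # response time if placed here (seconds)
    free_resource: int   # remaining computational resource at the site
    free_bandwidth: int  # remaining bandwidth on the links to the site


def choose_site(sites, request, limit, need_resource, need_bandwidth):
    """Pick one site by enumeration (a stand-in for an LP solver).

    request is "cost" (limit is a price upper bound, response time is
    minimized) or "response" (limit is a response time upper bound,
    price is minimized). Returns None when no site is feasible.
    """
    # Resource and bandwidth upper limits apply to both request types
    feasible = [s for s in sites
                if s.free_resource >= need_resource
                and s.free_bandwidth >= need_bandwidth]
    if request == "cost":
        feasible = [s for s in feasible if s.price <= limit]
        return min(feasible, key=lambda s: s.response, default=None)
    if request == "response":
        feasible = [s for s in feasible if s.response <= limit]
        return min(feasible, key=lambda s: s.price, default=None)
    raise ValueError("request must be 'cost' or 'response'")
```

Enumeration is used only because this toy problem has a handful of candidate sites; for the topology sizes used in the evaluation, a solver is the appropriate tool.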

As described above, when there is a new application placement request, calculations are made based on the linear programming formula, and the applications are placed in order.
Since the application programs are placed sequentially, placement is effectively first come, first served. In addition, a placement that minimizes the objective function may be calculated according to the cost and response time specified by the user, and the applications may be relocated to the positions determined by the calculation.

[Evaluation]
Based on a linear programming formula, which is one aspect of the linear programming method, it was confirmed that multiple applications were appropriately arranged using the free solver GLPK (registered trademark) by changing some conditions.

<Evaluation condition>
・Target application The application to be placed performs image processing by Fourier transform, which is assumed to be used by many users. Fourier transform processing (FFT) is used in various aspects of IoT monitoring, such as vibration frequency analysis.
NAS.FT (https://www.nas.nasa.gov/publications/npb.html) (registered trademark) is one of the open source applications for FFT processing. Perform 2048×2048 size calculations of the provided sample test. When considering an IoT application that transfers data from a device to a network, it is assumed that primary analysis such as FFT processing is performed on the device side before transmission in order to reduce network costs.

MRI-Q (http://impact.crhc.illinois.edu/parboil/) (registered trademark) computes the matrix Q, which represents the scanner configuration for calibration and is used in a three-dimensional MRI reconstruction algorithm in non-Cartesian space. In the IoT environment, image processing is often required for automatic surveillance from camera video, and there is a need for automatic offloading of image processing. MRI-Q is a C-language application that performs 3D MRI image processing during performance measurement, and processing time is measured using the Large 64×64×64 size sample data. CPU processing is written in C, and FPGA processing is based on OpenCL (registered trademark).
With the GPU and FPGA automatic offload technology of this embodiment, NAS.FT can be speeded up by GPU, and MRI-Q can be speeded up by FPGA, which are five times and seven times faster than CPU, respectively.

・Evaluation method The topology for deploying applications consists of three layers as shown in FIG. 11. The number of bases is "5" in the cloud layer, "20" in the carrier edge layer, and "60" in the user edge layer, and the number of input nodes is "300". Assuming applications such as IoT, IoT data is collected from the input nodes to the user edge, and depending on the characteristics of the application (requirements for response time, etc.), analysis processing is performed at the user edge or carrier edge, or after the data has been uploaded to the cloud.

For example, 1000 applications are arranged based on user requirements based on the parameters of the linear programming formulas shown in [Formula 1] and [Formula 2]. The application is an IoT application and is supposed to analyze the data coming from the input node. Placement requests are generated randomly from input nodes (assuming there are "300").
For example, application placement is requested 1000 times at a ratio of NAS.FT:MRI-Q=3:1. As the user request, a price condition or a response time condition is selected for each application when requesting placement. In the case of NAS.FT, an upper limit of 7,000 yen, 8,500 yen, or 10,000 yen per month is selected for the price, and an upper limit of 6 seconds, 7 seconds, or 10 seconds is selected for the response time. In the case of MRI-Q, an upper limit of 12,500 yen or 20,000 yen per month is selected for the price, and an upper limit of 4 seconds or 8 seconds is selected for the response time.

There are three patterns of user request variations.
Pattern 1: For NAS.FT, each of the 6 types of requests is selected with probability 1/6, and for MRI-Q, each of the 4 types of requests is selected with probability 1/4.
Pattern 2: The request selects the condition with the lowest price as the upper limit (first 7,000 yen, 12,500 yen), and if there are no vacancies, the next lowest price condition.
Pattern 3: The request selects the condition with the minimum response time as the upper limit (first 6 seconds, 4 seconds), and if there is no free space, the next fastest response time condition.
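The three request patterns can be sketched as follows. The request lists are taken from the conditions above; the function names and the `can_place` callback, which stands for "a placement satisfying this condition currently exists", are hypothetical.

```python
import random

# (kind, upper limit) pairs from the evaluation conditions
NAS_FT_REQUESTS = [("price", 7000), ("price", 8500), ("price", 10000),
                   ("response", 6), ("response", 7), ("response", 10)]
MRI_Q_REQUESTS = [("price", 12500), ("price", 20000),
                  ("response", 4), ("response", 8)]


def pattern1(requests, rng=random):
    """Pattern 1: pick one request type uniformly at random."""
    return rng.choice(requests)


def pattern2_or_3(requests, kind, can_place):
    """Patterns 2 and 3: try the tightest limit first, then relax.

    kind is "price" for pattern 2 and "response" for pattern 3.
    can_place(kind, limit) is an assumed callback that reports whether
    a placement meeting the limit is currently available.
    """
    limits = sorted(limit for k, limit in requests if k == kind)
    for limit in limits:
        if can_place(kind, limit):
            return kind, limit
    return None
```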

- Placement simulation Placement is evaluated by a simulation experiment using the solver GLPK 5.0 (registered trademark). A simulation is used because the placement targets a large-scale network. In actual use, when an application offload request is received, an offload pattern is created through repeated performance tests in the verification environment, and the appropriate amount of resources is determined based on the performance test results in the verification environment (FIG. 10). Then, according to the user's request, an appropriate placement is determined using GLPK or the like, normality confirmation tests and performance tests are automatically performed after actual deployment, the results and price are presented to the user, and use starts after the user decides.

FIG. 12 is a graph showing how the average response time changes with the number of applications deployed. FIG. 12 shows the average response time against the number of applications deployed for the above three patterns.
It was confirmed that pattern 2 was filled in order from the cloud, and pattern 3 was filled in order from the edge. In pattern 1, when various requests are received, they are arranged by satisfying the user requirements.
As shown in FIG. 12, in pattern 2, all applications up to the 400th placement are placed in the cloud, so the average response time remains the slowest; once the cloud is full, the average gradually decreases.
In pattern 3, NAS.FT is placed starting from the user edge and MRI-Q starting from the carrier edge, so the average response time is the shortest. However, as the number of placements increases, applications are also placed in the cloud, and the average response time becomes slower. In pattern 1, the average response time is intermediate between patterns 2 and 3, and applications are placed according to user requests. Thus, in pattern 1, the average response time is appropriately reduced compared to pattern 2, in which the applications initially all go to the cloud.

In this way, the software is automatically adapted according to the deployment environment, and when automatically offloaded to the GPU, etc., it meets the user's cost and response time requirements. That is, the program is converted so that it can be processed by a device such as a GPU, and after the amount of resources to be assigned is determined, the converted application is optimally arranged.

To summarize, the data capacity, computational resource amount, bandwidth, and processing time used by the application are first set based on the performance test data obtained in the verification environment when the program was converted. The appropriate placement of applications is then calculated with a linear programming formula from the values set for each converted application and from values set in advance, such as server and link costs. When an application is placed, one of the user-specified price and response time requirements serves as the constraint and the other as the objective function. The appropriate placement is calculated by the linear programming solver, the proposed method presents the price and other results for placing resources at the calculated location to the user, and use starts after the user consents.

For applications that are automatically offloaded to GPUs and FPGAs, the appropriate allocation is calculated by changing the price conditions, response time conditions, and the number of applications requested by the user. This enables arrangement according to the user's request.

(Second embodiment)
Next, the offload server 1A etc. in the second embodiment of the present invention will be described.
The second embodiment is an example applied to FPGA automatic offloading of loop statements.
In the present embodiment, an FPGA (Field Programmable Gate Array) is described as an example of a PLD (Programmable Logic Device). The present invention is applicable to programmable logic devices in general.

(Explanation of principle)
Since it is difficult to predict which loops should be offloaded to an FPGA to increase speed, automatic measurement in a verification environment is proposed, as with the GPU. However, with an FPGA it takes several hours or more to compile OpenCL and run it on the actual machine, so measurements cannot be repeated many times. Therefore, measurement is performed after narrowing down the candidate loop statements to be offloaded to the FPGA. Specifically, from the loop statements found, those with high arithmetic intensity are extracted using an arithmetic intensity analysis tool such as ROSE (registered trademark). Furthermore, a profiling tool such as gcov (registered trademark) is used to extract loop statements with a high loop count.

OpenCL conversion is performed for candidate loop statements with high arithmetic intensity and a high loop count. When converted to OpenCL, the CPU processing program is divided into a kernel (FPGA) and a host (CPU) according to OpenCL syntax. For the candidate loop statements, the created OpenCL is precompiled to find loop statements with high resource efficiency. Since the resources that will be generated are known partway through compilation, the candidates are further narrowed down to loop statements that use a sufficiently small amount of resources.
Since some candidate loop statements remain, we use them to measure performance and power consumption. The selected single-loop statement is compiled and measured, and for the single-loop statement whose speed has been further improved, a combination pattern is created and the second measurement is performed. A pattern of short time and low power consumption is selected as a solution from among the measured patterns.

For FPGA offloading of loop statements, candidates are narrowed down using arithmetic intensity and the like, measurement is then performed, and the evaluation value of low-power patterns is raised, so that speed-up and power reduction are achieved automatically.

FIG. 13 is a functional block diagram showing a configuration example of the offload server 1A according to the second embodiment of the invention. In describing this embodiment, the same components as those in FIG.
The offload server 1A is a device that automatically offloads specific processing of an application to an accelerator.
Also, the offload server 1A can be connected to an emulator.
As shown in FIG. 13, the offload server 1A includes a control unit 21, an input/output unit 12, a storage unit 13, and a verification machine 14 (accelerator verification device).

The control unit 21 is an automatic offloading function that controls the entire offload server 1A. The control unit 21 is implemented, for example, by a CPU (not shown) expanding a program (offload program) stored in the storage unit 13 into a RAM and executing the program.

The control unit 21 includes an application code specification unit (Specify application code) 111, an application code analysis unit (Analyze application code) 112, a PLD processing designation unit 213, an arithmetic intensity calculation unit 214, a placement setting unit 170, a PLD processing pattern creation unit 215, a performance measurement unit 118, an execution file creation unit 119, a production environment deployment unit (Deploy final binary files to production environment) 120, a performance measurement test extraction execution unit (Extract performance test cases and run automatically) 121, and a user provision unit (Provide price and performance to a user to judge) 122.

<PLD processing designation unit 213>
The PLD processing designation unit 213 identifies the loop statements (repetition statements) of the application and, for each identified loop statement, creates and compiles a plurality of offload processing patterns in which pipeline processing and parallel processing in the PLD are designated by OpenCL.
The PLD processing designation unit 213 includes an offload range extraction unit (Extract offloadable area) 213a and an intermediate language file output unit (Output intermediate file) 213b.

The offload range extraction unit 213a identifies processing that can be offloaded to the FPGA, such as loop statements and FFT, and extracts an intermediate language corresponding to the offload processing.

The intermediate language file output unit 213b outputs the extracted intermediate language file 133. Intermediate language extraction is not a one-time process, but iterates to try and optimize executions for suitable offload region searches.

<Arithmetic intensity calculator 214>
The arithmetic intensity calculation unit 214 calculates the arithmetic intensity of the loop statements of the application using an arithmetic intensity analysis tool such as the ROSE framework (registered trademark). Arithmetic intensity is the number of floating-point operations (FLOPs) executed during program execution divided by the number of bytes accessed in main memory (FLOPs/memory access).
Arithmetic intensity is an index that increases as the number of calculations increases and decreases as the number of memory accesses increases, and processing with high arithmetic intensity is heavy processing for the processor. Therefore, the arithmetic intensity analysis tool is used to analyze the arithmetic intensity of the loop statements. The PLD processing pattern creation unit 215 narrows down the offload candidates to loop statements with high arithmetic intensity.

A calculation example of the arithmetic intensity is described.
Assume that floating-point calculation processing is performed 10 times (10 FLOPs) in one loop iteration and that the data used in the iteration is 2 bytes. When data of the same size is used in every iteration, the arithmetic intensity is 10/2=5 [FLOP/byte].
Since the arithmetic intensity does not take the loop count into account, in the present embodiment the loop count is also considered in addition to the arithmetic intensity when narrowing down candidates.
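The calculation example above can be written out directly. The function name `arithmetic_intensity` is an illustrative assumption; in the embodiment the values come from an analysis tool rather than from hand-counted operations.

```python
def arithmetic_intensity(flops_per_iteration, bytes_per_iteration):
    """Floating-point operations per byte of memory accessed."""
    return flops_per_iteration / bytes_per_iteration


# 10 FLOPs and 2 bytes per loop iteration give 5 FLOP/byte
print(arithmetic_intensity(10, 2))
```

Because both quantities scale with the number of iterations, the per-iteration value equals the value for the whole loop, which is why the loop count has to be measured separately.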

<PLD processing pattern generator 215>
Based on the arithmetic intensity calculated by the arithmetic intensity calculation unit 214, the PLD processing pattern creation unit 215 narrows down loop statements whose arithmetic intensity is higher than a predetermined threshold (hereinafter referred to as high arithmetic intensity as appropriate) as offload candidates, Create a PLD processing pattern.
As a basic operation, the PLD processing pattern creation unit 215 excludes loop statements (repetition statements) that cause compilation errors from the offload targets, and creates PLD processing patterns that specify whether or not to perform PLD processing for the loop statements that do not cause compilation errors.

・Loop count measurement function As a loop count measurement function, the PLD processing pattern creation unit 215 measures the loop counts of the loop statements of the application using a profiling tool, and narrows down to loop statements whose loop count exceeds a predetermined number (hereinafter referred to as a high loop count as appropriate). GNU coverage gcov or the like is used to obtain the loop counts. "GNU Profiler (gprof)" and "GNU Coverage (gcov)" are known as profiling tools. Either can be used because both can examine the number of executions of each loop.

In addition, since the number of loops is not particularly visible in arithmetic intensity analysis, a profiling tool is used to measure the number of loops in order to detect loops with a large number of loops and high load. Here, the level of arithmetic intensity indicates whether the processing is suitable for offloading to the FPGA, and the number of loops×arithmetic intensity indicates whether the load associated with offloading to the FPGA is high.
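The narrowing step can be sketched as a filter over per-loop measurements. The `LoopInfo` class, the threshold parameters, and the ordering of the survivors by loop count × arithmetic intensity are assumptions made for this example; the description states only that both thresholds are applied and that the product indicates load.

```python
from dataclasses import dataclass


@dataclass
class LoopInfo:
    name: str
    intensity: float  # FLOP/byte, e.g. from an analysis tool such as ROSE
    loop_count: int   # number of executions, e.g. from gcov


def narrow_candidates(loops, intensity_threshold, count_threshold):
    """Keep loops with high arithmetic intensity and a high loop count.

    Survivors are ordered by loop_count * intensity (an assumed ranking)
    so that the heaviest loops come first.
    """
    selected = [loop for loop in loops
                if loop.intensity > intensity_threshold
                and loop.loop_count > count_threshold]
    return sorted(selected,
                  key=lambda loop: loop.loop_count * loop.intensity,
                  reverse=True)
```

Applying both filters before any OpenCL compilation keeps the number of multi-hour FPGA builds small.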

- OpenCL (intermediate language) creation function The PLD processing pattern creation unit 215 creates OpenCL (OpenCL conversion) for offloading each narrowed loop statement to the FPGA as an OpenCL creation function. That is, the PLD processing pattern creation unit 215 compiles OpenCL that offloads the narrowed loop statements. In addition, the PLD processing pattern creation unit 215 lists loop statements whose performance is improved compared to the CPU among the measured performance, and creates OpenCL for offloading by combining the loop statements in the list.

OpenCL conversion is described below.
The PLD processing pattern creation unit 215 converts the loop statement into a high-level language such as OpenCL. First, the CPU processing program is divided into a kernel (FPGA) and a host (CPU) according to the grammar of a high-level language such as OpenCL. For example, when one of ten for statements is to be processed by the FPGA, that for statement is cut out as a kernel program and described according to the OpenCL grammar. A grammar example of OpenCL will be described later.

In addition, speed-up techniques can be incorporated when performing this division. In general, techniques for speeding up processing on an FPGA include local memory caching, stream processing, multiple instantiation, loop unrolling, merging of nested loop statements, and memory interleaving. These techniques are not necessarily effective for every loop statement, but they are commonly used for speeding up.

A kernel created according to the OpenCL C language grammar is executed on a device (eg FPGA) by a created host (eg CPU) side program using the OpenCL C language run-time API. The part that calls the kernel function hello() from the host side is to call clEnqueueTask(), which is one of the OpenCL runtime APIs.
The basic flow of initialization, execution, and termination of OpenCL written in host code is steps 1 to 13 below. Among these steps 1 to 13, steps 1 to 10 are procedures (preparations) until the kernel function hello() is called from the host side, and step 11 is execution of the kernel.

1. Platform identification Identify the platform on which OpenCL is running using the function clGetPlatformIDs( ), which provides the platform identification function defined in the OpenCL runtime API.

2. Device identification Use the function clGetDeviceIDs( ), which provides device identification functions defined in the OpenCL runtime API, to identify devices such as GPUs used in the platform.

3. Context Creation Using the function clCreateContext( ) that provides the context creation function defined in the OpenCL runtime API, an OpenCL context that serves as an execution environment for operating OpenCL is created.

4. Create Command Queue Create a command queue ready to control the device using the function clCreateCommandQueue( ) that provides the command queue creation functionality defined in the OpenCL runtime API. In OpenCL, the host issues commands to the device (issues a kernel execution command or a memory copy command between the host and the device) through the command queue.

5. Memory object creation Using the function clCreateBuffer(), which provides a function to allocate memory on the device defined in the OpenCL runtime API, create a memory object that allows the host to refer to the memory object.

6. Kernel file loading The kernel running on the device is controlled by the host program. Therefore, the host program must first load the kernel program. The kernel program includes binary data created by the OpenCL compiler and source code written in the OpenCL C language. Read this kernel file (description omitted). Note that the OpenCL runtime API is not used for kernel file reading.

7. Program Object Creation OpenCL recognizes a kernel program as a program object. This procedure is program object creation.
Using the function clCreateProgramWithSource( ), which provides the program object creation function defined in the OpenCL runtime API, create a program object that the host can reference. Use clCreateProgramWithBinary() when creating the program object from a compiled binary of a kernel program.

8. Build Build the program object registered as the source code using the OpenCL C compiler/linker.
A program object is built using the function clBuildProgram(), which performs a build with the OpenCL C compiler and linker defined in the OpenCL runtime API. Note that this compilation procedure is not required if a program object is created from a compiled binary string using clCreateProgramWithBinary().

9. Kernel Object Creation A kernel object is created using the function clCreateKernel( ) that provides the kernel object creation function defined in the OpenCL runtime API. One kernel object corresponds to one kernel function, so the kernel function name (hello) is specified when the kernel object is created. Also, when a plurality of kernel functions are described as one program object, one kernel object corresponds to one kernel function, so clCreateKernel( ) is called multiple times.

10. Kernel Argument Setting Kernel arguments are set using the function clSetKernelArg(), which provides the function of giving arguments to the kernel defined in the OpenCL runtime API (passing values to the arguments of kernel functions).
After steps 1 to 10 complete preparations, step 11 is entered to execute the kernel on the device from the host side.

11. Kernel Execution Because kernel execution acts on the device, it is performed by a function that queues a command into the command queue.
The function clEnqueueTask( ), which provides kernel execution functionality defined in the OpenCL runtime API, is used to queue a command to execute kernel hello on the device. After the command to execute kernel hello is queued, it will be executed in the executable arithmetic unit on the device.

12. Reading from a memory object Data is copied from the device-side memory area to the host-side memory area using the function clEnqueueReadBuffer(), which is defined in the OpenCL runtime API and copies data from device-side memory to host-side memory. Conversely, data is copied from the host-side memory area to the device-side memory area using the function clEnqueueWriteBuffer(), which copies data from host-side memory to device-side memory. Since these functions act on the device, the data copy starts only after the copy command has been queued in the command queue.

13. Releasing Objects Finally, release the various objects created so far.
The device execution of the kernel created according to the OpenCL C language has been described above.

• Resource Amount Calculation Function As a resource amount calculation function, the PLD processing pattern creation unit 215 precompiles the created OpenCL and calculates the resource amount to be used (“first resource amount calculation”). The PLD processing pattern creation unit 215 calculates the resource efficiency from the calculated arithmetic intensity and resource amount and, based on the calculated resource efficiency, selects c loop statements whose resource efficiency is higher than a predetermined value.
The PLD processing pattern creation unit 215 also precompiles the combined offload OpenCL and calculates the resource amount to be used (“second resource amount calculation”). Here, instead of precompiling, the sum of the resource amounts obtained in the precompilation before the first measurement may be used.

<Performance measurement unit 118>
The performance measurement unit 118 compiles the created PLD processing pattern application, places it in the verification machine 14, and executes performance measurement processing when offloaded to the PLD.

The performance measurement unit 118 executes the placed binary file, measures the performance when offloaded, and returns the performance measurement result to the offload range extraction unit 213a. In this case, the offload range extraction unit 213a extracts another PLD processing pattern, and the intermediate language file output unit 213b attempts performance measurement based on the extracted intermediate language (see symbol a in FIG. 2).

The performance measurement unit 118 includes a binary file placement unit (Deploy binary files) 118a. The binary file placement unit 118a deploys (places) an execution file derived from the intermediate language on the verification machine 14 having a GPU.

A specific example of performance measurement will be described.
The PLD processing pattern creation unit 215 narrows down the loop statements to those with high resource efficiency, and the executable file creation unit 119 compiles OpenCL for offloading the narrowed-down loop statements. The performance measurement unit 118 measures the performance of the compiled program (“first performance measurement”).

Then, the PLD processing pattern creation unit 215 lists the loop statements whose measured performance is better than on the CPU. The PLD processing pattern creation unit 215 creates OpenCL for offloading by combining the listed loop statements. The PLD processing pattern creation unit 215 precompiles the combined offload OpenCL and calculates the amount of resources to be used.
Note that, instead of precompiling, the sum of the resource amounts obtained in the precompilation before the first measurement may be used. The executable file creation unit 119 compiles the combined offload OpenCL, and the performance measurement unit 118 measures the performance of the compiled program (“second performance measurement”).

<Executable File Creation Unit 119>
The execution file creation unit 119 selects the PLD processing pattern with the highest evaluation value from a plurality of PLD processing patterns based on the measurement results of the processing time repeated a predetermined number of times, and compiles that PLD processing pattern to create an executable file.

The automatic offload operation of the offload server 1A configured as described above will be described below.
[Auto Offload Operation]
The offload server 1A of the present embodiment is an example in which elemental technology of environment-adaptive software is applied to FPGA automatic offloading of user application logic.
Description will be made with reference to the automatic offload processing of the offload server 1A shown in FIG.
As shown in FIG. 2, the offload server 1A is applied to elemental technology of environment adaptive software. The offload server 1A has a control unit (automatic offload function unit) 11, a test case DB 131, an intermediate language file 133, and a verification machine .
The offload server 1A acquires an application code 125 used by the user.

A user uses, for example, various devices (Device) 151, a device 152 having a CPU-GPU, a device 153 having a CPU-FPGA, and a device 154 having a CPU. The offload server 1A automatically offloads functional processing to the accelerators of the device 152 having a CPU-GPU and the device 153 having a CPU-FPGA.

The operation of each part will be described below with reference to the step numbers in FIG.
<Step S11: Specify application code>
In step S11, the application code specifying unit 111 (see FIG. 13) specifies the processing function (image analysis, etc.) of the service provided to the user. Specifically, the application code specifying unit 111 specifies the input application code.

<Step S12: Analyze application code>
In step S12, the application code analysis unit 112 (see FIG. 13) analyzes the source code of the processing function and grasps structures such as loop statements and the use of specific libraries (for example, FFT library calls).

<Step S13: Extract offloadable area>
In step S13, the PLD processing designation unit 213 (see FIG. 13) identifies loop statements (repetition statements) of the application, designates parallel processing or pipeline processing in the FPGA for each repetition statement, and compiles them with a high-level synthesis tool. Specifically, the offload range extraction unit 213a (see FIG. 13) identifies processing that can be offloaded to the FPGA, such as a loop statement, and extracts OpenCL as an intermediate language corresponding to the offload processing.

<Step S14: Output intermediate file>
In step S14, the intermediate language file output unit 213b (see FIG. 13) outputs the intermediate language file 133. Intermediate language extraction is not completed in one pass; it is repeated, with trial executions, to search for a suitable offload area.

<Step S15: Compile error>
In step S15, the PLD processing pattern creation unit 215 (see FIG. 13) creates a PLD processing pattern that excludes loop statements causing compilation errors from offloading and specifies, for the repetition statements that do not cause compilation errors, whether or not to perform FPGA processing.

<Step S21: Deploy binary files>
In step S21, the binary file placement unit 118a (see FIG. 13) deploys the execution file derived from the intermediate language to the verification machine 14 having an FPGA. The binary file placement unit 118a activates the placed file, executes an assumed test case, and measures performance when offloading.

<Step S22: Measure performance>
In step S22, the performance measurement unit 118 (see FIG. 13) executes the arranged file and measures the performance and power usage when offloading.
In order to make the area to be offloaded more appropriate, this performance measurement result is returned to the offload range extraction unit 213a, and the offload range extraction unit 213a extracts another pattern. Then, the intermediate language file output unit 213b attempts performance measurement based on the extracted intermediate language (see symbol a in FIG. 2). The performance measurement unit 118 repeats the performance/power consumption measurement in the verification environment and finally determines the code pattern to be deployed.

As indicated by symbol a in FIG. 2, the control unit 21 repeatedly executes steps S12 to S22. The automatic offload function of the control unit 21 is summarized below. The PLD processing designation unit 213 identifies loop statements (repetition statements) of the application, designates parallel processing or pipeline processing in the FPGA for each repetition statement in OpenCL (intermediate language), and compiles them with a high-level synthesis tool. The PLD processing pattern creation unit 215 then creates PLD processing patterns that exclude loop statements causing compilation errors from offloading and specify, for loop statements that do not cause compilation errors, whether or not to perform PLD processing. The binary file placement unit 118a compiles the application of each PLD processing pattern and places it on the verification machine 14, and the performance measurement unit 118 executes the performance measurement processing on the verification machine 14. Based on the performance measurement results repeated a predetermined number of times, the execution file creation unit 119 selects, from the plurality of PLD processing patterns, the pattern with the highest evaluation value (for example, evaluation value = (processing time)^{-1/2}) and compiles the selected pattern to create an executable file.
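The selection by evaluation value summarized above can be illustrated with a minimal Python sketch. It uses the example formula given in the text, evaluation value = (processing time)^{-1/2}. The function names, the pattern labels, and the processing times are hypothetical and are not taken from the embodiment.

```python
def evaluation_value(processing_time):
    """Evaluation value = (processing time)^(-1/2), the example formula in the text."""
    if processing_time <= 0:
        raise ValueError("processing time must be positive")
    return processing_time ** -0.5


def select_best_pattern(measured_times):
    """Return the pattern label whose evaluation value is highest.

    measured_times maps a pattern label to its measured processing time.
    """
    return max(measured_times, key=lambda p: evaluation_value(measured_times[p]))


# Hypothetical processing times in seconds for three PLD processing patterns.
measured = {"pattern_a": 4.0, "pattern_b": 1.0, "pattern_c": 16.0}
best = select_best_pattern(measured)
```

Because the evaluation value decreases monotonically with processing time, selecting the highest evaluation value is the same as selecting the shortest processing time.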

<Step S23: Deploy final binary files to production environment>
In step S23, the production-environment placement unit 120 determines a pattern specifying the final offload area, and deploys it to the production environment for the user.

<Step S24: Extract performance test cases and run automatically>
In step S24, the performance measurement test extraction execution unit 121 extracts performance test items from the test case DB 131 and automatically executes the extracted performance test in order to show the performance to the user after the execution file is arranged.

<Step S25: Provide price and performance to a user to judge>
In step S25, the user providing unit 122 presents information such as price and performance to the user based on the performance test results. Based on the presented information such as price and performance, the user decides to start using the service for a fee.

The above steps S21 to S25 are performed in the background when the user uses the service, and are assumed to be performed, for example, during the first day of provisional use. Also, the processing performed in the background for cost reduction may target only GPU/FPGA offload.

As described above, when applied to the elemental technology of environment-adaptive software, the control unit (automatic offload function unit) 21 of the offload server 1A extracts the area to be offloaded from the source code of the application used by the user and outputs the intermediate language in order to offload function processing (steps S12 to S15). The control unit 21 places and executes the execution file derived from the intermediate language on the verification machine 14 and verifies the offload effect (steps S21 to S22). After repeating the verification and determining an appropriate offload area, the control unit 21 deploys the executable file in the production environment that is actually provided to the user and provides it as a service (step S23).

In the above, the process flow for collectively performing code conversion, resource amount adjustment, and placement location adjustment required for environmental adaptation was explained, but it is not limited to this, and it is also possible to extract only the desired process. For example, when only code conversion for FPGA is desired, only necessary parts such as the environment adaptation function and verification environment in steps S21 to S25 may be used.

[FPGA automatic offload]
The code analysis described above analyzes the application code using a syntax analysis tool such as Clang. Code analysis is difficult to generalize because it must take the offload destination device into account. However, it is possible to grasp the structure of the code, such as loop statements and the reference relationships of variables, and to grasp that a function block performs FFT processing or calls a library that performs FFT processing. It is difficult for the offload server to determine function blocks automatically, but they can also be identified by similarity determination using a similar-code detection tool such as Deckard. Clang is a tool for C/C++, so a tool suited to the language being analyzed must be selected.

Also, when offloading application processing, considerations specific to each offload destination, such as a GPU, FPGA, or IoT GW, are required. In general, it is difficult to automatically discover the setting that maximizes performance in a single attempt. For this reason, offload patterns are tried by repeating performance measurement several times in a verification environment to find a pattern that speeds up the process.

An FPGA-oriented offload technique for application software loop statements will now be described.
[flowchart]
FIG. 14 is a flowchart for explaining the outline of the operation of the offload server 1A.
In step S201, the application code analysis unit 112 analyzes the source code of the application to be offloaded. The application code analysis unit 112 analyzes information on loop statements and variables according to the language of the source code.

In step S202, the PLD processing designation unit 213 identifies loop statements and reference relationships of the application.

Next, the PLD processing pattern creation unit 215 narrows down the identified loop statements to candidates for which FPGA offloading is to be tried. Arithmetic intensity is one indicator of whether offloading a loop statement is effective.
In step S203, the arithmetic intensity calculation unit 214 calculates the arithmetic intensity of the loop statements of the application using an arithmetic intensity analysis tool. Arithmetic intensity is an index that increases as the number of calculations increases and decreases as the number of accesses increases, and processing with high arithmetic intensity is heavy processing for the processor. Therefore, the arithmetic intensity analysis tool analyzes the arithmetic intensity of the loop statements and narrows down the offload candidates to loop statements with high arithmetic intensity.
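As a rough illustration of this narrowing, the following Python sketch treats arithmetic intensity as the ratio of the number of calculations to the number of memory accesses. This ratio is a common definition and is an assumption here; the text states only that the index rises with calculations and falls with accesses. The class, the function names, the counts, and the threshold are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LoopStats:
    name: str
    operations: int        # number of calculations in the loop (hypothetical count)
    memory_accesses: int   # number of memory accesses in the loop (hypothetical count)


def arithmetic_intensity(loop):
    """Assumed definition: calculations divided by memory accesses."""
    if loop.memory_accesses <= 0:
        raise ValueError("memory_accesses must be positive")
    return loop.operations / loop.memory_accesses


def high_intensity_loops(loops, threshold):
    """Keep the names of loops whose arithmetic intensity reaches the threshold."""
    return [l.name for l in loops if arithmetic_intensity(l) >= threshold]


loops = [
    LoopStats("loop1", operations=1000, memory_accesses=100),
    LoopStats("loop2", operations=300, memory_accesses=100),
    LoopStats("loop3", operations=50, memory_accesses=100),
]
candidates = high_intensity_loops(loops, threshold=2.0)
```

In practice the counts would come from an analysis tool rather than being entered by hand.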

Even loop statements with high arithmetic intensity are problematic if processing them on the FPGA consumes an excessive amount of FPGA resources. Therefore, the calculation of the resource amount required when a loop statement with high arithmetic intensity is processed on the FPGA is described next.
When compiling to FPGA, a high-level language such as OpenCL is converted to a hardware description level such as HDL, and based on this, actual wiring processing and the like are performed. At this time, wiring processing and the like take a lot of time, but the time up to the stage of the intermediate state such as HDL takes only minutes. Resources such as Flip Flop and Look Up Table used in FPGA can be known even at the stage of intermediate state such as HDL. Therefore, the amount of resources to be used can be known in a short time by looking at the intermediate state of HDL or the like, even if the compilation is not finished.

Therefore, in this embodiment, the PLD processing pattern creation unit 215 translates the target loop statement into a high-level language such as OpenCL, and first calculates the resource amount. Also, since the arithmetic intensity and the resource amount when the loop statement is offloaded are determined, the arithmetic intensity/resource amount or arithmetic intensity×loop count/resource amount is defined as the resource efficiency. Then, loop statements with high resource efficiency are further narrowed down as offload candidates.

Returning to the flow of FIG. 14, in step S204, the PLD processing pattern creation unit 215 measures the number of loops of loop statements of the application using profiling tools such as gcov and gprof.
In step S205, the PLD processing pattern creation unit 215 narrows down the loop statements with high arithmetic strength and high loop count among the loop statements.

In step S206, the PLD processing pattern creation unit 215 creates OpenCL for offloading each narrowed loop statement to the FPGA.

Here, the conversion of loop statements into OpenCL (creation of OpenCL) is explained further. Two processes are required when converting a loop statement into a high-level language such as OpenCL. One is to divide the CPU processing program into a kernel (FPGA) and a host (CPU) according to the grammar of the high-level language such as OpenCL. The other is to incorporate speed-up techniques when dividing. In general, techniques for speeding up processing on an FPGA include local memory caching, stream processing, multiple instantiation, loop unrolling, integration of nested loop statements, and memory interleaving. These techniques are not effective for every loop statement, but they are commonly used for speeding up.

Next, since several loop statements with high resource efficiency have been selected, they are used to create as many offload patterns as there are performance measurements to be made. There are two ways to speed up processing on an FPGA: concentrating the FPGA resources on a single process, and distributing the FPGA resources among a plurality of processes. First, a certain number of patterns that each offload a single selected loop statement are created and precompiled as a preliminary step before running on the actual FPGA.

In step S207, the PLD processing pattern creation unit 215 pre-compiles the created OpenCL and calculates the resource amount to be used ("first resource amount calculation").

In step S208, the PLD processing pattern creation unit 215 narrows down loop statements with high resource efficiency.

In step S209, the execution file creation unit 119 compiles OpenCL for offloading the narrowed down loop statements.

In step S210, the performance measurement unit 118 measures the performance of the compiled program ("first performance measurement"). Since some candidate loop statements remain, the performance measurement unit 118 uses them to actually measure the performance (see the subroutine in FIG. 15 for details).

In step S211, the PLD processing pattern creation unit 215 lists the loop statements whose performance is improved compared to the CPU among the performance-measured ones.

In step S212, the PLD processing pattern creation unit 215 creates OpenCL for offloading by combining the loop statements of the list.
In step S213, the PLD processing pattern creation unit 215 calculates the amount of resources to be used by precompiling with the combined offload OpenCL (“second resource amount calculation”). Note that the sum of resource amounts in precompilation before the first measurement may be used without precompilation. By doing so, the number of times of precompilation can be reduced.

In step S214, the execution file creation unit 119 compiles the combined offload OpenCL.

In step S215, the performance measurement unit 118 measures the performance of the compiled program ("second performance measurement"). The performance measurement unit 118 compiles and measures the selected single-loop statements, creates combination patterns of the single-loop statements that were accelerated, and performs the second performance measurement (for details, see the subroutine in FIG. 15).

In step S216, the production environment placement unit 120 selects the pattern with the highest performance among the first and second measurements, and terminates the processing of this flow. The pattern with the shortest processing time among the measured patterns is selected as the solution.

In this way, the FPGA automatic offloading of loop statements creates offload patterns by focusing on loop statements with high arithmetic strength, loop counts, and high resource efficiency, and searches for high-speed patterns through actual measurements in a verification environment (Fig. 14).

FIG. 15 is a flowchart showing the performance/power consumption measurement processing of the performance measurement unit 118. This flow is called and executed by a subroutine call in step S210 or step S215.

In step S301, the performance measurement unit 118 measures the processing time required for FPGA offloading.

In step S302, the performance measurement unit 118 sets an evaluation value based on the measured processing time.

In step S303, the performance measurement unit 118 measures the performance of patterns with high evaluation values, where a higher evaluation value is treated as a higher fitness, and returns to step S210 or step S215.

[Example of offload pattern creation]
FIG. 16 is a diagram showing a search image of the PLD processing pattern creation unit 215.
The control unit (automatic offload function unit) 21 (see FIG. 13) analyzes the application code 125 (see FIG. 2) used by the user and checks, from the code patterns 241 of the application code 125, whether each for statement can be parallelized. As indicated by symbol r in FIG. 16, when four for statements are found from the code pattern 241, one digit is assigned to each for statement; here, four digits of 1 or 0 are assigned to the four for statements. A digit is set to 1 when FPGA processing is performed and to 0 when FPGA processing is not performed (that is, when processing is performed by the CPU).
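The digit assignment described above can be sketched in Python as follows: each for statement receives one digit, 1 for FPGA processing and 0 for CPU processing. The function names and the zero-based loop indices are illustrative assumptions.

```python
def encode_pattern(num_loops, offloaded):
    """Build a 0/1 string with one digit per for statement.

    offloaded is a collection of zero-based indices of loops processed by the FPGA.
    """
    return "".join("1" if i in offloaded else "0" for i in range(num_loops))


def decode_pattern(pattern):
    """Return the zero-based indices of the loops marked for FPGA processing."""
    return [i for i, bit in enumerate(pattern) if bit == "1"]


def single_loop_patterns(num_loops, candidates):
    """One pattern per candidate loop, each offloading exactly that loop."""
    return [encode_pattern(num_loops, {i}) for i in candidates]


# Four for statements, with the 1st, 3rd and 4th kept as offload candidates.
patterns = single_loop_patterns(4, [0, 2, 3])
```

With four for statements there are 2^4 = 16 possible patterns in total, which is why candidates are narrowed down before measurement.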

[Flow from C code to search for OpenCL final solution]
Procedures A to F in FIG. 17 are diagrams for explaining the flow from the C code to the search for the final OpenCL solution.
The application code analysis unit 112 (see FIG. 13) parses the "C code" shown in procedure A of FIG. 17, and the “loop statement, variable information” shown in procedure B in FIG. 17 is specified (see symbol t in FIG. 17).

The arithmetic intensity calculation unit 214 (see FIG. 13) performs arithmetic intensity analysis on the specified "loop statement, variable information" using an arithmetic intensity analysis tool (see symbol u in FIG. 17). The PLD processing pattern creation unit 215 narrows down loop statements with high arithmetic intensity to offload candidates. Furthermore, the PLD processing pattern creation unit 215 performs profiling analysis using a profiling tool to further narrow down loop statements with high arithmetic intensity and high loop count.

Then, the PLD processing pattern creation unit 215 creates OpenCL for offloading each narrowed loop statement to the FPGA (OpenCL conversion) (see symbol v in FIG. 17).
In addition, speed-up techniques such as loop expansion (unrolling) are introduced together with code division when converting to OpenCL (described later).

<Concrete example of “High arithmetic intensity, OpenCL conversion” (Part 1): Procedure C>
For example, if 4 for statements (assignment of 4 digits 1 or 0) are found from the code pattern 241 (see FIG. 16) of the application code 130, the arithmetic intensity analysis narrows down (chooses) 3 of them. That is, as indicated by symbol u in FIG. 17, the offload patterns "1000", "0010", and "0001" of the three for statements are narrowed down from the four for statements.

<Example of "expansion" executed with code division when converting to OpenCL>
For a loop statement
for(k=0; k<10; k++){
}
written on the CPU program side when data is transferred from the FPGA to the CPU, #pragma unroll is specified above the loop statement. That is, it is written as
#pragma unroll
for(k=0; k<10; k++){
}

If unrolling is specified with a syntax suited to the Intel or Xilinx (registered trademark) tools, such as #pragma unroll, the iterations in the above expansion example (k = 0, k = 1, k = 2, ...) are expanded and can be executed in a pipeline. This uses about ten times the resources, but may increase the speed.
The number of iterations to unroll can also be specified, for example as 5, instead of the total number of loop iterations.
This completes the description of the "expansion" example.

Next, the PLD processing pattern creation unit 215 uses the resource amount to further narrow down the loop statements with high arithmetic intensity that were selected as offload candidates. That is, the PLD processing pattern creation unit 215 calculates the resource amount, analyzes the resource efficiency (= arithmetic intensity / resource amount during FPGA processing, or arithmetic intensity × number of loops / resource amount during FPGA processing) of each offload candidate, and extracts the loop statements with high resource efficiency.

At symbol v in FIG. 17, the PLD processing pattern creation unit 215 compiles (<precompiles>) OpenCL for offloading the narrowed loop statements.

<Concrete example of “High arithmetic intensity, OpenCL conversion” (Part 2)>
As indicated by symbol u in FIG. , 0010, and 0001.
In the above, "High arithmetic strength, OpenCL conversion" shown in the procedure C of FIG. 17 has been described.

The performance measurement unit 118 measures the performance of the compiled program for the "resource-efficient loop statement" shown in procedure D of FIG. 17 ("first performance measurement").
Then, the PLD processing pattern creation unit 215 lists the loop statements whose measured performance is better than on the CPU. Thereafter, the resource amount is calculated, the offload OpenCL is compiled, and the performance of the compiled program is measured in the same manner.

<Concrete example of “High arithmetic intensity, OpenCL conversion” (Part 3)>
As indicated by symbol w in FIG. 17, the first measurement is performed for three offload patterns "1000", "0010", and "0001". If two performances of "1000" and "0010" are high among the three measurements, the second measurement is performed for the combination of "1000" and "0010".

At symbol x in FIG. 17, the executable file creation unit 119 compiles (<main compile>) OpenCL for offloading the narrowed loop statements.

"Combination pattern actual measurement" shown in procedure E of FIG. 17 refers to measuring a candidate loop statement alone, and then measuring a verification pattern with its combination.

<Concrete example of “High arithmetic intensity, OpenCL conversion” (Part 4)>
As indicated by symbol y in FIG. 17, the second measurement is performed for "1010", which is the combination of "1000" and "0010". Suppose that, as a result of the two rounds of measurement, "0010" is the fastest among the first and second measurements. In that case, "0010" is the final solution. Note that the combination pattern sometimes cannot be measured because of resource limitations. In that case, the combination may be skipped and the fastest single-loop result simply selected.

At symbol z in FIG. 17, the performance measurement unit 118 selects (<selects>) "0010" with the best maximum speed between the first measurement and the second measurement.

As a result, "0010" (see symbol aa in FIG. 17) of the "OpenCL final solution" shown in procedure F of FIG. 17 was selected.
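The two-round search in procedures D to F can be sketched in Python as follows. This is a simplified sketch under stated assumptions, not the embodiment's implementation: the processing times are hypothetical, `measure` is a hypothetical callback that stands in for compiling and measuring on the verification machine, and it returns None when the combination cannot be measured because of resource limitations.

```python
def combine(patterns):
    """Merge equal-length 0/1 patterns: a digit is 1 if any input pattern has 1 there."""
    width = len(patterns[0])
    return "".join(
        "1" if any(p[i] == "1" for p in patterns) else "0" for i in range(width)
    )


def search_final_pattern(single_times, cpu_time, measure):
    """Pick the fastest pattern across the first and second measurements.

    single_times: pattern -> processing time from the first measurement.
    cpu_time: processing time when everything runs on the CPU.
    measure: callback(pattern) -> processing time, or None if it cannot be measured.
    """
    results = dict(single_times)
    # List the single-loop patterns that were faster than the CPU.
    improved = [p for p, t in single_times.items() if t < cpu_time]
    if len(improved) >= 2:
        combo = combine(improved)
        combo_time = measure(combo)          # second measurement
        if combo_time is not None:           # skipped when resources are insufficient
            results[combo] = combo_time
    return min(results, key=results.get)


# Hypothetical first-measurement times in seconds; the CPU-only time is 10.0.
first = {"1000": 6.0, "0010": 3.0, "0001": 12.0}
final = search_final_pattern(first, cpu_time=10.0, measure=lambda p: 5.0)
```

With these hypothetical times, the combination "1010" is slower than "0010" alone, which corresponds to the situation in the example where "0010" becomes the final solution.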

<deploy (deployment)>
Deploy again to the production environment with the PLD processing pattern of the highest processing performance of the OpenCL final solution and provide it to the user.

[Example of implementation]
An implementation example is explained.
An FPGA such as the Intel PAC with Intel Arria10 GX FPGA can be used.
Intel Acceleration Stack (Intel FPGA SDK for OpenCL, Quartus Prime Version) or the like can be used for FPGA processing.
Intel FPGA SDK for OpenCL is a high-level synthesis tool (HLS) that interprets #pragma for Intel in addition to standard OpenCL.
The implementation example interprets the OpenCL code that describes the kernel processed by the FPGA and the host program processed by the CPU, outputs information such as the amount of resources, and performs the wiring work of the FPGA, etc., so that it can operate on the FPGA. Even a small program of about 100 lines takes a long time of about 3 hours to be able to operate on an actual FPGA. However, when the amount of resources is exceeded, an error occurs early. Also, when the OpenCL code cannot be processed by the FPGA, an error is output after several hours.

In the implementation example, when there is a request to use a C/C++ application, the code of the C/C++ application is first analyzed to find the for statements and to grasp the program structure, such as the variable data used in the for statements. The LLVM/Clang syntax analysis library can be used for syntax analysis.

The implementation example then runs an arithmetic intensity analysis tool to obtain the arithmetic intensity, an index determined by the number of computations, the number of accesses, and so on, in order to estimate the FPGA offload effect of each loop statement. The ROSE framework or the like can be used for arithmetic intensity analysis. Only loop statements with high arithmetic intensity are targeted.
Next, a profiling tool such as gcov is used to obtain the loop count of each loop. Candidates are narrowed down to the loop statements with the highest values of arithmetic intensity multiplied by loop count.
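The narrowing by arithmetic intensity multiplied by loop count can be sketched as follows. Keeping the top a candidates is an assumption based on the later reference to precompiling "a" OpenCL codes. The loop names and numbers are hypothetical; in practice they would come from an arithmetic intensity analysis tool and a profiler such as gcov.

```python
def narrow_candidates(loops, a):
    """Return the names of the a loops with the highest intensity x loop count.

    loops maps a loop name to a tuple (arithmetic_intensity, loop_count).
    """
    ranked = sorted(
        loops,
        key=lambda name: loops[name][0] * loops[name][1],
        reverse=True,
    )
    return ranked[:a]


# Hypothetical analysis results: (arithmetic intensity, loop count).
loops = {
    "loop1": (10.0, 1000),   # score 10000
    "loop2": (3.0, 50),      # score 150
    "loop3": (5.0, 4000),    # score 20000
    "loop4": (0.5, 100),     # score 50
}
top = narrow_candidates(loops, a=3)
```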

In the implementation example, OpenCL code for FPGA offloading is then generated for each loop statement with high arithmetic intensity. The OpenCL code is obtained by dividing the program into the corresponding loop statement as the FPGA kernel and the remainder as the CPU host program. When the FPGA kernel code is created, the loop statement may be expanded (unrolled) by a fixed number b as a speed-up technique. Loop expansion increases the amount of resources but is effective for speeding up. Therefore, the number of expansions is limited to the fixed number b so that the amount of resources does not grow excessively.

In the implementation example, the Intel FPGA SDK for OpenCL is then used to precompile the a OpenCL codes and to calculate the amount of resources to be used, such as Flip Flops and Look Up Tables. The used resource amount is displayed as a percentage of the total resource amount. The resource efficiency of each loop statement is then calculated from the arithmetic intensity and the resource amount, or from the arithmetic intensity, the number of loops, and the resource amount. For example, a loop statement with an arithmetic intensity of 10 and a resource amount of 0.5 has a resource efficiency of 10/0.5 = 20, and a loop statement with an arithmetic intensity of 3 and a resource amount of 0.3 has a resource efficiency of 3/0.3 = 10, so the former is higher. Alternatively, this value multiplied by the number of loops may be used as the resource efficiency. Among the loop statements, the c with the highest resource efficiency are selected.
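The resource efficiency calculation and the selection of c loop statements can be sketched in Python using the numbers from the example above. The function names and the data layout are hypothetical, and the resource amounts are assumed to have been obtained by precompilation.

```python
def resource_efficiency(arithmetic_intensity, resource_amount, loop_count=None):
    """Arithmetic intensity / resource amount, optionally multiplied by the loop count."""
    if resource_amount <= 0:
        raise ValueError("resource_amount must be positive")
    efficiency = arithmetic_intensity / resource_amount
    return efficiency if loop_count is None else efficiency * loop_count


def select_top_c(loops, c):
    """Return the names of the c loops with the highest resource efficiency.

    loops maps a loop name to a tuple (arithmetic_intensity, resource_amount).
    """
    ranked = sorted(
        loops,
        key=lambda name: resource_efficiency(*loops[name]),
        reverse=True,
    )
    return ranked[:c]


# Values from the example in the text: 10/0.5 = 20 and 3/0.3 = 10.
loops = {"loop_a": (10.0, 0.5), "loop_b": (3.0, 0.3)}
best = select_top_c(loops, c=1)
```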

In the implementation example, the patterns to be measured are then created with the c loop statements as candidates. For example, if the 1st and 3rd loops are highly resource efficient, an OpenCL pattern that offloads the 1st loop and an OpenCL pattern that offloads the 3rd loop are each created and compiled, and their performance is measured. If several single-loop offload patterns are faster (for example, if both the 1st and the 3rd are faster), an OpenCL pattern for that combination (for example, a pattern that offloads both the 1st and the 3rd) is created and compiled, and its performance is measured.

Note that when single loops are combined, the amounts of resources used are also combined. Therefore, a combination pattern is not created if it does not fit within the upper limit. Once d patterns, including combinations, have been created, performance is measured on a server equipped with an FPGA in the verification environment. For the performance measurement, the sample processing specified by the application to be accelerated is executed. For example, in the case of a Fourier transform application, performance is measured using transform processing on sample data as a benchmark.
Finally, in the implementation example, the fastest of the measured patterns is selected as the solution.
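The pattern creation and selection steps above can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names, the loop numbers, the timing values, and the resource fractions are all hypothetical, and the combined resource amount is assumed here to be the simple sum of the single-loop amounts.

```python
from itertools import combinations


def build_patterns(single_times, resource_use, cpu_time, resource_limit=1.0):
    """Create single-loop patterns plus combinations of the loops that got faster.

    single_times: loop id -> measured time when only that loop is offloaded.
    resource_use: loop id -> fraction of FPGA resources the loop needs.
    A combination is skipped when its summed resources exceed resource_limit.
    """
    patterns = [frozenset([loop]) for loop in single_times]
    faster = sorted(l for l, t in single_times.items() if t < cpu_time)
    for size in range(2, len(faster) + 1):
        for combo in combinations(faster, size):
            if sum(resource_use[l] for l in combo) <= resource_limit:
                patterns.append(frozenset(combo))
    return patterns


def select_fastest(measured):
    """Return the pattern with the shortest measured processing time."""
    return min(measured, key=measured.get)


# Hypothetical measurements: loops 1 and 3 are both faster than the CPU-only run.
single_times = {1: 8.0, 3: 7.0}
resource_use = {1: 0.4, 3: 0.3}
patterns = build_patterns(single_times, resource_use, cpu_time=10.0)

# Hypothetical measured times for every created pattern.
measured = {
    frozenset({1}): 8.0,
    frozenset({3}): 7.0,
    frozenset({1, 3}): 5.5,
}
solution = select_fastest(measured)
print(sorted(solution))  # prints [1, 3]
```

In a real flow the values in `measured` would come from compiling each pattern and running the sample processing on the FPGA verification server; here they are fixed so the selection step can be shown in isolation.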

In the second embodiment, the same "resource amount determination and allocation determination" as described in the first embodiment are executed (description omitted).

[evaluation]
The evaluation is described below.
[FPGA automatic offload of loop statement] of the second embodiment can be evaluated in the same manner as [GPU automatic offload of loop statement] of the first embodiment.

<Evaluation target>
In [FPGA automatic offloading of loop statement] of the second embodiment, the evaluation target is MRI-Q of MRI (Magnetic Resonance Imaging) image processing.
MRI-Q computes a matrix Q that represents the scanner configuration used in the non-Cartesian spatial 3D MRI reconstruction algorithm. MRI-Q is written in C language, executes three-dimensional MRI image processing during performance measurement, and measures processing time with Large (maximum) 64×64×64 size data. CPU processing uses C language, and FPGA processing is based on OpenCL.

<Evaluation method>
The code of the target application is input, and offloading of the loop statements recognized by Clang or the like to the destination GPU or FPGA is tried in order to determine the offload pattern. At this time, the processing time and power consumption are measured. For the final offload pattern, the change in power consumption over time is obtained, and the reduction in power consumption compared with the case where all processing is performed by the CPU is confirmed.
In the [FPGA automatic offloading of loop statement] of the second embodiment, GA is not performed, and arithmetic intensity or the like is used to narrow down the measurement patterns to four patterns.
Number of loop statements eligible for offloading: 16 for MRI-Q
Pattern conformity: the shorter the processing time, the higher the evaluation value and thus the higher the degree of conformity. In the MRI-Q of the second embodiment as well, cost and response time can be improved in the manner shown in FIG. 12, compared with simply prioritizing low price and response time.

[Hardware configuration]
The offload servers according to the first and second embodiments are implemented by a computer 900, which is a physical device configured as shown in FIG. 18, for example.
FIG. 18 is a hardware configuration diagram showing an example of a computer that implements the functions of the offload servers 1 and 1A. The computer 900 has a CPU 901, a RAM 902, a ROM 903, an HDD 904, an accelerator 905, an input/output interface (I/F) 906, a media interface (I/F) 907, and a communication interface (I/F) 908.

The accelerator 905 is an accelerator (device) that processes at least one of data from the communication I/F 908 and data from the RAM 902 at high speed. For example, the accelerator 905 is an accelerator for the device 151, the device 152 having a CPU-GPU, the device 153 having a CPU-FPGA, and the device 154 having a CPU in FIG.
As the accelerator 905, a type (look-aside type) that returns the execution result to the CPU 901 or the RAM 902 after executing the processing from the CPU 901 or the RAM 902 may be used. On the other hand, as the accelerator 905, a type (in-line type) that performs processing by entering between the communication I/F 908 and the CPU 901 or the RAM 902 may be used.

The accelerator 905 is connected to the external device 915 via the communication I/F 908. The input/output I/F 906 is connected to the input/output device 916. The media I/F 907 reads data from and writes data to the recording medium 917.

The CPU 901 operates based on programs stored in the ROM 903 or the HDD 904, and performs control by each processing unit of the offload servers 1 and 1A shown in the drawings by executing a program (also called an application, or app for short) read into the RAM 902. This program can be distributed via a communication line, or recorded on a recording medium 917 such as a CD-ROM and distributed.
The ROM 903 stores a boot program executed by the CPU 901 when the computer 900 is started, a program depending on the hardware of the computer 900, and the like.

The CPU 901 controls, via the input/output I/F 906, an input/output device 916 comprising an input unit such as a mouse and keyboard, and an output unit such as a display and printer. The CPU 901 acquires data from the input/output device 916 via the input/output I/F 906 and outputs the generated data to the input/output device 916 . A GPU (Graphics Processing Unit) or the like may be used together with the CPU 901 as a processor.

The HDD 904 stores programs executed by the CPU 901 and data used by the programs. The communication I/F 908 receives data from other devices via a communication network (for example, NW (Network)) and outputs the data to the CPU 901, and also transmits data generated by the CPU 901 to other devices via the communication network.

The media I/F 907 reads programs or data stored in the recording medium 917 and outputs them to the CPU 901 via the RAM 902. The CPU 901 loads a program related to target processing from the recording medium 917 onto the RAM 902 via the media I/F 907, and executes the loaded program. The recording medium 917 is, for example, an optical recording medium such as a DVD (Digital Versatile Disc) or PD (Phase change rewritable Disk), a magneto-optical recording medium such as an MO (Magneto Optical disk), a magnetic recording medium, a conductor memory tape medium, a semiconductor memory, or the like.

For example, when the computer 900 functions as the offload servers 1 and 1A according to the first and second embodiments, the CPU 901 of the computer 900 realizes the functions of the offload servers 1 and 1A by executing the programs loaded on the RAM 902. The data in the RAM 902 is stored in the HDD 904. The CPU 901 reads a program related to target processing from the recording medium 917 and executes it. In addition, the CPU 901 may read a program related to target processing from another device via a communication network.

[effect]
As described above, the offload server 1 (see FIG. 1) according to the first embodiment is an offload server that offloads specific processing of an application program to an accelerator, and includes: an application code analysis unit 112 that analyzes the source code of the application program; a data transfer specification unit 113 that analyzes the reference relationships of variables used in the loop statements of the application program and, for data that may be transferred outside a loop, specifies the data transfer using an explicit specification line that explicitly specifies data transfer outside the loop; a parallel processing specification unit 114 that identifies the loop statements of the application program, specifies a parallel processing specification statement in the accelerator for each identified loop statement, and compiles it; a parallel processing pattern creation unit 117 that excludes loop statements causing a compile error from the offload targets and creates parallel processing patterns that specify whether or not to perform parallel processing for the loop statements that do not cause a compile error; a performance measurement unit 118 that compiles the application program of a parallel processing pattern, places it in an accelerator verification device, and executes processing for measuring performance when offloaded to the accelerator; and a placement setting unit 170 that, when the converted application program is placed on one of a cloud server, a carrier edge server, and a user edge server on the network in accordance with the cost or response time conditions specified by the user, calculates and sets the placement location of the application program based on a linear programming formula in which device and link costs, computational resource upper limits, and bandwidth upper limits are constraint conditions and the computational resource cost or the response time is the objective function.

By doing so, an application automatically offloaded to an accelerator such as a GPU or FPGA can be appropriately placed on one of a cloud server, a carrier edge server, and a user edge server on the network, even when the price conditions and response time conditions requested by the user, the number of applications placed, and the like are varied. This allows the converted application to satisfy the computational resource cost or response time requirements and to be placed optimally according to the user's request.

The offload server 1A (see FIG. 13) according to the second embodiment is an offload server that offloads specific processing of an application program to a PLD, and includes: an application code analysis unit 112 that analyzes the source code of the application program; a PLD processing designation unit 213 that identifies the loop statements of the application program and, for each identified loop statement, creates and compiles a plurality of offload processing patterns in which pipeline processing and parallel processing in the PLD are specified in OpenCL; an arithmetic intensity calculation unit 214 that calculates the arithmetic intensity of the loop statements of the application program; a PLD processing pattern creation unit 215 that, based on the arithmetic intensity calculated by the arithmetic intensity calculation unit 214, narrows down loop statements whose arithmetic intensity is higher than a predetermined threshold as offload candidates and creates PLD processing patterns; a performance measurement unit 118 that compiles the application program of a created PLD processing pattern, places it in an accelerator verification device, and executes processing for measuring performance when offloaded to the PLD; and a placement setting unit 170 that, when the converted application program is placed on one of a cloud server, a carrier edge server, and a user edge server on the network in accordance with the cost or response time conditions specified by the user, calculates and sets the placement location of the application program based on a linear programming formula in which device and link costs, computational resource upper limits, and bandwidth upper limits are constraint conditions and the computational resource cost or the response time is the objective function.

By doing so, the patterns that are actually measured are narrowed down before they are placed in the verification environment, compiled, and measured on the actual PLD (for example, an FPGA), so the number of performance measurements can be reduced. This enables fast automatic offloading of the loop statements of an application to the PLD. Then, for the converted application, an appropriate placement on one of a cloud server, a carrier edge server, and a user edge server on the network is calculated while the price conditions and response time conditions requested by the user, the number of applications placed, and the like are varied. This allows the converted application to satisfy the computational resource cost or response time requirements and to be placed optimally according to the user's request.

In the offload servers 1 and 1A according to the first and second embodiments, the placement setting unit 170 is characterized by calculating the placement that minimizes the cost of computational resources or the placement that minimizes the response time when the application program is placed on the server.

By doing so, the transformed application can be optimally deployed to meet the computational resource cost or response time requirements.

In the offload servers 1 and 1A according to the first and second embodiments, the placement setting unit 170 is characterized by calculating the placement according to a linear programming formula.

By doing so, for example, in the case of a request for placement within a certain cost per month, minimization of the response time in expression (1) of [Equation 1] becomes the objective function, and keeping the computational resource cost of expression (2) of [Equation 1] within the requested amount becomes one of the constraints. Thus, the converted application can be optimally placed so as to satisfy the computational resource cost requirement.

In the offload servers 1 and 1A according to the first and second embodiments, the placement setting unit 170 is characterized by calculating the placement that minimizes the response time when the application program is placed on the server, according to the linear programming formula shown in [Equation 2].

By doing so, for example, in the case of a request for placement within a certain number of seconds of application response time, minimization of the cost in expression (5) of [Equation 2] becomes the objective function, and keeping the response time of expression (6) of [Equation 2] within the requested number of seconds becomes one of the constraints. Thus, the converted application can be optimally placed so as to satisfy the user's response time requirement.
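The two kinds of request described for [Equation 1] and [Equation 2] (a cost bound with response time as the objective, and a response time bound with cost as the objective) can be illustrated with the following small sketch. It is not the linear programming formulation of the embodiment: it is a brute-force enumeration over a handful of candidate sites, and the `Site` fields, the `choose_site` function, and every number are hypothetical values chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    name: str
    monthly_cost: float     # hypothetical cost of running the application here
    response_time: float    # hypothetical response time in seconds
    free_resource: float    # remaining computational resource
    free_bandwidth: float   # remaining link bandwidth


def choose_site(sites, objective, *, max_cost=None, max_response=None,
                need_resource=0.0, need_bandwidth=0.0):
    """Pick the feasible site that minimizes 'cost' or 'response'.

    Resource and bandwidth upper limits, and the user-specified bound on
    cost or response time, are treated as constraints.
    """
    if objective not in ("cost", "response"):
        raise ValueError("objective must be 'cost' or 'response'")
    feasible = [
        s for s in sites
        if s.free_resource >= need_resource
        and s.free_bandwidth >= need_bandwidth
        and (max_cost is None or s.monthly_cost <= max_cost)
        and (max_response is None or s.response_time <= max_response)
    ]
    if not feasible:
        return None  # no placement satisfies the constraints
    if objective == "cost":
        return min(feasible, key=lambda s: s.monthly_cost)
    return min(feasible, key=lambda s: s.response_time)


sites = [
    Site("cloud", monthly_cost=50.0, response_time=3.0, free_resource=8.0, free_bandwidth=100.0),
    Site("carrier_edge", monthly_cost=80.0, response_time=1.5, free_resource=4.0, free_bandwidth=50.0),
    Site("user_edge", monthly_cost=120.0, response_time=0.5, free_resource=1.0, free_bandwidth=10.0),
]

# Response time bound, cost minimized.
by_cost = choose_site(sites, "cost", max_response=2.0, need_resource=1.0)
# Cost bound, response time minimized.
by_response = choose_site(sites, "response", max_cost=100.0, need_resource=1.0)
print(by_cost.name, by_response.name)  # prints carrier_edge carrier_edge
```

With many applications and shared links, the placement decisions interact and enumeration no longer scales, which is where a linear programming solver would be used instead.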

The present invention is an offload program for causing a computer to function as the above offload server.

By doing so, each function of the offload servers 1 and 1A can be realized using a general computer.

Further, among the processes described in each of the above embodiments, all or part of the processes described as being performed automatically can be performed manually, and all or part of the processes described as being performed manually can be performed automatically by a known method. In addition, information including processing procedures, control procedures, specific names, and various data and parameters shown in the above description and drawings can be arbitrarily changed unless otherwise specified.
Also, each component of each illustrated device is functionally conceptual, and does not necessarily need to be physically configured as illustrated. In other words, the specific form of distribution and integration of each device is not limited to the illustrated one, and all or part of the devices can be functionally or physically distributed or integrated in arbitrary units according to various loads and usage conditions.

In addition, each of the above configurations, functions, processing units, processing means, and the like may be realized in hardware, for example, by designing a part or all of them as an integrated circuit. Further, each configuration, function, and the like described above may be realized by software in which a processor interprets and executes a program for realizing each function. Information such as programs, tables, and files that realize each function can be held in a memory, a recording device such as a hard disk or an SSD (Solid State Drive), or a recording medium such as an IC (Integrated Circuit) card, an SD (Secure Digital) card, or an optical disc.

In addition, in the present embodiment, a genetic algorithm (GA) technique is used in order to find a solution to a combinatorial optimization problem within a limited optimization period, but the optimization technique may be of any kind. For example, local search, dynamic programming, or a combination thereof may be used.

Also, in this embodiment, the OpenACC compiler for C/C++ is used, but any compiler can be used as long as it can offload GPU processing. For example, Java lambda (registered trademark) GPU processing, IBM Java 9 SDK (registered trademark) may be used. Note that the parallel processing specification statement depends on these development environments.
For example, in Java (registered trademark), parallel processing can be described in the lambda format since Java 8. IBM (registered trademark) provides a JIT compiler that offloads lambda-style parallel processing descriptions to the GPU. In Java, similar offloading is possible by using these to perform tuning in GA as to whether or not loop processing should be in the lambda format.

Also, in the present embodiment, the for statement is exemplified as the iterative statement (loop statement), but the while statement and the do-while statement other than the for statement are also included. However, the for statement, which specifies loop continuation conditions, etc., is more suitable.

1, 1A offload server
11, 21 control unit
12 input/output unit
13 storage unit
14 verification machine (accelerator verification device)
111 application code specification unit
112 application code analysis unit
113 data transfer specification unit
114 parallel processing specification unit
114a, 213a offload range extraction unit
114b, 213b intermediate language file output unit
115 resource ratio determination unit
116 resource amount setting unit
117 parallel processing pattern creation unit
118 performance measurement unit
118a binary file placement unit
119 execution file creation unit
120 production environment placement unit
121 performance measurement test extraction execution unit
122 user provision unit
125 application code
131 test case DB
132 equipment resource DB
133 intermediate language file
151 various devices
152 device having CPU-GPU
153 device having CPU-FPGA
154 device having CPU
170 placement setting unit
213 PLD processing designation unit
214 arithmetic intensity calculation unit
215 PLD processing pattern creation unit
905 accelerator

Claims

An offload server that offloads specific processing of an application program to an accelerator, the offload server comprising:
an application code analysis unit that analyzes the source code of the application program;
a data transfer designation unit that analyzes the reference relationships of variables used in the loop statements of the application program and, for data that may be transferred outside a loop, designates the data transfer using an explicit specification line that explicitly specifies data transfer outside the loop;
a parallel processing specifying unit that identifies the loop statements of the application program, specifies a parallel processing specification statement in the accelerator for each of the identified loop statements, and compiles them;
a parallel processing pattern creation unit that excludes loop statements causing compilation errors from offload targets and creates parallel processing patterns that specify whether or not to execute parallel processing for loop statements that do not cause compilation errors;
a performance measurement unit that compiles the application program of the parallel processing pattern, places it in an accelerator verification device, and executes performance measurement processing when offloaded to the accelerator; and
a placement setting unit that, when the converted application program is placed on any of a cloud server, a carrier edge server, and a user edge server on the network according to the cost or response time conditions specified by the user, calculates and sets the placement location of the application program based on a linear programming formula with device and link costs, computational resource upper limits, and bandwidth upper limits as constraint conditions and with computational resource cost or response time as the objective function.
An offload server that offloads specific processing of an application program to a PLD (Programmable Logic Device), the offload server comprising:
an application code analysis unit that analyzes the source code of the application program;
a PLD processing designation unit that identifies the loop statements of the application program and, for each of the identified loop statements, creates and compiles a plurality of offload processing patterns in which pipeline processing and parallel processing in the PLD are specified in OpenCL;
an arithmetic intensity calculation unit that calculates the arithmetic intensity of the loop statements of the application program;
a PLD processing pattern creation unit that, based on the arithmetic intensity calculated by the arithmetic intensity calculation unit, narrows down loop statements whose arithmetic intensity is higher than a predetermined threshold as offload candidates and creates PLD processing patterns;
a performance measurement unit that compiles the application program of the created PLD processing pattern, places it in an accelerator verification device, and executes performance measurement processing when offloaded to the PLD; and
a placement setting unit that, when the converted application program is placed on any of a cloud server, a carrier edge server, and a user edge server on the network according to the cost or response time conditions specified by the user, calculates and sets the placement location of the application program based on a linear programming formula with device and link costs, computational resource upper limits, and bandwidth upper limits as constraint conditions and with computational resource cost or response time as the objective function.
3. The arrangement setting unit calculates the arrangement that minimizes the cost of computational resources or the arrangement that minimizes the response time when arranging the application programs on the server. offload server as described in .
The placement setting unit calculates the placement that minimizes the response time when the application program is placed on the server, according to the linear programming formula shown below.

3. The offload server according to claim 1 or 2, characterized by:
The placement setting unit calculates a placement that minimizes the cost of computational resources when the application program is placed on the server, according to the following linear programming formula:

3. The offload server according to claim 1 or 2, characterized by:
An offload control method for an offload server that offloads specific processing of an application program to an accelerator, wherein the offload server executes:
a step of analyzing the source code of the application program;
a step of analyzing the reference relationships of variables used in the loop statements of the application program and, for data that may be transferred outside a loop, designating the data transfer using an explicit specification line that explicitly specifies data transfer outside the loop;
a step of identifying the loop statements of the application program, specifying a parallel processing specification statement in the accelerator for each of the identified loop statements, and compiling them;
a step of excluding loop statements causing compilation errors from offload targets and creating parallel processing patterns that specify whether or not to perform parallel processing for loop statements that do not cause compilation errors;
a step of compiling the application program of the parallel processing pattern, placing it in an accelerator verification device, and executing performance measurement processing when offloaded to the accelerator; and
a step of, when the converted application program is placed on any of a cloud server, a carrier edge server, and a user edge server on the network according to the cost or response time conditions specified by the user, calculating and setting the placement location of the application program based on a linear programming formula with device and link costs, computational resource upper limits, and bandwidth upper limits as constraint conditions and with computational resource cost or response time as the objective function.
An offload control method for an offload server that offloads specific processing of an application program to a PLD (Programmable Logic Device), wherein the offload server executes:
a step of analyzing the source code of the application program;
a step of identifying the loop statements of the application program and, for each of the identified loop statements, creating and compiling a plurality of offload processing patterns in which pipeline processing and parallel processing in the PLD are specified in OpenCL;
a step of calculating the arithmetic intensity of the loop statements of the application program;
a step of, based on the calculated arithmetic intensity, narrowing down loop statements whose arithmetic intensity is higher than a predetermined threshold as offload candidates and creating PLD processing patterns;
a step of compiling the application program of the created PLD processing pattern, placing it in an accelerator verification device, and executing performance measurement processing when offloaded to the PLD; and
a step of, when the converted application program is placed on any of a cloud server, a carrier edge server, and a user edge server on the network according to the cost or response time conditions specified by the user, calculating and setting the placement location of the application program based on a linear programming formula with device and link costs, computational resource upper limits, and bandwidth upper limits as constraint conditions and with computational resource cost or response time as the objective function.
An offload program for causing a computer to function as the offload server according to any one of claims 1 to 5.