WO2022102071A1 - Offload server, offload control method, and offload program

Offload server, offload control method, and offload program

Info

Publication number: WO2022102071A1
Application number: PCT/JP2020/042342
Authority: WIPO (PCT)
Prior art keywords: offload, unit, processing, cpu, gpu
Other languages: English (en), Japanese (ja)
Inventor: Yoji Yamato (山登 庸次)
Original Assignee: Nippon Telegraph and Telephone Corporation (日本電信電話株式会社)
Application filed by Nippon Telegraph and Telephone Corporation
Priority to JP2022561797A (priority patent application JPWO2022102071A1)
Priority to PCT/JP2020/042342 (WO2022102071A1)
Publication of WO2022102071A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 - Error detection; Error correction; Monitoring
    • G06F 11/30 - Monitoring
    • G06F 11/34 - Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00 - Arrangements for software engineering
    • G06F 8/40 - Transformation of program code
    • G06F 8/41 - Compilation

Definitions

  • The present invention relates to an offload server, an offload control method, and an offload program that automatically offload functional processing to an accelerator such as an FPGA (Field Programmable Gate Array).
  • Non-Patent Document 1 describes environment-adaptive software in which code, once written, is automatically converted and resource settings and the like are performed so that GPUs, FPGAs, manycore CPUs, and other devices existing in the placement-destination environment can be used, and the application is operated with high performance.
  • Non-Patent Documents 2, 3, and 4 describe methods for automatically offloading loop statements and functional blocks of application code to an FPGA or GPU as elements of environment-adaptive software. Further, Non-Patent Document 5 describes a technique for automatic performance testing.
  • CUDA, NVIDIA (registered trademark)'s environment for GPGPU (General Purpose GPU), is widespread as an environment for applying the parallel computing power of GPUs to processing other than image processing.
  • OpenCL is a specification for handling heterogeneous devices such as FPGAs, manycore CPUs, and GPUs in a uniform way, and development environments for it are being put in place.
  • CUDA and OpenCL are programmed in extended forms of the C language, and the programming difficulty is high: for example, the copying and release of memory data between a kernel device such as an FPGA or GPU and the CPU host must be described explicitly.
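  • As a plain illustration of this explicit data management (a minimal sketch, not code from this publication; the kernel and sizes are assumptions of the example), a CUDA host program describes each copy and release itself:

      #include <cuda_runtime.h>
      #include <stdlib.h>

      __global__ void scale(float *a, int n) {  /* illustrative kernel */
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < n) a[i] *= 2.0f;
      }

      int main(void) {
          const int n = 1 << 20;
          size_t bytes = n * sizeof(float);
          float *host = (float *)malloc(bytes);
          for (int i = 0; i < n; i++) host[i] = (float)i;

          float *dev;
          cudaMalloc(&dev, bytes);                              /* allocate device memory */
          cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice); /* explicit host-to-device copy */
          scale<<<(n + 255) / 256, 256>>>(dev, n);
          cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost); /* explicit device-to-host copy */
          cudaFree(dev);                                        /* explicit release */
          free(host);
          return 0;
      }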
  • Yamato, "Proposal of Automatic Offloading for Functions of Applications," The 8th IIAE International Conference on Industrial Application Engineering 2020 (ICIAE 2020), pp. 4-11, Mar. 2020.
  • Yamato "Automatic verification technology of software patches for user virtual environments on IaaS cloud,” Journal of Cloud Computing, Springer, 2015, 4: 4, DOI: 10.1206 / s13677-015-0028-6, Feb. 2015.
  • In Non-Patent Documents 1 to 5, the amount of resources of the offload device after automatic offloading is not examined.
  • The present invention has been made in view of this point, and an object of the present invention is to make it possible to set appropriate amounts of CPU and offload device resources in automatic offloading.
  • To achieve this, the present invention provides an offload server that offloads specific processing of an application program from a CPU to an offload device, the offload server comprising: an application code analysis unit that analyzes the source code of the application program; a data transfer specification unit that analyzes the reference relationships of variables used in loop statements of the application program and, for data that may be transferred outside a loop, specifies the data transfer using an explicit specification line that explicitly designates data transfer outside the loop; a parallel processing specification unit that identifies the loop statements of the application program and, for each identified loop statement, specifies a parallel processing specification statement in the offload device and compiles it; a parallel processing pattern creation unit that creates parallel processing patterns which exclude loop statements causing compilation errors from the offload targets and specify, for the loop statements not causing compilation errors, whether to perform parallel processing; a performance measurement unit that compiles the application program for each parallel processing pattern, places it on an accelerator verification device, and executes performance measurement processing for the case where it is offloaded to the offload device; a resource ratio determination unit that determines, based on the performance measurement results, the ratio of the processing times of the CPU and the offload device as the resource ratio; and a resource amount setting unit that sets the resource amounts of the CPU and the offload device so as to satisfy a predetermined cost condition, based on the determined resource ratio.
  • One figure is a flowchart of the case where the control unit (automatic offload function unit) of the offload server according to the second embodiment of the present invention executes <Process A-2>, <Process B-2>, and <Process C-2> in the offload processing of functional blocks. Another is a hardware configuration diagram showing an example of a computer that realizes the functions of the offload server according to the embodiments of the present invention.
  • Hereinafter, an offload server and the like in a mode for carrying out the present invention (hereinafter referred to as "the present embodiment") will be described.
  • The present inventor has proposed methods for automatic GPU offloading of program loop statements, automatic FPGA offloading, and automatic offloading of functional blocks of a program (see Non-Patent Documents 2, 3, and 4). The basic idea of the present invention is described below with reference to the study of these elemental techniques.
  • A normal program can be automatically offloaded to an offload device such as a GPU or FPGA by a method such as that of Non-Patent Document 2.
  • GPUs have come to be virtualized in the same way as CPUs, and it is becoming possible to allocate a percentage of all GPU cores.
  • For FPGAs, resource usage is often expressed by the numbers of configured Look-Up Tables and Flip-Flops, and unused gates can be used for other purposes.
  • An application can be converted into code for CPU and GPU processing by using a method such as that of Non-Patent Document 2.
  • However, even if the converted code itself is appropriate, performance will not be obtained unless the resource amounts of the CPU and GPU are appropriately balanced. For example, if the CPU processing time for a certain task is 1000 seconds and the GPU processing time is 1 second, then even though the offloadable processing is accelerated to some extent by the GPU, the CPU becomes the bottleneck of the whole.
  • There is prior work that, when tasks are processed by the MapReduce (registered trademark) framework using both a CPU and a GPU, allocates Map tasks so that the execution times on the CPU and the GPU become equal, thereby improving overall performance.
  • In view of this, the present inventor came up with the idea of determining the resource ratio between the CPU and the offload device as follows: in order to avoid a processing bottleneck on either device, the resource ratio between the CPU and the offload device (hereinafter referred to as the "resource ratio") is determined from the processing times of a test case, with reference to the above non-patent documents, so that the processing times of the CPU and the offload device are of the same order.
  • As in the method of Non-Patent Document 2, the present inventor adopts an approach of gradually improving speed based on performance measurement results in a verification environment at the time of automatic offloading.
  • The reason is that performance varies greatly not only with the code structure but also with the actual processing content, such as the specifications of the hardware that actually performs the processing, the data size, and the number of loop iterations; performance is therefore difficult to predict statically and requires dynamic measurement. Since performance measurement results in the verification environment already exist at the time of code conversion, the resource ratio is determined using those results.
  • For example, when the processing times of the test case in the verification environment are 10 seconds for CPU processing and 5 seconds for GPU processing, doubling the resources on the CPU side would make the processing times about the same, so the resource ratio is set to 2:1. A minimal sketch of this determination follows.
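  • The following is a minimal C++ sketch of this ratio determination, under the assumptions stated in this description (same-order processing times, rounding to an integer ratio, and the upper limit introduced later); the function name is illustrative, not from this publication:

      #include <algorithm>
      #include <cmath>
      #include <utility>

      // Determine an integer CPU:device resource ratio from measured test-case
      // processing times, capping the ratio at an upper limit (e.g., 5).
      std::pair<int, int> resourceRatio(double cpuSeconds, double deviceSeconds,
                                        int upperLimit = 5) {
          if (cpuSeconds >= deviceSeconds) {
              int r = (int)std::lround(cpuSeconds / deviceSeconds); // e.g., 10s/5s -> 2
              return { std::min(std::max(r, 1), upperLimit), 1 };   // e.g., 2:1
          }
          int r = (int)std::lround(deviceSeconds / cpuSeconds);
          return { 1, std::min(std::max(r, 1), upperLimit) };
      }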
  • A user's request to speed up a certain process by offloading is reflected by preparing a test case that includes the process and speeding up that test case by a method such as that of Non-Patent Document 2.
  • Next, the determination and automatic verification of the resource amounts of the CPU and the offload device (hereinafter referred to as the "resource amounts") are described.
  • After the resource ratio is determined by the above <adjustment of the resource ratio between the CPU and the offload device>, the application is next placed in the commercial environment.
  • At that time, the resource amounts are determined while keeping (maintaining) the resource ratio as much as possible so as to meet the cost requirement specified by the user. For example, for the CPU, assume that one VM costs 1,000 yen/month, a GPU instance costs 4,000 yen/month, and the resource ratio is 2:1.
  • If the user's budget is within 10,000 yen per month, the resource ratio of 2:1 can be kept by securing 2 CPU VMs and 1 GPU instance (6,000 yen/month). If the user's budget is less than 5,000 yen a month, the appropriate resource ratio of 2:1 cannot be kept; in this case, "1" is secured for the CPU and "1" for the GPU as the resource amounts.
  • Even if processing is offloaded, it is meaningless if the calculation results are invalid; therefore, the difference in calculation results from the case without offloading is also checked.
  • For checking the calculation results, for example, the PGI compiler that handles GPU processing has a function called PCAST (registered trademark), and checks can be made with APIs (Application Programming Interfaces) called pgi_compare (registered trademark) or acc_compare (registered trademark). It should be noted that, because rounding errors differ between the GPU and the CPU, the calculation results may not match exactly even when parallel processing or the like is offloaded correctly. Therefore, for example, the check is made in accordance with the IEEE 754 specification, the acceptable difference is presented to the user, and the user is asked to confirm it.
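  • As a plain illustration of such a result check (a minimal sketch using a relative tolerance rather than the PCAST API itself; the tolerance value is an assumption to be agreed with the user):

      #include <cmath>
      #include <cstdio>

      // Compare CPU and offloaded results element by element and report the
      // worst relative difference so the user can judge whether it is acceptable.
      bool resultsMatch(const double *cpu, const double *dev, int n,
                        double relTol = 1e-9) {
          double worst = 0.0;
          for (int i = 0; i < n; i++) {
              double denom = std::fmax(std::fabs(cpu[i]), 1.0);
              worst = std::fmax(worst, std::fabs(cpu[i] - dev[i]) / denom);
          }
          std::printf("max relative difference: %g\n", worst);
          return worst <= relTol;  /* present 'worst' to the user if this fails */
      }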
  • <Resources, resource ratio, and test case processing time> The resources, resource ratio, and test case processing time in this embodiment will be described.
  • - Resources: CPUs, GPUs, FPGAs, and the like are provided as instances of virtualized resources.
  • Resource units are, for the CPU, the number of cores, clock, memory amount, and disk size; for the GPU, the number of cores, clock, and memory amount; and for the FPGA, the gate scale (expressed in LEs (registered trademark) for Intel (registered trademark) and in LCs (registered trademark) for Xilinx (registered trademark)). Businesses such as cloud providers package these and provide them in forms such as small-size virtual machines and GPU instances. In the case of virtualization, the number of instances used can be regarded as the amount of resources used.
  • - Resource ratio: the ratio of the numbers of CPU, GPU, and FPGA instances. If the numbers of instances are one, two, and three, the resource ratio is 1:2:3.
  • - Test case processing time: an offload pattern that speeds up the test case specified by the user is searched for and discovered.
  • The test case is, for example, the number of processed transactions, as in TPC-C (registered trademark), in the case of a DB (database), or the execution of Fourier transform processing on sample data in the case of FFT.
  • The processing time is the execution time when the sample processing is executed; for example, the processing time of process A is 10 seconds before offloading and 2 seconds after offloading, and both values are obtained.
  • FIG. 1 is a functional block diagram showing a configuration example of the offload server 1 according to the first embodiment of the present invention.
  • The offload server 1 is a device that automatically offloads specific processing of an application to an accelerator. As shown in FIG. 1, the offload server 1 includes a control unit 11, an input/output unit 12, a storage unit 13, and a verification machine 14 (accelerator verification device).
  • The input/output unit 12 consists of a communication interface for transmitting and receiving information to and from each device, and input/output interfaces for exchanging information with input devices such as a touch panel and keyboard and output devices such as a monitor.
  • The storage unit 13 is composed of a hard disk, flash memory, RAM (Random Access Memory), and the like, and temporarily stores a program (offload program) for executing each function of the control unit 11 and information necessary for the processing of the control unit 11 (for example, an intermediate language file (Intermediate file) 133).
  • the storage unit 13 includes a test case DB (Test case database) 131, an equipment resource DB 132, and an intermediate language file (Intermediate file) 133.
  • the test case DB 131 stores the data of the test items corresponding to the software to be verified.
  • the test item data is, for example, transaction test data such as TPC-C in the case of a database system such as MySQL.
  • The equipment resource DB 132 holds information, prepared in advance, on the resources and prices of servers and the like held by the business operator, and on how much of them is in use. For example: there are 10 servers that can each accommodate 3 GPU instances, 1 GPU instance costs 5,000 yen per month, and of the 10 servers, the 3 servers A, B, and C are in use. This information is used to determine the amount of resources to secure when the user specifies operating conditions (conditions such as cost and performance).
  • The user operation conditions are the cost condition (for example, a budget within 10,000 yen per month) and the performance condition (for example, a TPC-C transaction throughput target, or completing one thread of sample Fourier transform processing within a few seconds) specified by the user at the time of the offload request.
  • the intermediate language file 133 temporarily stores information necessary for processing of the control unit 11 in the form of a programming language intervening between the high-level language and the machine language.
  • the verification machine 14 includes a CPU, GPU, and FPGA as a verification environment for environment-adaptive software.
  • the control unit 11 is an automatic offloading function unit that controls the entire offload server 1.
  • the control unit 11 is realized, for example, by a CPU (Central Processing Unit) (not shown) deploying and executing an application program (offload program) stored in the storage unit 13 in a RAM.
  • The control unit 11 includes an application code specification unit (Specify application code) 111, an application code analysis unit (Analyze application code) 112, a data transfer specification unit 113, a parallel processing specification unit 114, a resource ratio determination unit 115, a resource amount setting unit 116, a parallel processing pattern creation unit 117, a performance measurement unit 118, an executable file creation unit 119, a production environment placement unit (Deploy final binary files to production environment) 120, a performance measurement test extraction execution unit (Extract performance test cases and run automatically) 121, and a user providing unit (Provide price and performance to a user to judge) 122.
  • The application code specification unit 111 specifies the input application code. Specifically, it passes the application code described in the received file to the application code analysis unit 112.
  • The application code analysis unit 112 analyzes the source code of the processing function and grasps the structures of loop statements, FFT library calls, and the like.
  • The data transfer specification unit 113 analyzes the reference relationships of variables used in the loop statements of the application program and, for data that may be transferred outside a loop, specifies the data transfer using an explicit specification line that explicitly designates data transfer outside the loop. An illustrative sketch follows.
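  • As an illustration (a minimal sketch, not code from this publication): in OpenACC, such an explicit specification line corresponds to a data directive placed outside the loop, so arrays are transferred once rather than at every kernel launch; the array names and computation are assumptions:

      #include <stdlib.h>

      void process(int n, int iterations) {
          double *a = (double *)malloc(n * sizeof(double));
          double *b = (double *)malloc(n * sizeof(double));
          for (int i = 0; i < n; i++) a[i] = (double)i;

          /* Explicit specification line: transfer a[] in and b[] out once,
             outside the loop, instead of at every kernel launch inside it. */
          #pragma acc data copyin(a[0:n]) copyout(b[0:n])
          {
              for (int iter = 0; iter < iterations; iter++) {
                  #pragma acc kernels
                  for (int i = 0; i < n; i++)
                      b[i] = a[i] * 2.0 + iter;
              }
          }
          free(a);
          free(b);
      }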
  • The parallel processing specification unit 114 identifies the loop statements (repetition statements) of the application program and, for each loop statement, specifies a parallel processing specification statement in the accelerator and compiles it.
  • The parallel processing specification unit 114 includes an offload range extraction unit (Extract offloadable area) 114a and an intermediate language file output unit (Output intermediate file) 114b.
  • the offload range extraction unit 114a specifies a process that can be offloaded to the GPU / FPGA, such as a loop statement or FFT, and extracts an intermediate language corresponding to the offload process.
  • the intermediate language file output unit 114b outputs the extracted intermediate language file 133.
  • Intermediate language extraction is not a one-off process; it is repeated in order to try and optimize execution for an appropriate offload area search.
  • Based on the performance measurement results (described later), the resource ratio determination unit 115 determines the ratio of the processing times of the CPU and the offload device (test case CPU processing time and offload device processing time) as the resource ratio. Specifically, the resource ratio determination unit 115 determines the resource ratio so that the processing times of the CPU and the offload device are of the same order. Further, when the difference between the processing times of the CPU and the offload device is equal to or greater than a predetermined threshold, the resource ratio determination unit 115 caps the resource ratio at a predetermined upper limit.
  • Based on the determined resource ratio, the resource amount setting unit 116 sets the resource amounts of the CPU and the offload device so as to satisfy a predetermined cost condition (described later). Specifically, the resource amount setting unit 116 maintains the determined resource ratio and sets the maximum resource amounts that satisfy the cost condition. If even the minimum resource amounts that maintain the determined resource ratio do not satisfy the cost condition, the resource amount setting unit 116 breaks the resource ratio and sets the resource amounts of the CPU and the offload device to smaller values (for example, the minimum) that satisfy the cost condition.
  • The parallel processing pattern creation unit 117 excludes loop statements (repetition statements) in which a compilation error occurs from the offload targets, and creates parallel processing patterns that specify, for the repetition statements in which no compilation error occurs, whether to perform parallel processing.
  • the performance measurement unit 118 compiles the application program of the parallel processing pattern, arranges it on the verification machine 14, and executes the performance measurement processing when it is offloaded to the accelerator.
  • the performance measurement unit 118 includes a binary file arrangement unit (Deploy binary files) 118a.
  • the binary file placement unit 118a deploys (places) an executable file derived from an intermediate language on a verification machine 14 equipped with a GPU / FPGA.
  • the performance measurement unit 118 executes the placed binary file, measures the performance when offloaded, and returns the performance measurement result to the offload range extraction unit 114a.
  • Based on the returned result, the offload range extraction unit 114a extracts another parallel processing pattern, and the intermediate language file output unit 114b tries a performance measurement based on the extracted intermediate language (see reference sign a in FIG. 2, described below).
  • The executable file creation unit 119 selects multiple high-performance parallel processing patterns from the parallel processing patterns based on the performance measurement results repeated a predetermined number of times, and creates new parallel processing patterns by crossing over the high-performance patterns and applying mutation processing. The executable file creation unit 119 then performs new performance measurements and, after the specified number of measurements, selects the parallel processing pattern with the highest processing performance based on the measurement results and compiles that pattern to create an executable file.
  • The production environment placement unit 120 places the created executable file in the production environment for the user ("placement of the final binary file in the production environment"). Specifically, it determines a pattern specifying the final offload areas and deploys it in the production environment for users.
  • After the executable file is placed, the performance measurement test extraction execution unit 121 extracts performance test items from the test case DB 131 and automatically executes the extracted performance tests in order to show the performance to the user.
  • Based on the performance test results, the user providing unit 122 presents information such as price and performance to the user ("providing information such as price and performance to the user"). The performance test items are stored in the test case DB 131. Specifically, the user providing unit 122 presents data such as price and performance to the user, together with the performance test results, based on the execution results of the performance tests corresponding to the test items stored in the test case DB 131.
  • The user decides whether to start the charged use of the service based on the presented information such as price and performance.
  • Non-Patent Literature: Y. Yamato, M. Muroi, K. Tanaka and M.
  • the offload server 1 can use GA (Genetic Algorithms) for offload optimization.
  • The configuration of the offload server 1 when a GA is used is as follows. That is, based on the genetic algorithm, the parallel processing specification unit 114 sets the number of loop statements (repetition statements) in which no compilation error occurs as the gene length.
  • The parallel processing pattern creation unit 117 maps whether or not to perform accelerator processing to a gene pattern, setting either 1 or 0 when accelerator processing is performed and the other value (0 or 1) when it is not.
  • The parallel processing pattern creation unit 117 prepares gene patterns for a specified number of individuals, in which each gene value is randomly set to 1 or 0.
  • the performance measurement unit 118 compiles the application code in which the parallel processing specification statement in the accelerator is specified according to each individual, and arranges it on the verification machine 14.
  • the performance measurement unit 118 executes the performance measurement process on the verification machine 14.
  • For genes with the same pattern as before, the performance measurement unit 118 uses the same measurement value as before, without compiling the application code corresponding to the parallel processing pattern or measuring its performance. Further, for application code in which a compilation error occurs and application code in which the performance measurement does not finish within a predetermined time, the performance measurement unit 118 treats the case as a timeout and sets the performance measurement value to a predetermined (long) time.
  • Executable file creation unit 119 measures the performance of all individuals and evaluates them so that the shorter the processing time, the higher the goodness of fit.
  • From all the individuals, the executable file creation unit 119 selects those whose goodness of fit is higher than a predetermined value (for example, the top n% of all individuals, or the top m individuals, where n and m are natural numbers) as high-performance individuals, and performs crossover and mutation on the selected individuals to create next-generation individuals.
  • the executable file creation unit 119 selects the highest-performance parallel processing pattern as a solution after the processing of the specified number of generations is completed.
  • FIG. 2 is a diagram showing an automatic offload process using the offload server 1.
  • the offload server 1 is applied to the elemental techniques of environment-adaptive software.
  • the offload server 1 has a control unit (automatic offload function unit) 11, a test case DB 131, an equipment resource DB 132, an intermediate language file 133, and a verification machine 14.
  • the offload server 1 acquires the application code 125 used by the user.
  • the user is, for example, a person who has contracted to use various devices (Device 151, a device 152 having a CPU-GPU, a device 153 having a CPU-FPGA, and a device 154 having a CPU).
  • the offload server 1 automatically offloads the functional processing to the accelerator of the device 152 having the CPU-GPU and the device 153 having the CPU-FPGA.
  • <Step S11: Specify application code>
  • the application code designation unit 111 passes the application code described in the received file to the application code analysis unit 112.
  • <Step S12: Analyze application code>
  • the application code analysis unit 112 analyzes the source code of the processing function and grasps the structure of the loop statement, the FFT library call, and the like.
  • <Step S13: Extract offloadable area>
  • the parallel processing specification unit 114 specifies a loop statement (repetition statement) of the application, specifies a parallel processing specification statement in the accelerator for each repetition statement, and compiles it.
  • the offload range extraction unit 114a identifies a process that can be offloaded to the GPU / FPGA, such as a loop statement or FFT, and extracts an intermediate language corresponding to the offload process.
  • <Step S14: Output intermediate file>
  • the intermediate language file output unit 114b (see FIG. 1) outputs the intermediate language file 133.
  • Intermediate language extraction is not a one-off process; it is repeated in order to try and optimize execution for an appropriate offload area search.
  • <Step S15: Compile error>
  • When a compilation error occurs, the parallel processing pattern creation unit 117 excludes the loop statement causing the error from the offload targets and creates parallel processing patterns that specify, for the repetition statements in which no compilation error occurs, whether to perform parallel processing.
  • <Step S21: Deploy binary files>
  • the binary file arrangement unit 118a (see FIG. 1) deploys an executable file derived from an intermediate language on the verification machine 14 equipped with the GPU / FPGA.
  • <Step S22: Measure performances>
  • The performance measurement unit 118 executes the placed file and measures the performance when it is offloaded. In order to make the offload areas more appropriate, this performance measurement result is returned to the offload range extraction unit 114a, the offload range extraction unit 114a extracts another pattern, and the intermediate language file output unit 114b tries a performance measurement based on the extracted intermediate language (see reference sign a in FIG. 2).
  • the control unit 11 repeatedly executes the steps S12 to S22.
  • The automatic offload function of the control unit 11 is summarized below. That is, the parallel processing specification unit 114 identifies the loop statements (repetition statements) of the application program, specifies a parallel processing specification statement in the GPU for each repetition statement, and compiles it. Then, the parallel processing pattern creation unit 117 creates parallel processing patterns that exclude loop statements causing compilation errors from the offload targets and specify, for the loop statements not causing compilation errors, whether to perform parallel processing.
  • the binary file arrangement unit 118a compiles the application program of the corresponding parallel processing pattern and arranges it on the verification machine 14, and the performance measurement unit 118 executes the performance measurement processing on the verification machine 14.
  • The executable file creation unit 119 selects the pattern with the highest processing performance from the parallel processing patterns based on the performance measurement results repeated a predetermined number of times, compiles the selected pattern, and creates an executable file.
  • <Step S23: Resource amount setting according to user operation conditions>
  • The control unit 11 sets the resource amounts according to the user operation conditions. That is, the resource ratio determination unit 115 of the control unit 11 determines the resource ratio between the CPU and the offload device. Then, based on the determined resource ratio, the resource amount setting unit 116 refers to the information in the equipment resource DB 132 and sets the resource amounts of the CPU and the offload device so as to satisfy the user operation conditions (described later with reference to FIG. 5).
  • <Step S24: Deploy final binary files to production environment>
  • the production environment placement unit 120 determines a pattern in which the final offload area is specified and deploys it in the production environment for users.
  • <Step S25: Extract performance test cases and run automatically>
  • the performance measurement test extraction execution unit 121 extracts performance test items from the test case DB 131 and automatically executes the extracted performance tests in order to show the performance to the user after arranging the execution file.
  • <Step S26: Provide price and performance to a user to judge>
  • the user providing unit 122 presents information such as price and performance to the user based on the performance test result.
  • the user determines to start charging for the service based on the presented information such as price and performance.
  • The above steps S11 to S26 are performed, for example, in the background of the user's service use, and are assumed to be performed, for example, within the first day of a trial use.
  • When applied to the elemental techniques of environment-adaptive software, the control unit (automatic offload function unit) 11 of the offload server 1 extracts the areas to be offloaded from the source code of the application program used by the user and outputs an intermediate language, in order to offload functional processing (steps S11 to S15).
  • The control unit 11 places and executes the executable file derived from the intermediate language on the verification machine 14 to verify the offload effect (steps S21 to S22). After repeating the verification and determining appropriate offload areas, the control unit 11 deploys the executable file in the production environment actually provided to the user and provides it as a service (steps S23 to S26).
  • The GPU automatic offload is processing that repeats steps S12 to S22 of FIG. 2 for a GPU in order to finally obtain the offload code to be deployed in step S23.
  • A GPU is a device that generally does not guarantee latency but is suited to raising throughput through parallel processing.
  • Typical examples are encryption processing, image processing for camera image analysis, machine learning processing for mass sensor data analysis, and the like, and many of them are repetitive processing. Therefore, we aim to speed up by automatically offloading the repeated statements of the application to the GPU.
  • In this embodiment, appropriate offload areas are automatically extracted from a general-purpose program that was not written with parallelization in mind. To do so, parallelizable for statements are first checked, and then, for the group of parallelizable for statements, performance verification trials are repeated in the verification environment using a GA to search for appropriate areas.
  • By holding and recombining parallel processing patterns that may achieve speedup in the form of gene parts, patterns that can be accelerated efficiently can be searched for from among the enormous number of possible parallel processing patterns.
  • FIG. 3 is a diagram showing a search image of processing of the control unit (automatic offload function unit) 11 by Simple GA and gene sequence mapping of a for statement.
  • GA is one of the combinatorial optimization methods that imitates the evolutionary process of living organisms.
  • The flow of GA is: initialization -> evaluation -> selection -> crossover -> mutation -> end determination.
  • In this embodiment, Simple GA, a simplified version of GA processing, is used. Simple GA is a GA simplified such that genes consist only of 1s and 0s, selection is roulette selection, crossover is one-point crossover, and mutation flips the value of a single gene.
  • <Initialization> After checking whether all for statements in the application code can be parallelized, the parallelizable for statements are mapped to a gene sequence: 1 for GPU processing, 0 for no GPU processing. A specified number M of individuals is prepared, with 1 or 0 randomly assigned to each for statement. Specifically, the control unit (automatic offload function unit) 11 (see FIG. 1) acquires the application code 125 (see FIG. 2) used by the user and, as shown in FIG. 3, checks whether the for statements can be parallelized from the code patterns 141 of the application code 125. When three for statements are found in the code pattern 141 (see reference sign b in FIG. 3), one digit is assigned to each for statement, here three digits for the three for statements, and 1 or 0 is randomly assigned to each digit; for example, 0 when processed by the CPU and 1 when sent to the GPU. The gene length is thus 3 digits, and a 3-digit gene length yields 2^3 = 8 patterns, for example 100, 101, and so on. In FIG. 3, the circles in the code pattern 141 are shown as an image of the code. A minimal sketch of this mapping follows.
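  • As an illustration of this mapping (a minimal sketch; the loop bodies and the gene value 101 are assumptions for the example), the gene 101 would send the first and third for statements to the GPU by inserting directives, while the second remains on the CPU:

      /* Gene 101: for statements 1 and 3 get a GPU directive, statement 2 does not. */
      void compute(float *a, float *b, float *c, int n) {
          #pragma acc kernels            /* gene digit 1 -> GPU */
          for (int i = 0; i < n; i++)
              a[i] = a[i] * 2.0f;

          for (int i = 0; i < n; i++)    /* gene digit 0 -> CPU */
              b[i] = b[i] + a[i];

          #pragma acc kernels            /* gene digit 1 -> GPU */
          for (int i = 0; i < n; i++)
              c[i] = a[i] * b[i];
      }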
  • <Selection> In selection, high-performance code patterns are selected based on the goodness of fit (see reference sign d in FIG. 3).
  • Based on the goodness of fit, the performance measurement unit 118 selects genes with high goodness of fit, up to a specified number of individuals. In this embodiment, roulette selection according to the goodness of fit and elite selection of the gene with the highest goodness of fit are performed. In FIG. 3, the reduction of the number of circles in the selected code patterns 142 to three is shown as a search image.
  • <Mutation> Mutation is introduced to avoid local solutions. Note that a mode without mutation may also be used. In mutation, each value of an individual's gene is flipped from 0 to 1 or from 1 to 0 at a fixed mutation rate Pm.
  • In this embodiment, OpenACC, a specification for GPU processing of C/C++ code, is used: by marking code with the directive #pragma acc kernels, a compiler that extracts and executes bytecode for the GPU enables GPU offloading. Note that GPU processing may instead be specified with CUDA or a Java lambda expression.
  • Through the above, a code pattern with a gene length corresponding to the number of for statements is obtained. Parallel processing patterns such as 100, 010, 001, ... are randomly assigned to the initial individuals. GA processing is then performed, beginning with compilation.
  • At compilation, an error may occur even though the for statement itself is offloadable, for example when for statements are nested hierarchically (GPU processing is possible if either level is specified). In such cases, the for statement that caused the error may be retained for later trials.
  • The executable is deployed on the verification machine 14 and benchmarked; for example, if the application is image processing, it is benchmarked with that image processing.
  • The goodness of fit is based on the reciprocal of the processing time: for example, a processing time of 10 seconds corresponds to 1, 100 seconds to 0.1, and 1 second to 10.
  • Individuals with high goodness of fit are selected, for example 3 to 5 out of 10, and recombined to create new code patterns. During creation, the same pattern as before may be produced; in that case, the same benchmark need not be run again, and the same measurement data as before is used.
  • the code pattern and its processing time are stored in the storage unit 13.
  • In step S100, the control unit 11 (see FIG. 1) receives the application code and the operating conditions. The operating conditions are, for example, the user operation conditions, and the control unit 11 (see FIG. 1) acquires them with reference to the information of the equipment resource DB 132 (see FIG. 1).
  • step S101 the application code analysis unit 112 (see FIG. 1) analyzes the code of the application program.
  • step S102 the parallel processing designation unit 114 (see FIG. 1) specifies a loop statement and a reference relationship of the application program.
  • In step S103, the parallel processing specification unit 114 runs a benchmark tool, grasps the number of loop iterations of each loop statement, and sorts the loop statements by a threshold.
  • step S104 the parallel processing designation unit 114 checks the parallel processing possibility of each loop statement.
  • the control unit (automatic offload function unit) 11 repeats the processing of steps S106 to S107 for the number of loop statements between the loop start end of step S105 and the loop end of step S108.
  • the parallel processing specification unit 114 compiles or interprets each loop statement by designating the GPU processing by a method according to the language.
  • the parallel processing designation unit 114 deletes the GPU processing designation from the corresponding for statement when an error occurs.
  • the parallel processing designation unit 114 counts the number of for statements that do not generate a compile error and sets the gene length.
  • the parallel processing designation unit 114 prepares a gene sequence of a designated number of individuals. Here, 0 and 1 are randomly assigned and created.
  • the parallel processing designation unit 114 maps the code of the application program to the gene.
  • Specifically, a specified number of individual patterns is prepared by mapping gene sequences in which 0 and 1 are randomly assigned. According to each prepared gene sequence, when a gene value is 1, a directive specifying parallel processing is inserted into the code of the application program (see, for example, the #pragma directive in FIG. 3).
  • The control unit (automatic offload function unit) 11 repeats the processing of steps S112 to S119 for the specified number of generations between the loop start of step S111 and the loop end of step S120. Within each generation, the processing of steps S113 to S116 is further repeated for the specified number of individuals between the loop start of step S112 and the loop end of step S117. That is, the repetition over the specified number of individuals is nested inside the repetition over the specified number of generations.
  • In step S113, the data transfer specification unit 113 specifies data transfer by a method appropriate to the language, based on the variable reference relationships.
  • In step S114, the parallel processing pattern creation unit 117 (see FIG. 1) compiles or interprets, on the GPU processing platform, the code in which directives are specified according to the gene pattern. That is, the parallel processing pattern creation unit 117 compiles or interprets the created application program code with the PGI compiler on the verification machine 14 equipped with a GPU.
  • Note that a compilation error may occur, for example, when multiple nested for statements are specified for parallelization. In that case, the individual is treated in the same way as when the processing time at performance measurement times out.
  • step S115 the performance measurement unit 118 (see FIG. 1) deploys the executable file on the verification machine 14 equipped with the CPU-GPU.
  • step S116 the performance measuring unit 118 executes the arranged binary file and measures the benchmark performance when offloaded.
  • For genes with the same pattern as before, the same measurement value as before is used without compiling the individual or measuring its performance.
  • The executable file creation unit 119 evaluates individuals such that the shorter the processing time, the higher the goodness of fit, and selects individuals with high performance.
  • step S119 the executable file creation unit 119 performs crossover and mutation processing on the selected individual to create a next-generation individual.
  • Specifically, the selected individuals are subjected to the GA processing of crossover, mutation, and direct copy to create the next-generation population.
  • In step S121, after the GA processing for the specified number of generations is completed, the executable file creation unit 119 takes the C/C++ code corresponding to the highest-performance gene sequence (the highest-performance parallel processing pattern) as the solution.
  • the above-mentioned number of individuals, number of generations, crossover rate, mutation rate, fitness setting, and selection method are parameters of GA.
  • the GA parameter may be set as follows, for example.
  • The parameters and conditions of the Simple GA to be executed can be set, for example, as follows.
    Gene length: number of parallelizable loop statements
    Number of individuals M: no more than the gene length
    Number of generations T: no more than the gene length
    Goodness of fit: (processing time)^(-1/2)
    With this setting, the shorter the benchmark processing time, the higher the goodness of fit. In addition, setting the goodness of fit to (processing time)^(-1/2) prevents the goodness of fit of a particular individual with a short processing time from becoming too high and narrowing the search range.
  • When the performance measurement does not finish within a certain time, it is timed out and the goodness of fit is calculated by treating the processing time as a long time such as 1000 seconds. This timeout time may be changed according to the performance measurement characteristics.
    Selection: roulette selection. However, the gene with the highest goodness of fit in a generation is preserved into the next generation without crossover or mutation (elite preservation).
    Crossover rate Pc: 0.9
    Mutation rate Pm: 0.05
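  • The following is a minimal C++ sketch of the Simple GA loop under the parameters stated above; measureBenchmark() is a toy stand-in for the real compile, deploy, and measure step on the verification machine, and all names are illustrative rather than from this publication:

      #include <algorithm>
      #include <cmath>
      #include <cstdlib>
      #include <map>
      #include <string>
      #include <vector>

      const double TIMEOUT_SEC = 1000.0;  // processing time assumed on timeout or compile error

      // Toy stand-in for the real compile -> deploy -> benchmark step on the
      // verification machine: here, more offloaded loops simply run faster.
      double measureBenchmark(const std::string &gene) {
          double t = 100.0;
          for (char c : gene) if (c == '1') t *= 0.8;
          return t;  // a real implementation would return TIMEOUT_SEC on error
      }

      double fitnessOf(double seconds) { return std::pow(seconds, -0.5); }  // (time)^(-1/2)

      std::string runSimpleGA(int geneLength, int M, int T,
                              double Pc = 0.9, double Pm = 0.05) {
          std::map<std::string, double> memo;  // reuse results for already-measured patterns
          auto evaluate = [&](const std::string &g) {
              auto it = memo.find(g);
              if (it != memo.end()) return it->second;  // same pattern as before: reuse value
              return memo[g] = fitnessOf(measureBenchmark(g));
          };

          std::vector<std::string> pop(M, std::string(geneLength, '0'));
          for (auto &g : pop)
              for (auto &c : g) c = (std::rand() % 2) ? '1' : '0';  // random 1/0 initialization

          std::string best;
          double bestFit = -1.0;
          for (int gen = 0; gen < T; gen++) {
              std::vector<double> fit(M);
              double total = 0.0;
              for (int i = 0; i < M; i++) { fit[i] = evaluate(pop[i]); total += fit[i]; }
              for (int i = 0; i < M; i++)
                  if (fit[i] > bestFit) { bestFit = fit[i]; best = pop[i]; }

              auto roulette = [&]() {  // roulette selection proportional to goodness of fit
                  double r = total * std::rand() / RAND_MAX;
                  for (int i = 0; i < M; i++) { r -= fit[i]; if (r <= 0.0) return pop[i]; }
                  return pop[M - 1];
              };

              std::vector<std::string> next = { best };  // elite preservation
              while ((int)next.size() < M) {
                  std::string a = roulette(), b = roulette();
                  if (geneLength > 1 && (double)std::rand() / RAND_MAX < Pc) {
                      int p = 1 + std::rand() % (geneLength - 1);   // one-point crossover
                      std::swap_ranges(a.begin() + p, a.end(), b.begin() + p);
                  }
                  for (auto &c : a)                                  // mutation at rate Pm
                      if ((double)std::rand() / RAND_MAX < Pm) c = (c == '0') ? '1' : '0';
                  next.push_back(a);
              }
              pop = next;
          }
          return best;  // highest-performance parallel processing pattern
      }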
  • For offload trials, applications that have many loops and long execution times are identified in advance using gcov, gprof, or the like, and trials are performed on them. This makes it possible to efficiently find applications that can be sped up.
  • FIG. 5 is a flowchart of the setting of the resource ratio and resource amounts that is added after a GPU or FPGA offload trial.
  • The flowchart shown in FIG. 5 is executed after the GPU offload trial shown in FIGS. 4A and 4B, and after the GPU and FPGA offload trials shown in FIGS. 7 and 8 described later.
  • step S51 the resource ratio determination unit 115 acquires the user operation condition, the test case CPU processing time, and the offload device processing time.
  • The user operation conditions are specified by the user when specifying the code to be offloaded.
  • the user operation condition is used when the resource amount setting unit 116 refers to the information of the equipment resource DB 132 and determines the resource amount.
  • In step S52, the resource ratio determination unit 115 determines, based on the performance measurement results, the ratio of the processing times of the CPU and the offload device (test case CPU processing time and offload device processing time) as the resource ratio.
  • Specifically, the resource ratio determination unit 115 determines the resource ratio so that the processing times of the CPU and the offload device are of the same order. By doing so, the processing times of the CPU and the offload device can be aligned, and the resource amounts can be set appropriately even in a mixed environment of a CPU and accelerators such as a GPU, FPGA, or manycore CPU.
  • Further, the resource ratio determination unit 115 caps the resource ratio at a predetermined upper limit when the difference between the processing times of the CPU and the offload device is equal to or greater than a predetermined threshold. That is, if the processing times of the CPU and the offload device in the verification environment differ by, for example, a factor of 10 or more, raising the resource ratio tenfold or more would worsen cost performance, so an upper limit such as a resource ratio of 5:1 is applied.
  • In step S53, the resource amount setting unit 116 sets the resource amounts based on the user operation conditions and the appropriate resource ratio. That is, the resource amount setting unit 116 determines the resource amounts by keeping the resource ratio as much as possible while satisfying the cost condition specified by the user.
  • Specifically, the resource amount setting unit 116 maintains the appropriate resource ratio and sets the maximum resource amounts that satisfy the user operation conditions. As a concrete example, suppose one CPU VM costs 1,000 yen/month, a GPU instance costs 4,000 yen/month, the appropriate resource ratio is 2:1, and the user's budget is 10,000 yen or less per month. In this case, 2 CPU VMs and 1 GPU instance are secured and placed in the commercial environment.
  • If even the minimum resource amounts that keep the resource ratio do not satisfy the cost condition, the resource amount setting unit 116 breaks the resource ratio and sets the resource amounts of the CPU and the offload device to smaller values that satisfy the cost condition. For example, when the user's budget is insufficient to keep the resource ratio, 1 is secured for the CPU and 1 for the GPU.
  • After the processing of step S53 is completed and the resources are secured and placed in the commercial environment, the automatic verification described with reference to FIG. 2 is executed in order to confirm performance and cost before the user uses them. This makes it possible to secure resources in the commercial environment and to present performance and cost to the user after automatic verification.
  • For implementation, the performance measurement results obtained when determining the solution offload pattern are used.
  • The implementation determines the resource ratio from the processing times of the test case so that the processing times of the CPU and the GPU are of the same order. For example, when the processing times of the test case are 10 seconds for CPU processing and 5 seconds for GPU processing, doubling the CPU-side resources makes the processing times comparable, so the resource ratio is 2:1. Since the numbers of virtual machines and the like are integers, the resource ratio is rounded to an integer ratio when calculated from the processing times.
  • the next step is to set the amount of resources when deploying the application in the commercial environment.
  • At this time, the resource ratio is kept as much as possible, and the numbers of VMs and the like are determined so as to satisfy the cost requirement specified by the user at the time of the offload request. Specifically, the maximum numbers of VMs and the like that keep the resource ratio within the cost range are selected.
  • For example, if the resource ratio is 2:1 and the user's budget is 10,000 yen or less per month, 2 CPU units and 1 GPU unit are secured. If the resource ratio cannot be kept within the cost range, the resource amounts are set starting from one CPU unit and one GPU unit so as to be as close to the appropriate resource ratio as possible. For example, if the budget is less than 5,000 yen a month, the resource ratio cannot be kept, but 1 CPU unit and 1 GPU unit are secured. A minimal sketch of this setting follows.
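  • The following C++ sketch illustrates this budget-constrained resource amount setting under the prices and fallback rule of the example above; the function name is illustrative:

      #include <utility>

      // Choose CPU/GPU instance counts: keep the ratio (cpuR:gpuR) and maximize
      // resources within the budget; if even the minimum ratio-keeping set does
      // not fit, fall back to 1 CPU unit and 1 GPU unit (per the example in the
      // text, even when this is at or slightly above the stated budget).
      std::pair<int, int> setResourceAmounts(int cpuR, int gpuR,
                                             int cpuPrice, int gpuPrice, int budget) {
          int unitCost = cpuR * cpuPrice + gpuR * gpuPrice;  // cost of one ratio unit
          int k = budget / unitCost;                          // max multiples within budget
          if (k >= 1) return { k * cpuR, k * gpuR };          // ratio kept, maximized
          return { 1, 1 };                                    // ratio broken: minimum amounts
      }

      // Example: ratio 2:1, CPU 1,000 yen/month, GPU 4,000 yen/month.
      // budget 10,000 -> (2, 1) for 6,000 yen; budget below 5,000 -> (1, 1).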
  • Note that in this embodiment, the virtualization function of Xen Server is used to allocate the CPU and GPU resources.
  • As described above, the present invention is characterized by the setting of the resource ratio and the resource amounts in automatic offloading.
  • the second embodiment describes the setting of the resource ratio and the resource amount in the functional block offload.
  • the functional block offload will be described.
  • the overall configuration and operation of the functional block offload will be described with reference to FIGS. 6 to 8.
  • FIG. 6 is a functional block diagram showing a configuration example of the offload server 200 according to the second embodiment of the present invention.
  • the same components as those in FIG. 1 are designated by the same reference numerals, and the description of overlapping portions will be omitted.
  • the offload server 200 is a device that automatically offloads the specific processing of the application program to the accelerator.
  • the offload server 200 includes a control unit 210, an input / output unit 12, a storage unit 130, and a verification machine 14 (accelerator verification device).
  • The input/output unit 12 consists of a communication interface for transmitting and receiving information to and from each device, and input/output interfaces for exchanging information with input devices such as a touch panel and keyboard and output devices such as a monitor.
  • The storage unit 130 is composed of a hard disk, flash memory, RAM, or the like, and temporarily stores a program (offload program) for executing each function of the control unit 210 and information necessary for the processing of the control unit 210 (for example, an intermediate language file 133).
  • The storage unit 130 includes a code pattern DB (Code pattern database) 230 (described later), a test case DB 131, an equipment resource DB 132, and an intermediate language file 133.
  • Performance test items are stored in the test case DB 131.
  • The test case DB 131 stores information for performing tests such as measuring the performance of the accelerated application; for example, in the case of a deep learning application for image analysis processing, it stores sample images and the test items that execute them.
  • the verification machine 14 includes a CPU (Central Processing Unit), a GPU, and an FPGA as a verification environment for environment-adaptive software.
  • The code pattern DB 230 stores libraries and IP cores (described later) that can be offloaded to a GPU, FPGA, or the like. That is, for the purpose of <Process B-1> described later, the code pattern DB 230 holds, for specific libraries, GPU libraries for speeding up functional blocks, FPGA IP cores, and related information.
  • For example, the code pattern DB 230 holds a library list (a list of external libraries) for arithmetic computations such as FFT.
  • The code pattern DB 230 stores, for example, a CUDA library as a GPU library, together with the usage procedure for that CUDA library. That is, in <Process C-1> described later, when the replacement library or IP core is implemented on the GPU or FPGA and connected to the host-side (CPU) program, the registered usage procedure is followed. For example, for a CUDA library, the procedure for using it from C language code is published together with the library, so that usage procedure is also registered in the code pattern DB 230. An illustrative sketch follows.
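  • As an illustration of replacing a host-side FFT call following a published CUDA library usage procedure (a minimal sketch using cuFFT as one such library; the data layout and in-place transform are assumptions of the example):

      #include <cufft.h>
      #include <cuda_runtime.h>

      /* Replace a CPU FFT call with the cuFFT library, following its published
         usage procedure: allocate device memory, copy in, plan, execute, copy out. */
      void fftOnGpu(cufftComplex *host, int n) {
          cufftComplex *dev;
          cudaMalloc((void **)&dev, sizeof(cufftComplex) * n);
          cudaMemcpy(dev, host, sizeof(cufftComplex) * n, cudaMemcpyHostToDevice);

          cufftHandle plan;
          cufftPlan1d(&plan, n, CUFFT_C2C, 1);          /* 1-D complex-to-complex plan */
          cufftExecC2C(plan, dev, dev, CUFFT_FORWARD);  /* in-place forward FFT */

          cudaMemcpy(host, dev, sizeof(cufftComplex) * n, cudaMemcpyDeviceToHost);
          cufftDestroy(plan);
          cudaFree(dev);
      }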
  • The code pattern DB 230 also stores classes and structures of processing that have equivalent descriptions when computed on the host. That is, in <Process B-2> described later, in order to detect functional processing other than registered library calls, classes, structures, and the like are detected from the definition descriptions of the source code by parsing.
  • In other words, for the purpose of <Process B-2> described later, the code pattern DB 230 registers classes and structures of processing that have equivalent descriptions when computed on the host. A similarity detection tool (described later) detects that there is a library or IP core that speeds up the functional processing of such a class or structure.
  • the code pattern DB 230 stores the OpenCL code as information related to the IP core.
  • The connection between the CPU and the FPGA using the OpenCL interface, and the implementation of the IP core in the FPGA, can be performed from the OpenCL code via the high-level synthesis tools of FPGA vendors such as Xilinx and Intel (described later).
  • The control unit 210 is an automatic offload function unit that controls the entire offload server 200, and is realized by a CPU (not shown) expanding the program (offload program) stored in the storage unit 130 into RAM and executing it.
  • The control unit 210 detects, in existing program code for the CPU, functional blocks whose processing can be sped up by offloading to an FPGA or GPU, and performs functional block offload processing that achieves speedup by replacing the detected functional blocks with a GPU library, an FPGA IP core, or the like.
  • The control unit 210 includes an application code specification unit (Specify application code) 111, an application code analysis unit (Analyze application code) 112, a replacement function detection unit 213, a replacement processing unit 214, a resource ratio determination unit 115, a resource amount setting unit 116, an offload pattern creation unit 215, a performance measurement unit 118, an executable file creation unit 119, a production environment placement unit (Deploy final binary files to production environment) 120, a performance measurement test extraction execution unit (Extract performance test cases and run automatically) 121, and a user providing unit (Provide price and performance to a user to judge) 122.
  • The application code specification unit 111 specifies the input application code. Specifically, it passes the application code described in the received file to the application code analysis unit 112.
  • The application code analysis unit 112 analyzes the source code of the application program and detects calls to external libraries included in the source code. Specifically, the application code analysis unit 112 uses a syntax analysis tool such as Clang to analyze the library calls included in the code and the source code of the functional processing, together with the loop statement structures and the like.
  • the application code analysis unit 112 detects the code of the class or the structure from the source code in the following ⁇ Process A-2>.
  • In <Process B-1> described later, the replacement function detection unit 213 acquires the GPU library and the IP core from the code pattern DB 230 using the detected call as a key. Specifically, the replacement function detection unit 213 extracts offloadable processing that can be offloaded to the GPU or FPGA by collating the detected library call with the code pattern DB 230 using the library name as a key.
  • the code pattern DB 230 stores, for example, a CUDA library and a library usage procedure for using the CUDA library as the GPU library. Then, the replacement function detection unit 213 acquires the CUDA library from the code pattern DB 230 based on the library usage procedure.
  • In <Process B-2> described later, the replacement function detection unit 213 acquires the GPU library and IP core from the code pattern DB 230 using the definition description code of the detected class or structure (described later) as a key. Specifically, using a similarity detection tool that detects copied code and definition description code changed after copying, the replacement function detection unit 213 collates the classes and structures contained in the replacement source code against the code pattern DB 230, and extracts offloadable GPU libraries and FPGA IP cores that are managed in the DB in association with similar classes or structures.
  • The replacement processing unit 214 replaces the replacement-source processing descriptions in the source code of the application program with the processing descriptions of the replacement-destination libraries and IP cores acquired by the replacement function detection unit 213. Specifically, the replacement processing unit 214 replaces the extracted offloadable processing with a library for the GPU, an IP core for the FPGA, or the like, and offloads the replaced library/IP core processing descriptions to the GPU, FPGA, or the like as functional blocks to be offloaded by creating an interface with the CPU program. The replacement processing unit 214 outputs an intermediate language file 133 such as CUDA or OpenCL.
  • When replacing the replacement-source processing description with the processing description of the acquired library or IP core, the replacement processing unit 214 notifies the user for confirmation if the number or type of arguments or return values differs between the replacement source and the replacement destination.
  • The offload pattern creation unit 215 creates one or more offload patterns. Specifically, it creates an interface with the host program and, through performance measurement in the verification environment, tries offloading and not offloading, thereby extracting the offload patterns that become faster.
  • Since the code pattern DB 230 stores OpenCL code as information related to the IP core, the offload pattern creation unit 215 connects the host and the PLD using the OpenCL interface, and implements the IP core in the PLD, on the basis of that OpenCL code.
  • A kernel written according to the OpenCL C language syntax is executed on a device (for example, an FPGA) by a host-side (for example, CPU) program created using the OpenCL C language runtime API.
  • The part that calls the kernel function hello() from the host side calls clEnqueueTask(), one of the OpenCL runtime APIs.
  • The basic flow of OpenCL initialization, execution, and termination described in the host code consists of steps 1 to 13. Of these, steps 1 to 10 are the preparations up to calling the kernel function hello() from the host side, and step 11 is the kernel execution. The main steps are outlined below; a consolidated code sketch follows the list.
  • Command queue creation: use the function clCreateCommandQueue(), which provides the command-queue creation function defined in the OpenCL runtime API, to create a command queue ready to control the device. The host issues actions to the device (kernel execution commands and memory copy commands between the host and the device) through the command queue.
  • Memory object creation: use the function clCreateBuffer(), which provides the device-memory allocation function defined in the OpenCL runtime API, to create a memory object that the host side can refer to.
  • Kernel file reading: execution of a kernel on the device is controlled by the program on the host side, so the host program must first load the kernel program. Kernel programs include binary data created by an OpenCL compiler and source code written in the OpenCL C language. The kernel file is read (description omitted); the OpenCL runtime API is not used when reading the kernel file.
  • Program object creation: OpenCL recognizes a kernel program as a program object; this step creates the program object.
  • Kernel object creation: create a kernel object using the function clCreateKernel(), which provides the kernel-object creation function defined in the OpenCL runtime API. Since one kernel object corresponds to one kernel function, the name of the kernel function (hello) is specified when creating it. When multiple kernel functions are described in one program object, clCreateKernel() is called multiple times, one call per kernel function.
  • Kernel argument setting: set the kernel arguments (pass values to the arguments of the kernel function) using the function clSetKernelArg(), which provides the argument-setting function defined in the OpenCL runtime API.
  • Kernel execution: kernel execution works on the device, so it is enqueued to the command queue. The function clEnqueueTask(), which provides the kernel execution function defined in the OpenCL runtime API, is used to queue the command that executes the kernel hello on the device. After the command has been queued, it is executed on an available arithmetic unit on the device.
  • Reading from a memory object: use the function clEnqueueReadBuffer(), which provides the function defined in the OpenCL runtime API for copying data from device-side memory to host-side memory, to copy data from the memory area on the device side to the memory area on the host side.
  • Writing to a memory object: use the function clEnqueueWriteBuffer(), which provides the function of copying data from the host side to device-side memory, to copy data from the memory area on the host side to the memory area on the device side. Since these functions work on the device, the copy command is first queued to the command queue, and then the data copy starts.
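  • A minimal host-side sketch, in C, of the flow above for the hello() kernel is shown below. It is an illustrative sketch rather than this disclosure's own code: platform, device, and context setup (abbreviated in the list above) and error handling are reduced to a minimum, the kernel is assumed to live in a file hello.cl, and clCreateProgramWithSource()/clBuildProgram() are assumed for the program-object step, which the list does not name explicitly.

/* Minimal OpenCL host-side sketch of the steps above (illustrative only).
   Assumes a kernel file hello.cl containing:
   __kernel void hello(__global int *out) { out[0] = 42; } */
#define CL_TARGET_OPENCL_VERSION 120  /* target OpenCL 1.2, where clEnqueueTask() is current */
#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

int main(void) {
    /* Platform, device, and context setup (abbreviated in the list above) */
    cl_platform_id platform; cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);

    /* Command queue creation: clCreateCommandQueue() */
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, NULL);

    /* Memory object creation: clCreateBuffer() */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof(int), NULL, NULL);

    /* Kernel file reading (no OpenCL runtime API is used here) */
    FILE *fp = fopen("hello.cl", "rb");
    if (!fp) { perror("hello.cl"); return 1; }
    fseek(fp, 0, SEEK_END); long len = ftell(fp); rewind(fp);
    char *src = malloc((size_t)len + 1);
    fread(src, 1, (size_t)len, fp); src[len] = '\0'; fclose(fp);

    /* Program object creation and build */
    const char *srcs[1] = { src };
    cl_program prog = clCreateProgramWithSource(ctx, 1, srcs, NULL, NULL);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);

    /* Kernel object creation: one object per kernel function ("hello") */
    cl_kernel kern = clCreateKernel(prog, "hello", NULL);

    /* Kernel argument setting: clSetKernelArg() */
    clSetKernelArg(kern, 0, sizeof(cl_mem), &buf);

    /* Kernel execution: queue the execution command with clEnqueueTask() */
    clEnqueueTask(queue, kern, 0, NULL, NULL);

    /* Reading from a memory object: clEnqueueReadBuffer() (blocking read) */
    int result = 0;
    clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, sizeof(int), &result, 0, NULL, NULL);
    printf("result = %d\n", result);

    /* Termination: release objects */
    clReleaseKernel(kern); clReleaseProgram(prog); clReleaseMemObject(buf);
    clReleaseCommandQueue(queue); clReleaseContext(ctx);
    free(src);
    return 0;
}

  • With an OpenCL SDK installed, such a sketch would be built with, for example, cc host.c -lOpenCL.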
  • The performance measurement unit 118 compiles the application program of the created processing pattern, places it on the verification machine 14, and executes performance measurement processing for the case where the program is offloaded to the GPU, FPGA, or the like.
  • the performance measurement unit 118 includes a binary file arrangement unit (Deploy binary files) 118a.
  • the binary file placement unit 118a deploys (places) a binary file derived from an intermediate language on a verification machine 14 equipped with a GPU or FPGA.
  • The performance measurement unit 118 executes the placed binary file, measures the performance at the time of offloading, and returns the performance measurement result to the binary file placement unit 118a. The performance measurement unit 118 then tries measurement with another extracted processing pattern, based on the extracted intermediate language.
  • the offload pattern creation unit 215 creates a processing pattern that offloads functional blocks that can be offloaded to the GPU or FPGA, and the executable file creation unit 119 compiles the intermediate language of the created processing pattern.
  • the performance measurement unit 118 measures the performance of the compiled program (“first performance measurement”).
  • the offload pattern creation unit 215 lists the processing patterns whose performance is higher than that of the CPU in the performance measurement.
  • the offload pattern creation unit 215 creates a new processing pattern for offloading by combining the processing patterns of the list.
  • the offload pattern creation unit 215 creates the combined offload processing pattern and the intermediate language, and the executable file creation unit 119 compiles the intermediate language.
  • the performance measurement unit 118 measures the performance of the compiled program (“second performance measurement”).
  • The executable file creation unit 119 compiles the intermediate language of the processing pattern to be offloaded and creates an executable file. Based on the performance measurement results repeated a certain number of times, the processing pattern with the highest processing performance is selected from the one or more processing patterns, the selected pattern is compiled, and the final executable file is created. The control flow of this selection is sketched below.
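  • The following C sketch illustrates only the control flow of this two-stage selection, under stated assumptions: measure() returns synthetic times here, whereas in the offload server it stands for compiling, deploying, and timing a pattern on the verification machine 14; the number of functional blocks is likewise an arbitrary assumption.

/* Illustrative control flow of the two-stage pattern selection (a sketch,
   not this disclosure's implementation): measure each single-block offload
   pattern, list those faster than the CPU, then measure their combination
   and pick the fastest overall. */
#include <stdio.h>

#define NBLOCKS 3                 /* offloadable functional blocks (assumed) */

static double measure(unsigned mask) {
    /* Stand-in timing: bit b of mask = functional block b offloaded. */
    static const double t[8] = {10.0, 7.0, 9.5, 6.8, 11.0, 8.2, 10.4, 5.9};
    return t[mask & 7u];
}

int main(void) {
    double cpu_time = measure(0);          /* CPU-only baseline */
    double best_time = cpu_time;
    unsigned best = 0, faster = 0;

    /* First performance measurement: one pattern per functional block. */
    for (unsigned b = 0; b < NBLOCKS; b++) {
        unsigned mask = 1u << b;
        double t = measure(mask);
        if (t < cpu_time) faster |= mask;  /* list patterns beating the CPU */
        if (t < best_time) { best_time = t; best = mask; }
    }

    /* Second performance measurement: combined offload pattern. */
    if (faster) {
        double t = measure(faster);
        if (t < best_time) { best_time = t; best = faster; }
    }

    printf("selected pattern mask=%u, time=%.1f\n", best, best_time);
    return 0;
}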
  • The IP core (Intellectual Property core) is partial circuit information for constituting a semiconductor such as an FPGA, IC, or LSI, organized particularly in functional units.
  • Typical functional examples of IP cores are encryption/decryption processing, arithmetic operations such as the FFT (Fast Fourier Transform), image processing, and voice processing. Many IP cores require a license fee, but some are provided free of charge.
  • the IP core is used for automatic offload.
  • For the GPU, although the term IP core is not used, FFT, linear algebra, and the like are typical functional examples, and libraries such as cuFFT and cuBLAS, implemented using CUDA, are provided free of charge as GPU libraries. In the second embodiment, these libraries are utilized for the GPU, as the sketch below illustrates.
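  • As a hedged illustration of such a GPU library, the following C sketch calls cuFFT for a 2D FFT; cufftPlan2d(), cufftExecC2C(), and the CUDA memory calls are the public cuFFT/CUDA runtime APIs, while the 256x256 transform size and the constant input data are arbitrary assumptions, not values from this disclosure.

/* Illustrative 2D FFT via the freely provided cuFFT library (host-side C). */
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cufft.h>

#define NX 256
#define NY 256

int main(void) {
    size_t bytes = sizeof(cufftComplex) * NX * NY;
    cufftComplex *host = (cufftComplex *)malloc(bytes);
    for (int i = 0; i < NX * NY; i++) { host[i].x = 1.0f; host[i].y = 0.0f; }

    /* Copy the input data from the CPU to the GPU */
    cufftComplex *dev;
    cudaMalloc((void **)&dev, bytes);
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);

    /* Plan and execute a complex-to-complex 2D FFT on the GPU */
    cufftHandle plan;
    cufftPlan2d(&plan, NX, NY, CUFFT_C2C);
    cufftExecC2C(plan, dev, dev, CUFFT_FORWARD);

    /* Copy the result back to the CPU */
    cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);
    printf("out[0] = (%f, %f)\n", host[0].x, host[0].y);

    cufftDestroy(plan);
    cudaFree(dev);
    free(host);
    return 0;
}

  • Such a sketch would be built with, for example, nvcc fft2d.cu -lcufft; substituting a CPU-side FFT call with this kind of GPU library call is the kind of replacement discussed next.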
  • When the existing program code created for the CPU includes a functional block, such as FFT processing, that can be accelerated by offloading to the GPU or FPGA, it is replaced with a library for the GPU or an IP core for the FPGA to achieve the speed-up.
  • Offloading of functional blocks has three elements: [Process A] detection of functional blocks; [Process B] detection of whether existing libraries/IP cores for offloading exist for the functional blocks; and [Process C] matching of the interface with the host side.
  • [Process A] (detection of functional blocks) has <Process A-1>, which detects a function call of an existing library and treats it as a functional block, and <Process A-2>, which detects a class or structure and treats it as a functional block. <Process A-2> targets functional processing that is not detected as a functional block in <Process A-1>.
  • In <Process A-1>, the application code analysis unit 112 detects, using syntax analysis, that a function call to an external library is performed in the source code.
  • the details are as follows.
  • the code pattern DB 230 holds a library list for arithmetic calculations such as FFT.
  • The application code analysis unit 112 parses the source code, collates it against the library list held in the code pattern DB 230, and detects that a function call to an external library is being performed; a sketch of such detection follows.
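  • A sketch of this kind of detection using libclang, the C API of the Clang parser, is shown below. It is an assumption-laden illustration, not the offload server's implementation: the source file name app.c and the FFTW-style library list are invented for the example.

/* Illustrative external-library-call detection with libclang.
   Build with, for example: cc detect.c -lclang */
#include <stdio.h>
#include <string.h>
#include <clang-c/Index.h>

static const char *lib_list[] = { "fftw_execute", "fftw_plan_dft_2d", NULL };

static enum CXChildVisitResult visit(CXCursor c, CXCursor parent, CXClientData data) {
    if (clang_getCursorKind(c) == CXCursor_CallExpr) {
        CXString name = clang_getCursorSpelling(c);
        const char *s = clang_getCString(name);
        /* Collate the called function name against the library list */
        for (int i = 0; lib_list[i]; i++)
            if (strcmp(s, lib_list[i]) == 0)
                printf("external library call detected: %s\n", s);
        clang_disposeString(name);
    }
    return CXChildVisit_Recurse;  /* keep walking the syntax tree */
}

int main(void) {
    CXIndex idx = clang_createIndex(0, 0);
    CXTranslationUnit tu = clang_parseTranslationUnit(
        idx, "app.c", NULL, 0, NULL, 0, CXTranslationUnit_None);
    if (!tu) { fprintf(stderr, "parse failed\n"); return 1; }
    clang_visitChildren(clang_getTranslationUnitCursor(tu), visit, NULL);
    clang_disposeTranslationUnit(tu);
    clang_disposeIndex(idx);
    return 0;
}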
  • In <Process A-2>, in order to detect as functional blocks the functional processing other than registered library calls, the application code analysis unit 112 detects the functional processing of classes or structures from the definition descriptions of the source code using syntax analysis.
  • For example, the application code analysis unit 112 detects a structure that groups several variables defined with a C-language struct, and a class, whose instantiated objects are of a reference type whereas those of a structure are of a value type. It also detects, for example, classes used as an alternative to structures in Java (registered trademark). An illustrative definition description follows.
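  • For instance, a definition description of the kind <Process A-2> looks for might resemble the following C structure grouping the variables of a 2D FFT computation; all names here are invented for illustration and are not code from this disclosure.

/* Illustrative definition description that <Process A-2> would detect:
   a C struct grouping the variables of a 2D FFT computation. */
#include <stdio.h>

struct fft2d_ctx {
    int nx, ny;          /* transform dimensions */
    float *re, *im;      /* real and imaginary parts of the data */
};

/* Functional processing tied to the structure; a similarity detection
   tool would match such definitions against registered classes and
   structures in the code pattern DB. */
static void fft2d_describe(const struct fft2d_ctx *ctx) {
    printf("2D FFT of size %d x %d\n", ctx->nx, ctx->ny);
}

int main(void) {
    struct fft2d_ctx ctx = { 256, 256, NULL, NULL };
    fft2d_describe(&ctx);
    return 0;
}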
  • [Process B] has <Process B-1>, which receives <Process A-1>, refers to the code pattern DB 230, and acquires a replaceable GPU library or IP core, and <Process B-2>, which receives <Process A-2> and replaces the replacement-source processing description of the application code with the replacement-destination GPU library or IP core processing description. That is, <Process B-1> acquires the replaceable GPU library or IP core from the code pattern DB 230 using the library name as a key, while <Process B-2> detects the replaceable GPU library or IP core using the code of the class, structure, or the like as a key, and replaces the replacement-source processing description of the application code with the replacement-destination GPU library/IP core processing description.
  • The code pattern DB 230 holds specific libraries, GPU libraries for speeding up functional blocks, FPGA IP cores, and related information. In the code pattern DB 230, the code and the executable file are registered together with the function name for the replacement-source library or functional block.
  • In <Process B-1>, for the library call detected by the application code analysis unit 112 in <Process A-1>, the replacement function detection unit 213 searches the code pattern DB 230 using the library name as a key, and acquires a replaceable GPU library (a GPU library capable of speed-up) and an FPGA IP core.
  • For example, using the external library name as a key, the replacement function detection unit 213 detects the OpenCL code (host program, kernel program, etc.) as FPGA processing for a 2D FFT. The OpenCL code and the GPU library are stored in the code pattern DB 230.
  • In <Process B-2>, the replacement function detection unit 213 searches the code pattern DB 230 using the code of the class, structure, or the like detected by the application code analysis unit 112 in <Process A-2> as a key, and uses a similarity detection tool to acquire a replaceable GPU library (a GPU library capable of speed-up) and an FPGA IP core.
  • The similarity detection tool is a tool, such as Deckard, that targets the detection of copied code and of code changed after copying. Using it, the replacement function detection unit 213 can detect processing such as matrix calculation code whose description is the same when calculated by the CPU, and processing parts in which another person's code has been copied and changed. Newly and independently created classes and the like are out of scope, because they are difficult for the similarity detection tool to detect.
  • The replacement function detection unit 213 searches for similar classes or structures registered in the code pattern DB 230 by applying a similarity detection tool such as Deckard to the classes or structures detected in the replacement-source CPU code.
  • For example, when the processing of the replacement source is a 2D FFT class, the class registered in the code pattern DB 230 as a similar class is detected as a 2D FFT class. An IP core or GPU library capable of offloading the 2D FFT is registered in the code pattern DB 230; therefore, as in <Process B-1>, the OpenCL code (host program, kernel program, etc.) and the GPU library are detected for the 2D FFT.
  • [Process C] (matching of the interface with the host side) has <Process C-1> and <Process C-2>.
  • <Process C-1> receives <Process B-1>, replaces the replacement-source processing description of the application code with the replacement-destination GPU library/IP core processing description, and describes the interface processing for calling the GPU library or IP core. <Process C-2> receives <Process B-2> and likewise replaces the replacement-source processing description with the replacement-destination GPU library/IP core processing description, and also describes the interface processing for calling the GPU library or IP core. Describing the interface processing for calling the GPU library or IP core corresponds to "matching the interface with the host side".
  • In <Process C-1>, the replacement processing unit 214 replaces the replacement-source processing description of the application code with the replacement-destination GPU library/IP core processing description. The replacement processing unit 214 then describes the interface processing (OpenCL API, etc.) for calling the GPU library or IP core, and compiles the created pattern.
  • In <Process C-1>, the replacement function detection unit 213 has already searched, in <Process B-1>, for the library or IP core corresponding to the library call detected in <Process A-1>. The replacement processing unit 214 therefore mounts the replacement library or IP core on the GPU or FPGA and performs the interface processing for connecting it to the host-side (CPU) program.
  • For the GPU, the replacement processing unit 214 rewrites the replacement-source processing description of the application code for the replacement-destination GPU according to the library usage procedure registered in the code pattern DB 230; for example, a predetermined description such as calling a function used in the GPU library is made.
  • the replacement processing unit 214 can perform interface processing with the FPGA via a high-level synthesis tool (for example, Xilinx Vivado, Intel HLS Compiler, etc.).
  • the replacement processing unit 214 connects the CPU and the FPGA using the OpenCL interface from the OpenCL code, for example, via the high-level synthesis tool.
  • the replacement processing unit 214 implements the IP core in the FPGA via a high-level synthesis tool of the FPGA vendor such as Xilinx or Intel.
  • In <Process C-2>, the replacement processing unit 214 replaces the replacement-source processing description of the application code with the replacement-destination GPU library/IP core processing description. Then, when the number or type of arguments or return values differs between the replacement source and the replacement destination, the replacement processing unit 214 confirms with the user, describes the interface processing (OpenCL API, etc.) for calling the GPU library or IP core, and compiles the created pattern. That is, in <Process C-2>, for the classes, structures, and the like detected in <Process A-2>, libraries and IP cores capable of speed-up have been searched for in <Process B-2>; the replacement processing unit 214 therefore mounts the corresponding library or IP core on the GPU or FPGA.
  • ⁇ Process C-2> will be described in more detail.
  • In <Process C-1>, a library or IP core that speeds up a specific library call is used, so an interface part must be generated, but the number and type of the arguments and return values assumed by the GPU or FPGA and by the host-side program are consistent.
  • In contrast, because <Process B-2> judges by similarity, there is no guarantee that basic parts such as the number and type of arguments and return values are correct. Libraries and IP cores embody existing know-how, and they cannot readily be changed even when the number and type of arguments and return values do not match. Therefore, the user requesting offloading is asked whether the number and type of arguments and return values of the original code may be changed to match the library or IP core; the offload performance test is tried after confirmation and approval.
  • For example, when arguments 1 and 2 are required and argument 3 is optional in the CPU program, and the library or IP core requires arguments 1 and 2, there is no problem even if the optional argument 3 is omitted. In such a case, the optional argument may be treated automatically as absent when creating the processing pattern, without confirming with the user. If the number and type of arguments and return values match perfectly, the same process as <Process C-1> may be used. The sketch below illustrates the optional-argument case.
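  • The following C sketch is a hedged illustration of this interface matching only: fft2d_cpu(), lib_fft2d(), and the window argument are invented names, not code from this disclosure, and the library body is a stand-in for the offloaded processing.

/* Hedged illustration of interface matching in <Process C-2>.
   All names here are hypothetical. */
#include <stddef.h>
#include <stdio.h>

/* Hypothetical accelerated library entry point: arguments 1 and 2 only. */
static int lib_fft2d(const float *in, float *out) {
    out[0] = in[0];  /* stand-in for the offloaded 2D FFT */
    return 0;
}

/* Replacement-source CPU signature: argument 3 (window) is optional. */
int fft2d_cpu(const float *in, float *out, const float *window) {
    if (window != NULL) {
        /* No matching library argument exists: this case would require
           user confirmation. Here the optional argument is treated as
           absent automatically when the processing pattern is created. */
    }
    return lib_fft2d(in, out);  /* required arguments 1 and 2 match */
}

int main(void) {
    float in[4] = {1, 2, 3, 4}, out[4];
    fft2d_cpu(in, out, NULL);   /* optional argument omitted */
    printf("out[0] = %f\n", out[0]);
    return 0;
}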
  • The following flow shows the processing of the control unit (automatic offload function unit) 210 of the offload server 200 when <Process A-1>, <Process B-1>, and <Process C-1> are executed in the offload processing of functional blocks.
  • In step S200, the control unit 210 (see FIG. 6) receives the application code and operating conditions.
  • In step S201, the application code analysis unit 112 (see FIG. 6) analyzes the source code of the application program to be offloaded. Specifically, the application code analysis unit 112 uses a syntax analysis tool such as Clang to analyze the source code, detecting the library calls and functional processing included in the code together with the loop statement structure and the like.
  • In step S202, the replacement function detection unit 213 (see FIG. 6) detects the external library calls of the application program.
  • In step S203, the replacement function detection unit 213 acquires the replaceable GPU library and FPGA IP core from the code pattern DB 230 using the library name as a key. Specifically, the replacement function detection unit 213 collates the grasped external library call with the code pattern DB 230, and acquires the detected replaceable GPU library/IP core as a functional block that can be offloaded to the GPU or FPGA.
  • In step S204, the replacement processing unit 214 replaces the replacement-source processing description of the application source code with the processing description of the replacement-destination GPU library or FPGA IP core.
  • In step S205, the replacement processing unit 214 offloads the replaced GPU library or FPGA IP core processing description to the GPU or FPGA as a functional block to be offloaded.
  • In step S206, the replacement processing unit 214 describes the interface processing for calling the GPU library or FPGA IP core.
  • In step S207, the executable file creation unit 119 compiles or interprets the created pattern.
  • In step S208, the performance measurement unit 118 measures the performance of the created pattern in the verification environment ("first performance measurement").
  • In step S209, the executable file creation unit 119 creates combination patterns from the patterns that achieved speed-up in the first measurement.
  • In step S210, the executable file creation unit 119 compiles or interprets the created combination patterns.
  • In step S211, the performance measurement unit 118 measures the performance of the created combination patterns in the verification environment ("second performance measurement").
  • In step S212, the production environment placement unit 120 selects the highest-performance pattern from the first and second measurements and ends the processing of this flow.
  • The next flow shows the processing of the control unit (automatic offload function unit) 210 of the offload server 200 when <Process A-2>, <Process B-2>, and <Process C-2> are executed in the offload processing of functional blocks. The processing from <Process A-2> may be performed in parallel with the processing from <Process A-1>.
  • In step S300, the control unit 210 (see FIG. 6) receives the application code and operating conditions.
  • In step S301, the application code analysis unit 112 (see FIG. 6) analyzes the source code of the application to be offloaded. Specifically, the application code analysis unit 112 uses a syntax analysis tool such as Clang to analyze the source code, detecting the library calls and functional processing included in the code together with the loop statement structure and the like.
  • In step S302, the replacement function detection unit 213 (see FIG. 6) detects the definition description code of classes or structures from the source code.
  • In step S303, the replacement function detection unit 213 acquires the replaceable GPU library and FPGA IP core from the code pattern DB 230 by using the similarity detection tool, with the definition description code of the class or structure as a key.
  • In step S304, the replacement processing unit 214 replaces the replacement-source processing description of the application source code with the processing description of the replacement-destination GPU library or FPGA IP core.
  • In step S305, the replacement processing unit 214 confirms with the user when the number or type of arguments or return values differs between the replacement source and the replacement destination.
  • In step S306, the replacement function detection unit 213 offloads the replaced GPU library or FPGA IP core processing description to the GPU or FPGA as a functional block to be offloaded.
  • In step S307, the replacement processing unit 214 describes the interface processing for calling the GPU library or FPGA IP core.
  • In step S308, the executable file creation unit 119 compiles or interprets the created pattern.
  • In step S309, the performance measurement unit 118 measures the performance of the created pattern in the verification environment ("first performance measurement").
  • In step S310, the executable file creation unit 119 creates combination patterns from the patterns that achieved speed-up in the first measurement.
  • In step S311, the executable file creation unit 119 compiles or interprets the created combination patterns.
  • In step S312, the performance measurement unit 118 measures the performance of the created combination patterns in the verification environment ("second performance measurement").
  • In step S313, the production environment placement unit 120 selects the highest-performance pattern from the first and second measurements and ends the processing of this flow.
  • The offload servers according to the first and second embodiments are realized by, for example, a computer 900, which is a physical device having the configuration shown in FIG. 9.
  • FIG. 9 is a hardware configuration diagram showing an example of a computer that realizes the functions of the offload servers 1,200.
  • The computer 900 has a CPU (Central Processing Unit) 901, a ROM (Read Only Memory) 902, a RAM 903, an HDD (Hard Disk Drive) 904, an input/output I/F (Interface) 905, a communication I/F 906, and a media I/F 907.
  • The CPU 901 operates based on programs stored in the ROM 902 or the HDD 904, and controls each processing unit of the offload servers 1 and 200 shown in FIGS. 1 and 6.
  • the ROM 902 stores a boot program executed by the CPU 901 when the computer 900 is started, a program related to the hardware of the computer 900, and the like.
  • The CPU 901 controls an input device 910 such as a mouse and a keyboard, and an output device 911 such as a display, via the input/output I/F 905.
  • The CPU 901 acquires data from the input device 910 via the input/output I/F 905 and outputs generated data to the output device 911.
  • the HDD 904 stores a program executed by the CPU 901, data used by the program, and the like.
  • The communication I/F 906 receives data from other devices via a communication network (for example, NW (Network) 920) and outputs it to the CPU 901, and transmits data generated by the CPU 901 to other devices via the communication network.
  • The media I/F 907 reads a program or data stored in a recording medium 912 and outputs it to the CPU 901 via the RAM 903.
  • the CPU 901 loads the program related to the target processing from the recording medium 912 onto the RAM 903 via the media I / F 907, and executes the loaded program.
  • The recording medium 912 is, for example, an optical recording medium such as a DVD (Digital Versatile Disc) or PD (Phase change rewritable Disk), a magneto-optical recording medium such as an MO (Magneto Optical disk), a magnetic recording medium, a conductive memory tape medium, a semiconductor memory, or the like.
  • For example, when the computer 900 functions as the offload servers 1 and 200 according to the embodiments, the CPU 901 of the computer 900 realizes their functions by executing the program loaded on the RAM 903.
  • the data in the RAM 903 is stored in the HDD 904.
  • The CPU 901 reads the program related to the target processing from the recording medium 912 and executes it. Alternatively, the CPU 901 may read the program related to the target processing from another device via the communication network (NW 920).
  • As described above, the offload server 1 (see FIG. 1) according to the first embodiment is an offload server that offloads specific processing of an application program from the CPU to an offload device. It includes an application code analysis unit 112 that analyzes the source code of the application program; a data transfer specification unit 113 that analyzes the reference relationships of the variables used in the loop statements of the application program and, for data that may be transferred outside a loop, specifies the data transfer using an explicit specification line that explicitly specifies the data transfer outside the loop; a parallel processing specification unit that identifies the loop statements of the application program and, for each identified loop statement, specifies a parallel processing specification statement in the offload device and compiles it; and a parallel processing pattern creation unit 117.
  • It further includes a resource ratio determination unit 115 that determines the ratio of the processing times of the CPU and the offload device as a resource ratio, and a resource amount setting unit 116 that sets the resource amounts of the CPU and the offload device so as to satisfy a predetermined cost condition based on the determined resource ratio.
  • The offload server 200 (see FIG. 6) according to the second embodiment is an offload server that offloads specific processing of an application program from the CPU to an offload device including a GPU or PLD. It includes a storage unit having a code pattern DB 230 that holds libraries and IP cores offloadable to the GPU or PLD; an application code analysis unit 112 that analyzes the source code of the application program and detects the external library calls included in the source code; a replacement function detection unit 213 that acquires the library or IP core from the code pattern DB 230 using the external library call as a key; a replacement processing unit 214 that replaces the replacement-source processing description of the application program's source code with the processing description of the replacement-destination library or IP core acquired by the replacement function detection unit 213, and offloads the replaced library/IP core processing description to the GPU or PLD as a functional block to be offloaded; an offload pattern creation unit 215 that extracts offload patterns that become faster by creating an interface with the host program and trying offloading and not offloading through performance measurement in the verification environment; an executable file creation unit 119 that compiles the application of the created GPU or PLD processing pattern and creates an executable file; and a performance measurement unit 118 that places the created executable file on the accelerator verification device and executes performance measurement processing for the case of offloading to the GPU or PLD.
  • It further includes a resource amount setting unit 116 that sets the resource amounts of the CPU and the offload device so as to satisfy the cost condition.
  • The resource ratio determination unit 115 is characterized in that it determines the resource ratio so that the processing times of the CPU and the offload device are of the same order.
  • The resource ratio determination unit 115 is characterized in that it sets the resource ratio to a predetermined upper limit value when the difference between the processing times of the CPU and the offload device is equal to or greater than a predetermined threshold value.
  • The resource amount setting unit 116 is characterized in that it maintains the determined resource ratio and sets the maximum resource amounts that satisfy a predetermined cost condition.
  • The resource amount setting unit 116 is characterized in that, when the predetermined cost condition is not satisfied even by setting the minimum resource amounts that maintain the determined resource ratio, it sets the resource amounts, starting from that minimum, so that the resource amounts of the CPU and the offload device approach the set resource ratio while satisfying the cost condition.
  • The present invention is also an offload program for causing a computer to function as the offload server described above. Thereby, each function of the above-described offload server 1 can be realized using an ordinary computer.
  • Each component of each illustrated device is a functional concept and does not necessarily have to be physically configured as illustrated. That is, the specific form of distribution and integration of each device is not limited to the illustrated form, and all or part of it can be configured to be functionally or physically distributed or integrated in arbitrary units according to various loads and usage conditions.

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Debugging And Monitoring (AREA)

Abstract

An offload server (1) comprises: a performance measurement unit (118) that performs performance measurement processing in which a compiled application program of parallel processing patterns is placed on a verification machine (14) and offloaded to an offload device; a resource ratio determination unit (115) that determines the ratio of the processing times of the CPU and the offload device as a resource ratio on the basis of the performance measurement results; and a resource amount setting unit (116) that sets resource amounts for the CPU and the offload device so as to satisfy prescribed cost conditions on the basis of the determined resource ratio.
PCT/JP2020/042342 2020-11-12 2020-11-12 Serveur de délestage, procédé de commande de délestage et programme de délestage WO2022102071A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP2022561797A JPWO2022102071A1 (fr) 2020-11-12 2020-11-12
PCT/JP2020/042342 WO2022102071A1 (fr) 2020-11-12 2020-11-12 Serveur de délestage, procédé de commande de délestage et programme de délestage

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2020/042342 WO2022102071A1 (fr) 2020-11-12 2020-11-12 Serveur de délestage, procédé de commande de délestage et programme de délestage

Publications (1)

Publication Number Publication Date
WO2022102071A1 true WO2022102071A1 (fr) 2022-05-19

Family

ID=81601834

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2020/042342 WO2022102071A1 (fr) 2020-11-12 2020-11-12 Serveur de délestage, procédé de commande de délestage et programme de délestage

Country Status (2)

Country Link
JP (1) JPWO2022102071A1 (fr)
WO (1) WO2022102071A1 (fr)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2020137017A (ja) * 2019-02-22 2020-08-31 日本電信電話株式会社 オフロードサーバのソフトウェア最適配置方法およびプログラム

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2020137017A (ja) * 2019-02-22 2020-08-31 日本電信電話株式会社 オフロードサーバのソフトウェア最適配置方法およびプログラム

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YAMATO, YOJI: "Evaluation of Automatic GPU and FPGA Offloading for Function Blocks of Applications", IEICE TECHNICAL REPORT, SC, vol. 119, no. 482 (SC2019-44), 9 March 2020 (2020-03-09), JP , pages 59 - 66, XP009537326, ISSN: 2432-6380 *

Also Published As

Publication number Publication date
JPWO2022102071A1 (fr) 2022-05-19

Similar Documents

Publication Publication Date Title
Pérez et al. Simplifying programming and load balancing of data parallel applications on heterogeneous systems
JP7063289B2 (ja) オフロードサーバのソフトウェア最適配置方法およびプログラム
JP6927424B2 (ja) オフロードサーバおよびオフロードプログラム
WO2021156956A1 (fr) Serveur de délestage, procédé de commande de délestage et programme de délestage
Carneiro Pessoa et al. GPU‐accelerated backtracking using CUDA Dynamic Parallelism
JP6992911B2 (ja) オフロードサーバおよびオフロードプログラム
WO2022102071A1 (fr) Serveur de délestage, procédé de commande de délestage et programme de délestage
WO2022097245A1 (fr) Serveur de délestage, procédé de commande de délestage et programme de délestage
US20230065994A1 (en) Offload server, offload control method, and offload program
US20230096849A1 (en) Offload server, offload control method, and offload program
Yamato Proposal and evaluation of GPU offloading parts reconfiguration during applications operations for environment adaptation
JP7473003B2 (ja) オフロードサーバ、オフロード制御方法およびオフロードプログラム
WO2023228369A1 (fr) Serveur de délestage, procédé de commande de délestage et programme de délestage
WO2023002546A1 (fr) Serveur de délestage, procédé de commande de délestage et programme de délestage
WO2023144926A1 (fr) Serveur de délestage, procédé de commande de délestage et programme de délestage
JP7363930B2 (ja) オフロードサーバ、オフロード制御方法およびオフロードプログラム
US12033235B2 (en) Offload server, offload control method, and offload program
WO2024147197A1 (fr) Serveur de délestage, procédé de commande de délestage et programme de délestage
WO2024079886A1 (fr) Serveur de délestage, procédé de commande de délestage et programme de délestage
JP7184180B2 (ja) オフロードサーバおよびオフロードプログラム
US20240192934A1 (en) Framework for development and deployment of portable software over heterogenous compute systems
Morman et al. The Future of GNU Radio: Heterogeneous Computing, Distributed Processing, and Scheduler-as-a-Plugin
Toledo et al. Towards Enhancing Coding Productivity for GPU Programming Using Static Graphs. Electronics 2022, 11, 1307
Grossman et al. Distributed, Heterogeneous Scheduling Techniques Motivated by Production Geophysical Applications
Agathos Efficient OpenMP runtime support for general-purpose and embedded multi-core platforms

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20961599

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2022561797

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20961599

Country of ref document: EP

Kind code of ref document: A1