WO2017131187A1

WO2017131187A1 - Accelerator control device, accelerator control method and program

Info

Publication number: WO2017131187A1
Application number: PCT/JP2017/003028
Authority: WO
Inventors: 鈴木　順; 真樹菅; 佑樹林
Original assignee: 日本電気株式会社
Priority date: 2016-01-29
Filing date: 2017-01-27
Publication date: 2017-08-03
Also published as: JPWO2017131187A1; US10831547B2; US20190026157A1; JP6897574B2

Abstract

The present invention increases the speed of processing of a task using an accelerator. An accelerator control device is provided with: a task storage unit that holds an executable task; a data scheduler that selects, from among the executable tasks, a task with a relatively small amount of data to be input/output to/from memory when executed with an accelerator having a memory, and instructs the accelerator to prepare for data input/output to/from memory for the selected task; and a task scheduler that instructs the accelerator to execute the selected task, and adds, to the task storage unit, the task which becomes executable due to completion of the selected task. The data scheduler, in accordance with the memory usage status, continues the selection of the next task from among the executable tasks held by the task storage unit and the preparations for data input/output in respect to the selected next task.

Description

Accelerator control device, accelerator control method, and program

(Description of related applications)
The present invention claims priority from Japanese Patent Application No. 2016-015352 (filed on January 29, 2016), the entire contents of which are incorporated herein by reference.
The present invention relates to an accelerator control device, an accelerator control method, and a program, and more particularly, to an accelerator control device, an accelerator control method, and a program that control computation using an accelerator.

In recent years, there has been a growing need to analyze big data such as satellite images and sensor data in real time, in order to discover unknown phenomena or to predict phenomena that may occur in the future. The volume of data to be analyzed is increasing as sensing accuracy improves. However, from the viewpoint of cost, it is difficult for each operator (or business operator) to occupy a cluster (computer cluster) having a scale of 100 to 1000.

Therefore, accelerators equipped with a GPU (Graphics Processing Unit) or the like have recently come into increasing use for the real-time analysis described above. Patent Document 1 describes an example of an accelerator control device. As shown in FIG. 22, the accelerator control device described in Patent Document 1 is configured by an information processing device 8. The information processing device 8 includes a shared memory 81 and a plurality of accelerators 821 to 823 connected to the shared memory 81.

The shared memory 81 holds data processed by the accelerators 821 to 823. The accelerators 821 to 823 perform processing on the data moved from the shared memory 81 to the memories (not shown) of the accelerators 821 to 823. The accelerators 821 to 823 then move the processed data from their own memories back to the shared memory 81. This data movement and processing is repeated until the desired processing is completed.

Patent Document 1: JP 2013-025392 A

The entire disclosure of the above patent document is incorporated herein by reference. The following analysis was made by the present inventors.

In the technique described in Patent Document 1, it takes time to move data from the shared memory to the accelerator memory, so calculation using the accelerator may not be performed at high speed. For the same reason, when a calculation is performed using a plurality of accelerators, the total calculation time may not decrease in accordance with the number of accelerators used, and scalability may not be obtained.

For example, the number of nodes can be reduced to 1/10 by using accelerators equipped with a GPU (Graphics Processing Unit) or the like instead of employing cluster technology. On the other hand, when accelerators are used, the memory capacity is reduced to 1/1000 compared with cluster technology. Therefore, out-of-core processing, in which data that does not fit in the accelerator memory is exchanged between the shared memory (or main memory) and the accelerator memory, increases. In a typical example, using an accelerator changes processing performance and memory capacity from their values under cluster technology as follows:
Processing performance: 100 gigaflops (CPU: Central Processing Unit) ⇒ 1 teraflop (GPU)
Memory capacity: 1 terabyte (CPU) ⇒ 10 gigabyte (GPU)

However, the I/O (Input/Output) bandwidth for inputting and outputting data to and from the accelerator is extremely narrow compared with the calculation performance of the GPU. In a typical example, the I/O bandwidth is 32 gigabytes/second (GB/s) for a computing performance of 1 teraflop (TFlop). Therefore, data I/O between the accelerator memory and the main memory may become a bottleneck in speeding up the processing.
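As a rough illustration of the gap described above, the following Python sketch works out the ratios implied by the typical figures quoted in this section (1 teraflop, 32 GB/s, 10 gigabytes). The 4-byte operand size is an assumption made only for this example and does not come from the description.

```python
# Typical figures quoted in the text.
PEAK_FLOPS = 1e12        # 1 teraflop per second (GPU)
IO_BANDWIDTH = 32e9      # 32 gigabytes per second
ACCEL_MEMORY = 10e9      # 10 gigabytes of accelerator memory

# Assumption for illustration only: each operand is a 4-byte value.
BYTES_PER_OPERAND = 4

# Floating-point operations the GPU can perform in the time it takes
# to transfer one byte over the I/O path.
flops_per_byte = PEAK_FLOPS / IO_BANDWIDTH

# Operations needed per transferred operand to keep the GPU busy.
flops_per_operand = flops_per_byte * BYTES_PER_OPERAND

# Time to fill the whole accelerator memory once over the I/O path.
fill_seconds = ACCEL_MEMORY / IO_BANDWIDTH

print(f"{flops_per_byte:.2f} flops per transferred byte")
print(f"{flops_per_operand:.0f} flops per transferred operand")
print(f"{fill_seconds:.4f} s to fill the accelerator memory")
```

Under these assumptions, a task has to perform on the order of a hundred operations per transferred operand before computation, rather than I/O, dominates the processing time.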

Therefore, speeding up task processing that uses an accelerator having a memory is a challenge. An object of the present invention is to provide an accelerator control device, an accelerator control method, and a program that contribute to solving this problem. Other problems and objects of the present invention will become clear from the description of the embodiments given later.

The accelerator control device according to the first aspect of the present invention includes: a task storage unit that holds executable tasks; a data scheduler that selects, from among the executable tasks, a task with a relatively small amount of input/output data to and from the memory when executed on an accelerator having a memory, and instructs the accelerator to prepare for data input/output in the memory for the selected task; and a task scheduler that instructs the accelerator to execute the selected task and adds, to the task storage unit, a task that becomes executable upon completion of the selected task. In accordance with the usage status of the memory, the data scheduler continues selecting the next task from among the executable tasks held by the task storage unit and preparing data input/output for the selected next task.

The accelerator control method according to the second aspect of the present invention includes: holding executable tasks in a storage unit; selecting, from among the executable tasks, a task with a relatively small amount of input/output data to and from the memory when executed on an accelerator having a memory, and instructing the accelerator to prepare for data input/output in the memory for the selected task; instructing the accelerator to execute the selected task, and adding, to the storage unit, a task that becomes executable upon completion of the selected task; and, in accordance with the usage status of the memory, continuing to select the next task from among the executable tasks held by the storage unit and to prepare data input/output for the selected next task.

The program according to the third aspect of the present invention causes a computer to execute: a process of holding executable tasks in a storage unit; a process of selecting, from among the executable tasks, a task with a relatively small amount of input/output data to and from the memory when executed on an accelerator having a memory, and instructing the accelerator to prepare for data input/output in the memory for the selected task; a process of instructing the accelerator to execute the selected task when the preparation for data input/output in the memory is completed, and adding, to the storage unit, a task that becomes executable upon completion of the selected task; and a process of continuing, in accordance with the usage status of the memory, to select the next task from among the executable tasks held by the storage unit and to prepare data input/output for the selected next task. The program can also be provided as a program product recorded in a non-transitory computer-readable storage medium.

According to the accelerator control device, the accelerator control method, and the program according to the present invention, it is possible to speed up the task processing using the accelerator having the memory.

FIG. 1 is a block diagram illustrating the configuration of an accelerator control device according to one embodiment.
FIG. 2 is a diagram illustrating the operation of the accelerator control device according to one embodiment.
FIG. 3 is a block diagram illustrating another configuration of the accelerator control device according to one embodiment.
FIG. 4 is a diagram for explaining the operation of the accelerator control device according to one embodiment.
FIG. 5 is a diagram illustrating the operation of the accelerator control device according to one embodiment.
FIG. 6 is a diagram for explaining the operation of a comparative example.
FIG. 7 is a diagram for explaining the effect of the accelerator control device according to one embodiment.
FIG. 8 is a block diagram illustrating the configuration of the accelerator control device according to a first embodiment.
FIG. 9 is a diagram illustrating the reservation API (Application Programming Interface) and the execution API in the accelerator control device according to the first embodiment.
FIG. 10 is a diagram illustrating the structure of a DAG (Directed Acyclic Graph) in the accelerator control device according to the first embodiment.
FIG. 11 is a diagram for explaining the division of data and processing in the accelerator control device according to the first embodiment.
FIG. 12 is a diagram for explaining the division of data and processing in the accelerator control device according to the first embodiment.
FIG. 13 is a block diagram illustrating the configuration of the accelerator control unit of the accelerator control device according to the first embodiment.
FIG. 14 is a diagram illustrating the structure of the memory management table in the accelerator control device according to the first embodiment.
FIG. 15 is a diagram illustrating the structure of the data management table in the accelerator control device according to the first embodiment.
FIG. 16 is a diagram illustrating the tasks held by the non-executable subtask storage unit of the accelerator control device according to the first embodiment.
FIG. 17 is a flowchart illustrating the operation of the accelerator control device according to the first embodiment.
FIG. 18 is a sequence diagram illustrating the detailed operation of the accelerator control device according to the first embodiment.
FIG. 19 is a flowchart illustrating the operation of the data scheduler of the accelerator control device according to the first embodiment.
FIG. 20 is a flowchart illustrating the operation of the prefetch determination unit of the accelerator control device according to the first embodiment.
FIG. 21 is a flowchart illustrating the operation of the next subtask determination unit of the accelerator control device according to the first embodiment.
FIG. 22 is a diagram for explaining the related technique described in Patent Document 1.

First, an outline of one embodiment will be described. Note that the reference numerals of the drawings attached to this summary are merely examples for facilitating understanding, and are not intended to limit the present invention to the illustrated embodiment.

FIG. 1 is a block diagram illustrating the configuration of an accelerator control device 10 according to an embodiment. Referring to FIG. 1, the accelerator control device 10 includes a task storage unit 11, a data scheduler 12, and a task scheduler 13.

The task storage unit 11 holds executable tasks (for example, the tasks shown in FIG. 10 or the subtasks shown in FIGS. 11 and 12). The data scheduler 12 selects, from among the executable tasks, a task with a relatively small (for example, the smallest) amount of input/output data to and from the memory when executed on an accelerator having a memory (for example, an accelerator having the accelerator memory of FIG. 8), and instructs the accelerator to prepare for data input/output in the memory for the selected task. The task scheduler 13 instructs the accelerator to execute the selected task (for example, when preparation for data input/output in the memory is completed), and adds to the task storage unit 11 a task that becomes executable upon completion of the selected task (for example, the task 72 of FIG. 10, which becomes executable upon completion of the task 71). Here, in accordance with the usage status of the memory, the data scheduler 12 continues selecting the next task from among the executable tasks held by the task storage unit 11 and preparing data input/output for the selected next task.

That is, the accelerator control device 10 adopts a configuration that selects, as the next task, a task with a relatively small amount of data input/output to and from the accelerator memory, and continues preparing data input/output for the selected task in accordance with the usage status of the memory (for example, when there is spare capacity). As a result, the amount of data input/output between the accelerator memory and the external memory can be reduced, and at the same time the I/O bandwidth between the accelerator memory and the external memory can be used effectively. Therefore, according to the accelerator control device 10, it is possible to speed up task processing using an accelerator having a memory.

FIG. 2 is a diagram illustrating the operation of the accelerator control device 10 shown in FIG. FIG. 2A illustrates a DAG (Directed Acyclic Graph, directed acyclic graph) indicating the processing of the user program. Here, as an example, each node of the DAG represents a subtask (see FIGS. 11 and 12) obtained by dividing the task.

Referring to FIG. 2B, the task scheduler 13 and the data scheduler 12 operate in parallel. The task scheduler 13 loads the executable subtasks “1” to “3” into the executable list of the task storage unit 11. The data scheduler 12 selects, from the executable list held by the task storage unit 11, the subtask whose input data requires the least I/O to the accelerator (or accelerator memory), and performs the I/O of the data necessary for executing that subtask. For example, when only the input data of the subtask “2” is cached in the accelerator memory, the data scheduler 12 selects the subtask “2”. The data scheduler 12 also deletes the entry of the selected subtask “2” from the executable list in the task storage unit 11.

Referring to FIG. 2C, the data scheduler 12 completes the I/O of the input data and the allocation of the output memory for executing the subtask “2”, locks the memory area, and notifies the task scheduler 13 that the subtask “2” can be executed. The data scheduler 12 then selects the subtask to be subjected to the next I/O from the executable list in the task storage unit 11. Here, as an example, it is assumed that the data scheduler 12 selects the subtask “1”. Meanwhile, the task scheduler 13 executes the subtask “2”.

Referring to FIG. 2D, the task scheduler 13 completes the execution of the subtask “2” and notifies the data scheduler 12 of the completion of the execution of the subtask “2”. The data scheduler 12 unlocks the input / output data of the subtask “2”. According to the DAG in FIG. 2A, the subtask “5” can be executed, so the task scheduler 13 loads the subtask “5” in the executable list of the task storage unit 11.

Hereinafter, the same processing is performed by the parallel operation of the task scheduler 13 and the data scheduler 12. When there are a plurality of accelerators, the data scheduler 12 performs the above process for each accelerator.

In this way, while a subtask is being executed by the task scheduler 13, the data scheduler 12 selects, as the next task, the subtask that minimizes the amount of data input/output to and from the accelerator memory, and continues preparing data input/output for the selected subtask. As a result, data input/output between the accelerator memory and the external memory can be reduced, and the I/O bandwidth between the accelerator memory and the external memory can be used effectively.
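The interplay walked through for FIG. 2 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation of the embodiment: the dictionary layout of a subtask, the function names, the data sizes, and the single DAG edge from subtask “2” to subtask “5” are all hypothetical, and the I/O cost is modeled simply as the total size of the inputs that are not yet cached in the accelerator memory.

```python
def transfer_bytes(subtask, cached, sizes):
    """Bytes to load: total size of the inputs that are not cached."""
    return sum(sizes[d] for d in subtask["inputs"] if d not in cached)


def select_next(executable, cached, sizes):
    """Data scheduler role: pick the subtask with the least I/O and
    delete its entry from the executable list."""
    best = min(executable, key=lambda s: transfer_bytes(s, cached, sizes))
    executable.remove(best)
    return best


def on_complete(done_id, finished, pending, executable):
    """Task scheduler role: load subtasks whose upstream subtasks have
    all finished into the executable list."""
    finished.add(done_id)
    for s in list(pending):
        if set(s["upstream"]) <= finished:
            pending.remove(s)
            executable.append(s)


# Hypothetical data sizes and DAG fragment (not taken from FIG. 2A).
sizes = {"d1": 100, "d2": 100, "d3": 100, "d2_out": 100}
executable = [
    {"id": "1", "inputs": ["d1"], "upstream": []},
    {"id": "2", "inputs": ["d2"], "upstream": []},
    {"id": "3", "inputs": ["d3"], "upstream": []},
]
pending = [{"id": "5", "inputs": ["d2_out"], "upstream": ["2"]}]
cached = {"d2"}          # only the input of subtask "2" is cached
finished = set()

chosen = select_next(executable, cached, sizes)
on_complete(chosen["id"], finished, pending, executable)
print(chosen["id"], [s["id"] for s in executable])
```

In this run the subtask “2” is chosen because its input is already cached, and its completion makes the hypothetical successor “5” executable alongside “1” and “3”.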

FIG. 3 is a block diagram illustrating another configuration of the accelerator control device 10 according to an embodiment. Referring to FIG. 3, the task storage unit 11 includes a first storage unit 14 that holds executable tasks (or subtasks) whose execution destination accelerator is not limited, and a second storage unit 15 that holds tasks whose execution destination accelerator is limited. In this case, the data scheduler 12 selects a task with a relatively small (for example, the smallest) amount of input/output data to and from the memory when executed on an accelerator, from among the tasks held by the first storage unit 14 and the tasks held by the second storage unit 15 whose execution destination is limited to that accelerator.

For example, the first storage unit 14 holds the most upstream tasks and tasks for which all upstream tasks have been executed. On the other hand, the second storage unit 15 holds, as tasks whose execution destination accelerator is limited, tasks for which at least one upstream task is waiting to be executed by the accelerator (that is, preparation for data input/output has been completed and the task is waiting to be executed by the accelerator) and all the remaining upstream tasks have been executed.

FIG. 4 is a diagram for explaining the operation of the accelerator control device 10 shown in FIG. Here, it is assumed that accelerators 51 to 5N (N is a natural number) have GPUs 1 to N, respectively. The first storage unit 14 holds subtasks whose execution destination accelerators are not limited. On the other hand, the second storage unit 15 holds, for each accelerator, subtasks whose execution destination accelerator (or GPU) is limited. The subtasks accumulated in the first storage unit 14 and the second storage unit 15 are in the “I/O wait” state.

The data scheduler 12 selects the subtask with the smallest amount of input/output data to and from the memory when executed on an accelerator (for example, the accelerator corresponding to GPU 1), from among the subtasks held by the first storage unit 14 (Ready Sub Tasks) and the subtasks held by the second storage unit 15 whose execution destination is limited to that accelerator (for example, GPU 1 Ready Sub Tasks). When preparation for data input/output for the selected subtask (I/O in FIG. 4) is completed, the subtask is stored in the queue (FIFO: First-In First-Out) for the corresponding GPU and enters the “waiting for execution” state. The subtasks stored in the queue are executed in order by the GPU of the corresponding accelerator (for example, GPU 1) (Processing in FIG. 4), and enter the “execution complete” state when execution is completed.

FIG. 5 is a diagram illustrating the operation of the accelerator control device 10 shown in FIG. FIG. 5A illustrates a DAG indicating the processing of the user program. Here, as an example, each node of the DAG represents a subtask obtained by dividing a task. Referring to FIG. 5B, at the point when the subtask “2” is waiting to be executed by an accelerator, the data scheduler 12 (or the task scheduler 13) adds the subtask “5”, which becomes executable upon completion of the subtask “2”, to the local queue for that accelerator (or GPU) held by the second storage unit 15. When scheduling a subtask, the data scheduler 12 refers to the executable list held by the first storage unit 14 and to the local queue, held by the second storage unit 15, for the accelerator (or GPU) being scheduled, and selects from the subtasks in that list and queue the one with the smallest amount of input/output data to and from the memory when executed on that accelerator. Since operations are serialized within each accelerator, there is no problem even if the subtask “5” is selected in the state shown in FIG. The data scheduler 12 does not consider other accelerators when selecting subtasks. When the data scheduler 12 (or the task scheduler 13) has selected the subtask “5” and there is a subtask that becomes executable upon completion of the subtask “5”, it further adds that subtask to the local queue for the corresponding accelerator (or GPU). After a subtask completes, if the local queue held by the second storage unit 15 contains a corresponding entry (that is, a subtask whose execution destination accelerator is no longer limited, for example, a subtask for which all upstream subtasks have been executed), the task scheduler 13 moves that entry from the second storage unit 15 to the executable list held by the first storage unit 14.

As described above, the accelerator control device 10 illustrated in FIG. 3 includes the first storage unit 14, which holds the most upstream tasks and tasks for which all upstream tasks have been executed, and the second storage unit 15, which holds, as tasks whose execution destination accelerator is limited, tasks for which at least one upstream task is waiting to be executed by the accelerator and all the remaining upstream tasks have been executed. In the accelerator control device 10, the data scheduler 12 selects the task with the smallest amount of input/output data to and from the memory when executed on an accelerator, from among the tasks held by the first storage unit 14 and the tasks held by the second storage unit 15 whose execution destination is limited to that accelerator. This makes it possible to further speed up task processing using an accelerator having a memory. This is because, while a task is still waiting for execution and has not yet completed, the data scheduler 12 can treat a subsequent task that becomes executable upon completion of that task as a candidate for starting preparation of input/output data.
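The selection over the two storage units can be sketched as follows. This is a hedged illustration only: the names `select_for`, `executable_list`, `local_queues`, and `resident`, the accelerator labels, and the data sizes are assumptions introduced for the example. The input of subtask “5” is treated as already resident on the accelerator where subtask “2” is waiting for execution, because its output area has been reserved there.

```python
def transfer_bytes(subtask, resident, sizes):
    """Bytes to load: total size of inputs not in the accelerator memory."""
    return sum(sizes[d] for d in subtask["inputs"] if d not in resident)


def select_for(accel, executable_list, local_queues, resident, sizes):
    """Pick the cheapest subtask for `accel` from the shared executable
    list (first storage unit) and its own local queue (second storage unit)."""
    # Remember which container each candidate came from so it can be removed.
    pool = [(s, executable_list) for s in executable_list]
    pool += [(s, local_queues[accel]) for s in local_queues[accel]]
    best, home = min(
        pool, key=lambda p: transfer_bytes(p[0], resident[accel], sizes)
    )
    home.remove(best)
    return best


sizes = {"a": 100, "b": 100, "out2": 100}
executable_list = [{"id": "1", "inputs": ["a"]}, {"id": "3", "inputs": ["b"]}]
# Subtask "2" is waiting for execution on gpu1, so its successor "5" sits
# in gpu1's local queue; the output of "2" has memory reserved on gpu1.
local_queues = {"gpu1": [{"id": "5", "inputs": ["out2"]}], "gpu2": []}
resident = {"gpu1": {"out2"}, "gpu2": set()}

first = select_for("gpu1", executable_list, local_queues, resident, sizes)
second = select_for("gpu2", executable_list, local_queues, resident, sizes)
print(first["id"], second["id"])
```

Here gpu1 picks subtask “5” from its local queue because no transfer is needed, while gpu2, which cannot see that queue, picks from the shared executable list.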

Next, effects brought about by the accelerator control device 10 (FIGS. 1 and 3) according to an embodiment will be described in comparison with a comparative example.

FIG. 6 is a diagram for explaining the operation of the comparative example. Referring to FIG. 6, in the comparative example, input data is prepared and an output memory area is secured, in order, for subtasks that have become executable upon completion of their upstream subtasks.

FIG. 7 is a diagram for explaining the effect of the accelerator control device 10 according to the embodiment. Referring to FIG. 7, the data A to C of the DAG are each divided into N data partitions (N is a natural number). Similarly, the tasks A and B are each divided into N subtasks. For example, applying the subtasks STa1 to STaN to the data partitions DPa1 to DPaN gives the same result as when the task A is not divided (that is, when the task A is applied to the data A). Here, it is assumed that the data partitions of both data A and B cannot all be held in the accelerator memory at the same time.

In the comparative example shown in FIG. 6, when the subtasks of FIG. 7 are processed, the subtasks STa1 to STaN are first stacked in the FIFO, followed by the subtasks STb1 to STbN. However, since not all of the data A and B can be held in the accelerator memory, at least some of the data partitions DPb1 to DPbN (for example, the data partition DPbx), which are used later, must be swapped out (that is, moved from the accelerator memory to the main memory) during execution of the subtasks STa1 to STaN. Furthermore, the swapped-out data partition DPbx must be swapped in (that is, moved from the main memory to the accelerator memory) when the subtask STbx is executed.

On the other hand, according to the accelerator control device 10 of the embodiment, the subtasks STa2 and STb2 can be executed after the subtasks STa1 and STb1 have been executed, so no swap (that is, I/O) occurs for the data partitions DPb1 to DPbN (for example, the data partition DPbx). Therefore, according to one embodiment, data I/O between the accelerator and the main memory can be reduced as compared with the comparative example, and the processing speed can be increased.
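The difference between the two execution orders can be made concrete with a small simulation. The sketch below is a simplified model under assumptions that are not part of this description: accelerator memory holds a fixed number of partitions, eviction is least-recently-used, and only loads from main memory are counted (write-backs on swap-out are ignored). The function name `count_loads` and the chosen values of N and capacity are hypothetical.

```python
from collections import OrderedDict


def count_loads(order, capacity):
    """Count loads from main memory for a sequence of (input, output) subtasks."""
    resident = OrderedDict()   # partitions in accelerator memory, oldest first
    loads = 0

    def touch(partition):
        # Mark as most recently used, then evict the oldest if over capacity.
        resident[partition] = True
        resident.move_to_end(partition)
        while len(resident) > capacity:
            resident.popitem(last=False)

    for inp, out in order:
        if inp not in resident:
            loads += 1         # input must be brought in from main memory
        touch(inp)
        touch(out)             # output area is allocated on the accelerator
    return loads


N = 3
task_a = [(f"DPa{i}", f"DPb{i}") for i in range(1, N + 1)]   # STa1..STaN
task_b = [(f"DPb{i}", f"DPc{i}") for i in range(1, N + 1)]   # STb1..STbN

# Comparative example: all subtasks of task A, then all subtasks of task B.
fifo_order = task_a + task_b
# Order enabled by the embodiment: STa1, STb1, STa2, STb2, ...
interleaved = [st for pair in zip(task_a, task_b) for st in pair]

fifo_loads = count_loads(fifo_order, capacity=2)
interleaved_loads = count_loads(interleaved, capacity=2)
print(fifo_loads, interleaved_loads)
```

In this model the FIFO order reloads every partition of data B, whereas the interleaved order loads only the partitions of data A, because each DPb partition is consumed while it is still resident.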

<Embodiment 1>
Next, the accelerator control apparatus according to the first embodiment will be described in detail with reference to the drawings.

[Configuration]
FIG. 8 is a block diagram illustrating the configuration of the accelerator control device 1 according to this embodiment. Referring to FIG. 8, the accelerator control device 1 includes accelerators 51 to 53, a main memory 4, an accelerator control unit 3, a user program 21, and a DAG (Directed Acyclic Graph) creation unit 22. The accelerator control device 1 is realized by a host computer as an example. The user program 21 may be configured outside the accelerator control device 1.

Accelerators 51 to 53 execute calculation processing.

The main memory 4 is a memory for saving data that cannot be held due to a shortage of memory resources of the accelerators 51 to 53.

The accelerator control unit 3 controls the accelerators 51 to 53.

The DAG creation unit 22 creates a DAG (Directed Acyclic Graph) indicating the processing of the user program 21 in response to API (Application Programming Interface) calls from the user program 21, and transmits the DAG to the accelerator control unit 3.

In FIG. 8, the number of accelerators is three for convenience of explanation. However, it suffices that there be one or more accelerators, and the number is not limited to that illustrated. Examples of the accelerator include, but are not limited to, a GPU (Graphics Processing Unit) from NVIDIA and Xeon Phi from Intel. The accelerator is a coprocessor of the CPU (Central Processing Unit) of the computer and is implemented, for example, by being inserted into an I/O (Input/Output) slot of the computer.

In the following, where the descriptions of the plurality of accelerators 51 to 53 overlap, only the accelerator 51 will be described. The same description applies to the accelerators 52 and 53.

The accelerator 51 includes a processor 511 that processes data and an accelerator memory 521 that stores data. Here, the local memory included in the accelerator is referred to as an accelerator memory.

The user program 21 is an application program created by a programmer (user) using the accelerators 51 to 53 or an application program executed by the user. The user program 21 is implemented using an API provided by the DAG creation unit 22 as an example. The API provided by the DAG creation unit 22 includes, for example, two types of APIs, a reservation API and an execution API, as shown in FIG.

The reservation API corresponds to one of the tasks (or processes) of the DAG shown in FIG. When the reservation API is called from the user program 21, the DAG creation unit 22 adds one task, together with the data generated by that task, to the DAG. For example, in FIG. 10, when the task 71 is called for the data 61 using the reservation API, the DAG creation unit 22 adds the task 71 and its output data 62 to the DAG. The reservation API is an API for reserving a task. That is, immediately after the reservation API is called, no task is executed in the accelerators 51 to 53, and only the DAG is generated.

On the other hand, when the execution API is called, a new task and the data generated by that task may or may not be added to the DAG. In addition, calling the execution API triggers the execution of the tasks of the DAG generated so far. Tasks belonging to the execution API include cases where the user program 21 needs the data resulting from the DAG processing, and storeObject, which holds data as a data object in the accelerator memory.

The reservation API and the execution API may have one or more arguments α, β, γ, ..., as shown in FIG. One of these arguments may be a kernel function. Here, a kernel function is a function that describes the processing the user program 21 performs on data. Whether an API takes a function as an argument depends on the type of reservation API or execution API. The reservation API and the execution API indicate patterns of processing performed on the data, and the actual specific processing is carried out by the kernel function given in the user program 21 as an argument of the reservation API or the execution API.

An example of an API that takes a kernel function as an argument is map. In the map, a kernel function is applied to all elements constituting input data. The DAG input data is, for example, an image or a database table. When map is applied to these data, the kernel function is applied individually to each pixel of the image and each entry in the database.

On the other hand, APIs that do not require kernel functions include, for example, storeObject, appendObject, and read. storeObject is an API that stores a calculation result as a data object in the accelerator memories 521 to 523. With storeObject, data stored as a data object in the accelerator memories 521 to 523 can be given a name. In this case, the name of the object is passed as an argument of storeObject. appendObject is an API used to add data to the end of an existing object. Furthermore, read is an API for retrieving, into user space, the contents of a data object residing on the accelerators 51 to 53.

Also, it is possible to specify data objects held in the accelerator memories 521 to 523 as input data of tasks indicated by the DAG. In this case, the name of an object held by the accelerators 51 to 53 is designated as input data for processing performed by the reservation API or execution API. This name is given by the program that called storeObject.
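The distinction between reservation-style and execution-style calls can be illustrated with a minimal host-side sketch. This is not the API of the embodiment: the class `DataHandle` and its method signatures are hypothetical, only the names map and read are taken from the text, and the kernels here run on the host rather than on an accelerator.

```python
class DataHandle:
    """Hypothetical handle for DAG data; records tasks instead of running them."""

    def __init__(self, source, reserved=()):
        self.source = list(source)
        self.reserved = tuple(reserved)   # kernel functions reserved so far

    def map(self, kernel):
        # Reservation-style call: only extends the DAG, nothing is executed.
        return DataHandle(self.source, self.reserved + (kernel,))

    def read(self):
        # Execution-style call: triggers every task reserved so far.
        result = self.source
        for kernel in self.reserved:
            result = [kernel(x) for x in result]
        return result


calls = []


def double(x):
    calls.append(x)        # record each kernel invocation
    return 2 * x


pixels = DataHandle([1, 2, 3])
reserved = pixels.map(double).map(lambda x: x + 1)
calls_before_read = len(calls)   # still 0: map only reserved the tasks
values = reserved.read()         # now both kernels run on every element
print(calls_before_read, values)
```

Deferring execution until an execution-style call gives a scheduler the whole chain of tasks at once, which is what allows it to order subtasks with data placement in mind.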

Here, each data of the DAG may be composed of two or more divisions (data partitions) as shown in FIG. FIG. 11 shows an example in which the data is composed of two data partitions in the data 61, task 71, data 62, task 72, and data 63 of the DAG in FIG. In this case, for example, if the task 71 is applied to both the data partition 61-1 and the data partition 61-2, the same result as the processing when the data 61 is not divided can be obtained. This is a process that belongs to the processing mode of data parallel in parallel calculation and is generally known among engineers in the technical field to which the present invention belongs. In FIG. 11, the processing for the data partition 61-1 is described as a subtask 71-1, etc., but the processing content of the subtask 71-1 is the same as the task 71 in FIG. Further, the processing for a plurality of divisions (data partitions) may be executed by different accelerators in a distributed manner.

FIG. 12 shows a case where the data 61 is divided into data partitions 61-1 to 61-4. Here, the data partition 61-1 and the data partition 61-2 are processed by the accelerator 51. On the other hand, the data partition 61-3 and the data partition 61-4 are processed by the accelerator 52. In this case, compared with the case where all four data partitions are processed by one accelerator, in the ideal case, the calculation performance is doubled.
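The equivalence between processing whole data and processing its data partitions can be sketched as below. The element-wise `task` function is a stand-in chosen for illustration (the description does not specify what task 71 computes), and the helper `split` and the accelerator labels are hypothetical; the equivalence shown holds for data-parallel processing of this kind.

```python
def split(data, n_parts):
    """Split a list into at most n_parts contiguous data partitions."""
    size = max(1, -(-len(data) // n_parts))   # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def task(data):
    """Stand-in for task 71: an element-wise operation (assumed)."""
    return [x * x for x in data]


data_61 = list(range(8))
partitions = split(data_61, 4)     # corresponds to 61-1 .. 61-4

# Assignment from the text: the first two partitions go to one
# accelerator and the remaining two to another.
assignment = {
    "accelerator51": partitions[:2],
    "accelerator52": partitions[2:],
}

# Each subtask applies the same processing to one data partition.
partial = [task(p) for parts in assignment.values() for p in parts]
merged = [x for part in partial for x in part]

print(merged == task(data_61))
```

Because each subtask touches only its own partition, the two groups of subtasks could run on different accelerators independently.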

In the following description, when there is no possibility of misunderstanding, the case of dividing data and tasks will be described, and the case of not dividing data and tasks will be omitted. Therefore, when data is not divided, the data partition in the following description means the original data itself before the division, and the subtask for the data partition means the task for the original data.

The DAG creation unit 22 generates a DAG every time the user program 21 calls a reservation API or an execution API. When the reservation API is called, the DAG creation unit 22 adds the corresponding processing and output data to the DAG. On the other hand, when the execution API is called, the DAG creation unit 22 adds processing and output data to the DAG if necessary, and notifies the accelerator control unit 3 of the DAG generated so far.
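The difference between the two kinds of API call can be sketched as follows. This is only an illustration under assumptions: the class `DagCreator`, its methods, and the tuple representation of a DAG node are hypothetical, and the `notify` callback merely stands in for the notification to the accelerator control unit 3.

```python
class DagCreator:
    """Hypothetical stand-in for the DAG creation unit 22."""

    def __init__(self, notify):
        self.nodes = []       # (api_type, kernel, output_name) in call order
        self.notify = notify  # stand-in for notifying the accelerator control unit 3

    def reserve(self, api_type, kernel, output_name):
        # Reservation API: only extend the DAG; nothing is sent yet.
        self.nodes.append((api_type, kernel, output_name))

    def execute(self, api_type=None, kernel=None, output_name=None):
        # Execution API: add a final node if one is given, then send the DAG built so far.
        if api_type is not None:
            self.nodes.append((api_type, kernel, output_name))
        self.notify(list(self.nodes))


received = []
creator = DagCreator(received.append)
creator.reserve("map", "kernel_71", "data62")
count_after_reserve = len(received)
creator.execute("map", "kernel_72", "data63")
```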

Note that the DAG created by the DAG creation unit 22 includes the type of reservation API and execution API called by the user program 21 and the kernel function assigned to each API. The DAG creation unit 22 transmits the identifier of the user program 21 when notifying the DAG. In addition, when the user program 21 terminates, the DAG creation unit 22 transmits the identifier of the user program 21 to the accelerator control unit 3 and requests that, among the data generated by the user program 21, the intermediate data other than the data specified by storeObject be erased.

FIG. 13 is a block diagram illustrating the configuration of the accelerator control unit 3 of the accelerator control device 1 shown in FIG. Referring to FIG. 13, the accelerator control unit 3 includes a program analysis unit 31, a task processing unit 32, a subtask storage unit 36, a data management unit 33, a data management table 34, and a memory management table 35. The program analysis unit 31 analyzes the DAG indicating the processing of the user program 21 received from the DAG creation unit 22. The task processing unit 32 executes DAG processing. The subtask storage unit 36 classifies the subtasks included in the DAG into those that can be executed and those that cannot, and holds them. The data management unit 33 manages and prepares data necessary for DAG processing. The memory management table 35 manages the memory of the accelerator. The data management table 34 manages data on the memory of the accelerator. Hereinafter, each of these configurations will be described in detail.

The memory management table 35 is a table for managing the accelerator memories 521 to 523. The accelerator memories 521 to 523 are divided into pages of a certain size and managed. The page size is, for example, 4 KB or 64 KB. As shown in FIG. 14, the memory management table 35 holds information about each page as an entry. The information on each page includes the number of the accelerator to which the page belongs, a page number, an in-use flag indicating that the page is in use, a data number indicating the identifier of the data held by the page when the page is in use, a partition number indicating which data partition of that data the page holds, and a lock flag indicating that the page is being used for calculation and must not be released. The in-use flag and the lock flag are Boolean values. The data identifier is assigned to DAG data.

Here, as an example, the in-use flag is “1” when the page is in use, and “0” otherwise. The lock flag is “1” when page release is prohibited, and “0” otherwise.

For example, the first entry of the memory management table 35 shown in FIG. 14 indicates that page 1 of the accelerator memory 521 held by the accelerator 51 is used by the data partition 62-1 (that is, the first data partition of the data 62). This page is locked because it is currently being used for calculation. Note that the data held by a locked page cannot be saved to the main memory 4.

The data management table 34 manages data on the accelerator memories 521 to 523. As shown in FIG. 15, the data management table 34 holds information about data in the DAG transmitted from the user program 21. Each entry holds a data number, the partition number of each data item, a calculated flag indicating whether the data has been calculated, a swap flag indicating that the data has been saved in the main memory 4, the accelerator number of the accelerator that holds the data, and the page number of the accelerator page that holds the data. The calculated flag and the swap flag are Boolean values.

Here, as an example, the calculated flag is “1” when it has been calculated, and “0” otherwise. The swap flag is “1” when the data is saved in the main memory 4, and is “0” otherwise.

For example, the first entry of the data management table 34 shown in FIG. 15 indicates that the first data partition (that is, the data partition 62-1) of the data whose data number is 62 has already been calculated and is held on page 1 of the accelerator memory 521 of the accelerator 51. Based on the accelerator number and page number held by an entry in the data management table 34, the corresponding entry in the memory management table 35 can be referenced to retrieve information on the pages used by each data item, or to lock those pages when they are used for calculation.
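The relationship between the two tables can be sketched in Python as follows. The field names, the `PageEntry` and `DataEntry` classes, the dictionary keyed by (accelerator number, page number), and the helper `lock_pages` are all assumptions made for illustration; the text specifies only which items each entry holds.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class PageEntry:
    """One entry of the memory management table 35 (field names are assumed)."""
    accelerator: int
    page: int
    in_use: bool = False
    data_number: Optional[int] = None
    partition_number: Optional[int] = None
    locked: bool = False


@dataclass
class DataEntry:
    """One entry of the data management table 34 (field names are assumed)."""
    data_number: int
    partition_number: int
    calculated: bool = False
    swapped: bool = False
    accelerator: Optional[int] = None
    pages: List[int] = field(default_factory=list)


def lock_pages(entry: DataEntry, memory_table: Dict[Tuple[int, int], PageEntry]) -> None:
    """Follow the accelerator and page numbers of a data entry and lock its pages."""
    for page in entry.pages:
        memory_table[(entry.accelerator, page)].locked = True


# Mirrors the example entries: data partition 62-1 on page 1 of accelerator 51.
memory_table = {(51, 1): PageEntry(51, 1, in_use=True, data_number=62, partition_number=1)}
entry_62_1 = DataEntry(62, 1, calculated=True, accelerator=51, pages=[1])
lock_pages(entry_62_1, memory_table)
```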

The program analysis unit 31 analyzes the DAG indicating the user process received from the DAG creation unit 22 and divides it into data and tasks. The program analysis unit 31 creates an entry in the data management table 34 for the data in the DAG. Here, the program analysis unit 31 creates a number of entries corresponding to the number of data partitions. At the time of data entry creation, calculation of each data partition has not yet been performed, so the calculated flag in the data management table 34 is “0”.

On the other hand, an entry already exists for data that was output by a DAG previously executed by the user program 21 and is used as input data of the current DAG, and for data of a data object that was previously created by a user program different from the user program 21 and stored in the memory on the accelerator. Therefore, the program analysis unit 31 does not need to create new entries for these data. Further, the calculated flag of these entries is set to “1” in the data management table 34.

The program analysis unit 31 requests the task processing unit 32 to execute the processing divided into units of DAG “tasks”. The program analysis unit 31 issues subtask requests for each DAG task according to the number of data partitions. Further, the program analysis unit 31 clears the in-use flag in the memory management table 35 of each page used by a deleted entry (for example, changes the in-use flag from “1” to “0”), thereby releasing the corresponding area of the accelerator memories 521 to 523.

The data management unit 33 includes a data scheduler 331 and a data moving unit 332. The data scheduler 331 instructs management of data held in the accelerator memories 521 to 523 and reservation of the memory. The data moving unit 332 loads data to the accelerators 51 to 53 and secures the accelerator memories 521 to 523.

The data scheduler 331 manages the accelerator memory 521 of the accelerator 51 with reference to the memory management table 35. The data scheduler 331 also manages the other accelerators 52 and 53 in the same manner. Further, the data scheduler 331 receives a request for input data and output data necessary for executing the subtask from the task processing unit 32.

When the subtask to be executed is the first subtask of the DAG, the identifier of the data object held in the accelerator memory is specified as input data. When the subtask to be executed is a subtask other than the first subtask, if the previous subtask in the DAG has been completed, the output data of that subtask has already been calculated. In any case, if the swap flag of the corresponding entry in the data management table 34 is “0”, those data partitions have not been saved in the main memory 4, so that the preparation is completed on the accelerator memory.

On the other hand, when the swap flag is “1”, the data scheduler 331 prepares the data partition in the accelerator memory. The data scheduler 331 refers to the memory management table 35 and confirms whether any of the accelerators 51 to 53 has enough free pages to load the saved data partition. If there are enough free pages, the data scheduler 331 requests the data moving unit 332 to load the saved data into the free pages. On the other hand, if there are not enough free pages, the data scheduler 331 refers to the data management table 34 and the memory management table 35, selects a data partition held by unlocked pages, and requests the data moving unit 332 to save that data partition to the main memory 4. Here, the data scheduler 331 makes save requests in units of data partitions. Since memory for loading the input data can be secured in this way, the data scheduler 331 then instructs the data moving unit 332 to load the data partition of the input data.
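The save-then-load behavior described above can be sketched as a loop that saves unlocked data partitions until enough pages are free. This is a minimal sketch under assumptions: pages are modeled as dictionaries, the helper names `count_free` and `make_room` are hypothetical, the first unlocked partition found is chosen as the one to save, and a lock is assumed to cover every page of a partition.

```python
def count_free(pages):
    """Number of pages of one accelerator whose in-use flag is cleared."""
    return sum(1 for p in pages if not p["in_use"])


def make_room(pages, needed, saved):
    """Save unlocked data partitions, one partition at a time, until
    `needed` pages are free. Returns False if that cannot be achieved."""
    while count_free(pages) < needed:
        victims = [p["partition"] for p in pages if p["in_use"] and not p["locked"]]
        if not victims:
            return False  # every remaining page is locked for calculation
        victim = victims[0]
        for p in pages:
            if p["in_use"] and p["partition"] == victim:
                p["in_use"], p["partition"] = False, None  # release the page
        saved.append(victim)  # stands in for the transfer to the main memory 4
    return True


pages = [
    {"partition": "62-1", "in_use": True, "locked": True},
    {"partition": "61-1", "in_use": True, "locked": False},
    {"partition": "61-1", "in_use": True, "locked": False},
    {"partition": None, "in_use": False, "locked": False},
]
saved = []
room_for_three = make_room(pages, 3, saved)
room_for_four = make_room(pages, 4, saved)
```

The second call fails because the only page still in use is locked, which corresponds to the rule that data held by a locked page cannot be saved.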

For the output data of the subtask, the data scheduler 331 refers to the memory management table 35 and, if the number of pages necessary for the output data of the subtask requested by the task processing unit 32 can be secured from free pages, requests the data moving unit 332 to secure the memory. At this time, the data scheduler 331 also designates the accelerator on which the pages are to be secured.

On the other hand, if the pages cannot be secured from free pages, the data scheduler 331 performs the same operation as in the above-described case of securing memory for loading saved input data. That is, the data scheduler 331 first instructs the data moving unit 332 to save data partitions held by unlocked pages on the accelerator memory to the main memory 4, and then causes the data moving unit 332 to secure the number of pages needed for the output data.

Further, the data scheduler 331 requests the data moving unit 332 to lock the memory areas for the input data and output data. Further, upon receiving a processing completion notification from the task processing unit 32, the data scheduler 331 causes the data moving unit 332 to unlock the locked pages and to set the calculated flag of the output data in the data management table 34 to “1”.

Note that depending on the type of subtask that the task scheduler 321 requests to execute, only one of the input data and the output memory area may be prepared. For example, it is not necessary to prepare an output memory area for a read execution request for acquiring the contents of a data object.

The data moving unit 332 receives an instruction from the data scheduler 331, secures the memory of the accelerator, and moves data to the accelerator.

The data moving unit 332 receives the instruction from the data scheduler 331, secures the accelerator memory, and registers the memory page entry secured in the memory management table 35. In addition, the data moving unit 332 registers the accelerator number and page number corresponding to the reserved memory in the data partition entry of the data management table 34.

The data moving unit 332 receives an instruction from the data scheduler 331 and sets the lock flag of the page being used for calculation to “1”. Further, the data moving unit 332 releases the lock flag of the page for which calculation has been completed from “1” to “0”. Further, the data moving unit 332 sets the calculated flag of the output data to “1” in the data management table 34.

The data moving unit 332 receives the instruction from the data scheduler 331 and saves the data partition to the main memory 4. In this case, the data moving unit 332 sets the swap flag of the entry in the data management table 34 for the saved data partition. In addition, the data moving unit 332 clears the in-use flag of the entries in the memory management table 35 for the pages that were used by the saved data partition.

The task processing unit 32 includes a task scheduler 321 and a task execution unit 322. The task scheduler 321 requests a memory area for input data and output data necessary for execution of the subtask, and requests execution of the subtask. In addition, the task execution unit 322 causes the accelerators 51 to 53 to execute the subtask.

The task scheduler 321 receives an execution request for the subtasks included in the DAG from the program analysis unit 31. The task scheduler 321 receives requests in units of processing for a data partition. The task scheduler 321 executes the subtasks included in the received request in order from the upstream side of the DAG. In the DAG shown in FIG. 11, the subtask 71 corresponds to the upstream subtask. In the DAG, a downstream (next-stage) subtask cannot be executed until the upstream subtask is completed. The task scheduler 321 requests the data scheduler 331 for the memory areas for input data and output data necessary for each subtask to be executed. After receiving from the data scheduler 331 the notification that the data and memory areas for the requested subtask have been secured, the task scheduler 321 notifies the task execution unit 322 of the accelerator number, the input data address, and the output data write address necessary to execute the subtask, or of the entry information of the data management table 34 and the memory management table 35 necessary to obtain that information, and causes the task execution unit 322 to execute the subtask. This process is performed in units of data partitions.

When the requested subtask is appendObject for adding data to the data object held by the accelerator, the task scheduler 321 passes the information to be added to the task execution unit 322. This data is included when the program analysis unit 31 receives the DAG of the user program 21.

The task scheduler 321 receives the subtask completion notification from the task execution unit 322, and when the subtask is completed, notifies the data scheduler 331 to unlock the input data and the output data.

Furthermore, when the subtask requested of the task execution unit 322 is a read that acquires the contents of a data object held in the accelerator memory, the task scheduler 321 acquires the data from the task execution unit 322 that executed the read and transmits the acquired data to the user program 21 through the program analysis unit 31.

The task execution unit 322 receives an instruction from the task scheduler 321 and performs processing on the specified input address and output address of the specified accelerator using the kernel function of the user program 21 received from the task scheduler 321. Also, the task execution unit 322 notifies the task scheduler 321 of the completion of processing. If the requested subtask is appendObject, the task execution unit 322 adds data to the specified data object. On the other hand, when the requested subtask is a read that acquires the contents of a data object, the task execution unit 322 acquires information from the corresponding address of the designated data object and notifies the task scheduler 321 of the information.

Next, the information stored in the subtask storage unit 36 and the related functions of the task scheduler 321 and the data scheduler 331 will be described.

First, subtask classification will be described. A subtask is in one of the following four states.
(1) Waiting for I/O: a state of waiting for the input data partition of the subtask to be prepared, and for the memory of the output data partition to be secured, in the memory of the accelerator that executes the subtask (for example, the state before I/O in FIG. 4).
(2) Waiting for execution: a state in which preparation of the input data partition and securing of the memory for the output data partition are complete and the subtask is waiting to be executed by the accelerator (for example, the state in FIG. 4 in which I/O is completed and the subtask is stored in the FIFO).
(3) Running: a state in which the subtask is being executed by the processor on the accelerator (for example, the Processing state in FIG. 4).
(4) Execution complete: a state in which execution of the subtask is completed (for example, the state in which Processing in FIG. 4 is completed).

In the following, the preparation of the input data partition of the subtask and the memory allocation of the output data partition in the accelerator are referred to as “preparation of input / output data of subtask”.
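The four subtask states and their order can be written down as a small Python enumeration. The names `SubtaskState` and `next_state` are hypothetical, and the strictly linear progression is an assumption of this sketch that follows the order in which the states are listed.

```python
from enum import Enum


class SubtaskState(Enum):
    WAITING_FOR_IO = 1         # input/output data not yet prepared on the accelerator
    WAITING_FOR_EXECUTION = 2  # input/output data prepared, not yet started
    RUNNING = 3                # being executed by the processor on the accelerator
    EXECUTION_COMPLETE = 4     # execution finished


def next_state(state):
    """Return the state that follows `state` in the assumed linear life cycle."""
    if state is SubtaskState.EXECUTION_COMPLETE:
        raise ValueError("no state follows execution complete")
    return SubtaskState(state.value + 1)


state_after_io = next_state(SubtaskState.WAITING_FOR_IO)
```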

Referring to FIG. 13, the subtask storage unit 36 includes a non-executable subtask storage unit 361, an executable subtask storage unit 362, and an accelerator executable subtask storage unit 363.

The subtasks stored in the non-executable subtask storage unit 361 are those subtasks, among the subtasks included in the DAG requested to be executed by the user program 21, that are not candidates for the data scheduler 331 to prepare input/output data. Here, a subtask is not a candidate for preparing input/output data when a subtask upstream of it is waiting for I/O, or when two or more subtasks upstream of it are waiting for execution and the accelerators on which they wait for execution are not all identical. A subtask waiting for execution is one for which the data moving unit 332 has completed the preparation of input/output data at the request of the data scheduler 331 and the task scheduler 321 has been notified that execution preparation is complete, but whose execution by the task execution unit 322 has not yet started (that is, it has not been executed).

FIG. 16 shows an example of a subtask stored in the non-executable subtask storage unit 361. For example, in FIG. 16A, when the subtask “1” is waiting for I / O, the subtask “2” is stored in the non-executable subtask storage unit 361. In FIG. 16B, when subtask “a” and subtask “b” are waiting for execution by different accelerators, subtask “c” is stored in non-executable subtask storage unit 361.

The subtasks stored in the executable subtask storage unit 362 are those subtasks, among the subtasks included in the DAG requested to be executed by the user program 21, that are candidates for the data scheduler 331 to prepare input/output data and for which there is no restriction on the accelerator that prepares the input/output data. Here, a subtask with no restriction on the accelerator that prepares input/output data is either the most upstream subtask of the DAG, with no subtask upstream of it, or a subtask for which all the upstream subtasks it depends on have completed execution and whose input data partitions are held in the main memory 4 or in the accelerator memory of any accelerator.

The accelerator-executable subtask storage unit 363 includes a storage area for each accelerator. The subtasks stored in the storage area corresponding to each accelerator are those subtasks, among the subtasks included in the DAG requested to be executed by the user program 21, that can be candidates for the data scheduler 331 to prepare input/output data only on that accelerator. Here, a subtask that can be a candidate for preparing input/output data on only one accelerator is one for which every subtask it depends on is either waiting for execution or has completed execution, at least one of those subtasks is waiting for execution, and all the subtasks waiting for execution are waiting for execution by the accelerator corresponding to the area in which the subtask is stored.
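The three storage units can be read as the result of a classification over the states of the subtasks that a given subtask depends on. The following Python sketch illustrates that reading. The function `classify`, the state strings, and the returned labels are hypothetical, and any case not covered by the two candidate definitions (including an upstream subtask that is still running, which the text does not address) is conservatively treated as non-executable here.

```python
def classify(upstream):
    """Classify a subtask from its direct dependencies.

    `upstream` is a list of (state, accelerator) pairs, where state is one of
    "waiting_io", "waiting_exec", "running", or "complete".
    Returns (storage_label, accelerator_or_None).
    """
    # No dependencies (most upstream subtask) or all dependencies complete:
    # any accelerator may prepare the input/output data.
    if all(state == "complete" for state, _ in upstream):
        return ("executable", None)
    # All dependencies complete or waiting for execution, and those waiting
    # for execution all wait on one and the same accelerator.
    if all(state in ("complete", "waiting_exec") for state, _ in upstream):
        accelerators = {acc for state, acc in upstream if state == "waiting_exec"}
        if len(accelerators) == 1:
            return ("accelerator_executable", accelerators.pop())
    return ("non_executable", None)


case_most_upstream = classify([])
case_fig_16a = classify([("waiting_io", None)])
case_fig_16b = classify([("waiting_exec", 51), ("waiting_exec", 52)])
case_single_acc = classify([("waiting_exec", 51), ("complete", 52)])
```

The second and third cases correspond to the two examples given for FIG. 16: an upstream subtask waiting for I/O, and two upstream subtasks waiting for execution on different accelerators.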

The task scheduler 321 receives a subtask execution request from the program analysis unit 31. All subtasks for which an execution request has been received are in the I/O waiting state. Among these subtasks, the task scheduler 321 stores the most upstream subtasks of the DAG in the executable subtask storage unit 362 and stores the other subtasks in the non-executable subtask storage unit 361. A most upstream subtask is a subtask that does not depend on any other subtask. The task scheduler 321 notifies the data scheduler 331 that subtasks have been stored in the executable subtask storage unit 362.

Also, the task scheduler 321 receives from the data scheduler 331 a notification of a subtask whose input/output data has been prepared and which has entered the execution waiting state, together with the identifier of the accelerator on which it waits for execution, and requests the task execution unit 322 to execute the notified subtask on the notified accelerator.

Further, the task scheduler 321 receives a notification from the task execution unit 322 that the execution of a subtask has been completed and the subtask has entered the execution complete state, and notifies the data scheduler 331 to release the lock on the input data and the output memory area of the subtask. In addition, triggered by the completion of the subtask, the task scheduler 321 searches for subtasks that should be moved from the non-executable subtask storage unit 361 to the accelerator executable subtask storage unit 363, and from the accelerator executable subtask storage unit 363 to the executable subtask storage unit 362, and moves them. At this time, the task scheduler 321 notifies the data scheduler 331 that subtasks have been moved to the accelerator executable subtask storage unit 363 and the executable subtask storage unit 362. This notification is performed when a subtask has been moved into either or both of the accelerator executable subtask storage unit 363 and the executable subtask storage unit 362.

The data scheduler 331 receives the subtask execution completion notification from the task scheduler 321 and releases the lock on the input/output data partitions of the subtask. At this time, if the data moving unit 332 is not performing data input/output for the accelerator whose lock has been released, the data scheduler 331 performs the “input/output start processing” described later.

Further, when the data scheduler 331 receives from the task scheduler 321 a notification that a subtask has been newly stored in the executable subtask storage unit 362 or the accelerator-executable subtask storage unit 363, and there are accelerators for which the data moving unit 332 is not performing data input/output, the data scheduler 331 performs the “input/output start processing” described later for all of those accelerators.

Further, the data scheduler 331 receives a notification from the data moving unit 332 that preparation of the input/output data of a subtask is complete, locks the memory areas holding the input/output data partitions in the memory management table 35, puts the subtask in the execution waiting state, and notifies the task scheduler 321 that the subtask is in the execution waiting state. In addition, the data scheduler 331 performs the “input/output start processing” described later so that the accelerator that has completed the preparation of the input/output data of the subtask performs the next input/output processing.

The data scheduler 331 makes the next input / output request to the accelerator that is not performing data input / output as “input / output start processing”. The data scheduler 331 uses the prefetch determination unit 334 to determine the next input / output process requested to the accelerator.

When the prefetch determination unit 334 determines that a data partition is to be swapped out, the data scheduler 331 selects, from among the data partitions held by the accelerator, a data partition that will not be used as an input data partition in future processing of the subtasks included in the DAG, and transmits to the data moving unit 332 an instruction to save that data partition to the main memory 4. If there is no such data partition, the data scheduler 331 selects, from among the data partitions that will be used as input data partitions, the one that has been least recently referenced, and sends the data moving unit 332 an instruction to save it to the main memory 4. Selecting the least recently referenced data partition is a management method based on the LRU (Least Recently Used) policy and is common knowledge for engineers in the technical field. Note that the memory area holding the data partition to be saved must not be locked in the memory management table 35. If there is no unlocked data partition, the data scheduler 331 does nothing.
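The selection of the data partition to swap out can be sketched as follows. The function `choose_victim`, the dictionary fields, and the `last_used` timestamp are hypothetical. The text does not say how to choose among several partitions that will not be used in the future, so this sketch assumes the least recently used one is taken in that case as well.

```python
def choose_victim(partitions, future_inputs):
    """Pick the data partition to save to main memory, or None.

    `partitions`: list of dicts with 'name', 'locked', and 'last_used'
    (a larger value means more recently referenced).
    `future_inputs`: names of partitions that remaining subtasks will read.
    """
    unlocked = [p for p in partitions if not p["locked"]]
    if not unlocked:
        return None  # nothing may be saved; the scheduler does nothing
    # Prefer partitions that no remaining subtask needs as input.
    unused = [p for p in unlocked if p["name"] not in future_inputs]
    pool = unused if unused else unlocked
    # Fall back to (or, by assumption, tie-break with) least recently used.
    return min(pool, key=lambda p: p["last_used"])["name"]


held = [
    {"name": "61-1", "locked": False, "last_used": 5},
    {"name": "62-1", "locked": True, "last_used": 1},
    {"name": "62-2", "locked": False, "last_used": 3},
]
victim_unused = choose_victim(held, future_inputs={"62-2"})
victim_lru = choose_victim(held, future_inputs={"61-1", "62-2"})
```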

On the other hand, when the input/output processing determined by the prefetch determination unit 334 is an instruction to prepare a data partition, the data scheduler 331 uses the next subtask determination unit 336 to determine the subtask whose input/output data is to be prepared next on the accelerator. If the accelerator already holds an input data partition of the subtask determined by the next subtask determination unit 336 in its accelerator memory, the data scheduler 331 locks that input data partition. Further, the data scheduler 331 requests the data moving unit 332 to prepare the input data partitions that are not held by the accelerator and to secure the output data partition.

Further, the data scheduler 331 receives from the data moving unit 332 a notification that saving of a data partition to the main memory 4 is complete, and executes the input/output start processing so that the accelerator that has completed the saving performs the next data input/output.

The prefetch determination unit 334 determines, for the data scheduler 331, the input/output processing to be requested of the accelerator. The prefetch determination unit 334 refers to the memory management table 35, and if the usage of the accelerator memory is equal to or greater than a threshold value (for example, 70% to 80% of the capacity of the accelerator memory), causes the data scheduler 331 to swap out a data partition. On the other hand, if the usage is less than the threshold value, the prefetch determination unit 334 causes the data scheduler 331 to prepare a data partition.
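The threshold decision can be sketched in a few lines of Python. The function name `decide_io`, the returned labels, and the default threshold of 0.75 (a value chosen from within the 70% to 80% range given as an example) are assumptions of this sketch.

```python
def decide_io(pages_in_use, total_pages, threshold=0.75):
    """Return "swap_out" when memory usage has reached the threshold,
    otherwise "prepare" (prefetch input/output data for a next subtask)."""
    usage = pages_in_use / total_pages
    return "swap_out" if usage >= threshold else "prepare"


decision_at_threshold = decide_io(75, 100)
decision_below_threshold = decide_io(74, 100)
```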

The next subtask determination unit 336 designates, for the data scheduler 331, the subtask whose input/output data the designated accelerator is to prepare next. The next subtask determination unit 336 refers to the executable subtask storage unit 362, the accelerator executable subtask storage unit 363, and the data management table 34, and designates the subtask that minimizes the data I/O to the accelerator in the preparation of input/output data as the next subtask for which input/output data is to be prepared.

Specifically, when selecting the subtask that minimizes the data I/O of the accelerator, the next subtask determination unit 336 searches the subtasks stored in the area corresponding to that accelerator in the accelerator executable subtask storage unit 363 and the subtasks stored in the executable subtask storage unit 362. For each subtask searched, the next subtask determination unit 336 treats any input data partition that is not held in the memory of the designated accelerator as a data partition requiring I/O and adds its data capacity to the total I/O capacity. Further, regarding the output data partition, if securing the data capacity of the output data partition would cause the usage of the accelerator memory to exceed the threshold value, the next subtask determination unit 336 adds the capacity exceeding the threshold value to the total I/O capacity. This is because, when preparing the input/output data of the subtask, data partitions corresponding to the capacity exceeding the threshold value must be saved from the accelerator. The next subtask determination unit 336 selects the subtask with the smallest total I/O capacity, counted for each subtask, as the subtask that minimizes the data I/O of the accelerator.
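The counting rule for the total I/O capacity can be illustrated with a short Python sketch. The functions `total_io` and `pick_next`, the dictionary layout of a subtask, and the unit of the sizes are hypothetical. The text does not state whether newly loaded input data also counts toward the memory usage compared against the threshold; this sketch assumes only the output data partition is added to the current usage.

```python
def total_io(subtask, resident, used, threshold):
    """Total I/O capacity needed to prepare a subtask on one accelerator.

    `subtask`: dict with 'inputs' (partition name -> size) and 'output_size'.
    `resident`: names of partitions already held in the accelerator memory.
    `used`: current memory usage; `threshold`: swap-out threshold (same unit).
    """
    # Input partitions not yet on the accelerator must be loaded.
    load = sum(size for name, size in subtask["inputs"].items() if name not in resident)
    # Capacity above the threshold after securing the output must be saved out.
    overflow = max(0, used + subtask["output_size"] - threshold)
    return load + overflow


def pick_next(candidates, resident, used, threshold):
    """Select the candidate subtask with the smallest total I/O capacity."""
    return min(candidates, key=lambda s: total_io(s, resident, used, threshold))


candidates = [
    {"name": "71-1", "inputs": {"61-1": 40}, "output_size": 40},
    {"name": "71-2", "inputs": {"61-2": 40}, "output_size": 40},
]
chosen = pick_next(candidates, resident={"61-2"}, used=50, threshold=100)
```

Here the subtask named 71-2 is selected because its input data partition is already resident, so preparing it requires no transfer.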

The data moving unit 332 receives from the data scheduler 331 an instruction to prepare the input/output data of a subtask together with the designation of the accelerator on which the input/output data is to be prepared, and prepares the input/output data. Regarding the input data partition, the data moving unit 332 loads the input data partition from the main memory 4 or from another accelerator that holds the input data partition. On the other hand, regarding the output data partition, the data moving unit 332 secures the memory area necessary for outputting the data partition. Further, the data moving unit 332 updates the related information held in the memory management table 35 and the data management table 34 regarding the input/output data partitions and the memory areas used by them.

Further, the data moving unit 332 receives an instruction to save a data partition to the main memory 4 from the data scheduler 331 and saves the designated data partition to the main memory 4. Further, the data moving unit 332 updates the related information held in the memory management table 35 and the data management table 34 regarding the saved data partition and the memory area that was used by the data partition.

[Operation]
Next, the operation of this embodiment will be described in detail with reference to FIG. 8, FIG. 13, and FIG. FIG. 17 is a flowchart illustrating the operation of the accelerator control device 1 according to this embodiment.

First, the user program 21 created using the reservation API and the execution API is executed (step A1).

When the user program 21 calls the execution API (Yes in step A2), the DAG creation unit 22 proceeds to a process of notifying the DAG generated so far.

On the other hand, if it is not an execution API call (No in Step A2), the DAG creation unit 22 checks whether or not it is a reservation API call (Step A3).

If it is a reservation API call (Yes in step A3), the DAG creation unit 22 adds the task and data specified by the reservation API to the DAG generated so far (step A4).

Next, when the user program 21 ends (Yes in step A5), the execution of the user program 21 is completed.

On the other hand, when the user program 21 does not end (No in step A5), the process returns to step A1 and the execution of the user program 21 is continued.

When the execution API is called (Yes in Step A2), the DAG creation unit 22 adds the last task and data to the DAG if necessary, and notifies the program analysis unit 31 of the DAG (Step A6).

The program analysis unit 31 receives the DAG and breaks it down into its individual constituent tasks. Next, the program analysis unit 31 requests the task processing unit 32 to execute each subtask (step A7). The requested subtasks are executed in units of data partitions. For example, since the task 71 shown in FIG. 11 includes two subtasks 71-1 and 71-2, two individual tasks are generated by the program analysis unit 31 and requested of the task processing unit 32. A task for an individual data partition may also be called simply a task rather than a subtask.

The task scheduler 321 requests the memory area of input data and output data necessary for execution of the next subtask from the data management unit 33 (step A8).

The data scheduler 331 refers to the data management table 34, and determines that the data is ready if the swap flag of the requested data is not set to “1” (Yes in step A9). Then, the data scheduler 331 requests the data moving unit 332 to set the lock flag of the corresponding entry in the memory management table 35 of the memory page used by the input data.

On the other hand, when the swap flag of the requested data is set to “1” (No in step A9), the data scheduler 331 refers to the memory management table 35 and, when there is an accelerator that has a sufficient free memory area to accommodate the data saved in the main memory 4, requests the data moving unit 332 to load the input data to that accelerator. The data moving unit 332 loads the input data to the designated accelerator, and updates the swap flag, accelerator number, and page number of the corresponding data in the data management table 34 (step A10). In addition, the data scheduler 331 updates the in-use flag, the data number, and the partition number for the pages used by the loaded data in the memory management table 35. Further, the data scheduler 331 sets the lock flag to “1” in the memory management table 35.

On the other hand, when there is no accelerator that has a sufficient free memory area to accommodate the data saved in the main memory 4, the data scheduler 331 refers to the memory management table 35, selects data that uses a page for which the lock flag is not set, and requests the data moving unit 332 to save it in the main memory 4. The data moving unit 332 saves the designated data and updates the swap flag, accelerator number, and page number in the data management table 34. When data is saved in the main memory 4, the accelerator number and page number of the data become invalid. The data scheduler 331 continues to issue data save requests until the memory area necessary for loading the input data into the accelerator becomes available. Once the memory for loading the input data has been freed, the subsequent data loading process is the same as the loading process performed when there is an accelerator that has a sufficient free memory area to accommodate the data saved in the main memory 4.

Next, the data scheduler 331 checks whether or not the output memory area of the requested subtask can be secured in the accelerator that holds the input data of the subtask (step A11). If the free memory area is sufficient, it is determined that it can be secured (Yes in step A11).

On the other hand, when the free memory area is not sufficient (No in step A11), the data scheduler 331 refers to the memory management table 35 and requests the data moving unit 332 to save data that uses a page for which the lock flag is not set. The operation in which the data moving unit 332 saves the designated data (step A12) is the same as the operation for saving data in step A10.

When the memory area sufficient to accommodate the output data is available in the accelerator, the data scheduler 331 requests the data moving unit 332 to secure the output data memory (step A13).

The data moving unit 332 reserves the memory, and records the accelerator number and page number in the entry of the data management table 34 corresponding to the output data. In addition, it sets the lock flag in the memory management table 35 for the pages being used. When the memory areas for the input data and output data have been prepared on the accelerator, the data scheduler 331 notifies the task processing unit 32 of the completion of data preparation (step A14).

The task scheduler 321 receives the data preparation completion notification and requests the task execution unit 322 to execute the subtask (step A15).

When the subtask to be executed requests execution of a kernel function given by the user program 21, the task execution unit 322 executes the kernel function on the input data using the accelerator that holds the data, and outputs the result to the output memory area. On the other hand, when the subtask to be executed requests a data read, the task execution unit 322 reads the data from the accelerator that holds it and passes the data to the task scheduler 321. When the subtask to be executed requests an “append” that adds data, the task execution unit 322 writes the given data to the memory area of the accelerator that holds the data. When the task execution unit 322 completes the execution of the subtask, the task scheduler 321 notifies the data management unit 33 of the completion of the subtask (step A16).
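The three kinds of subtask request handled by the task execution unit 322 can be sketched as a simple dispatch. This is an illustrative Python sketch only: the accelerator memory is modeled as a plain dictionary of lists, and the function name `execute_subtask`, its parameters, and the request strings are assumptions, not names from the disclosure.

```python
def execute_subtask(request, acc_memory, input_key,
                    output_key=None, kernel=None, payload=None):
    """Dispatch one subtask on a dictionary that stands in for accelerator memory."""
    if request == "kernel":
        # Run the user-supplied kernel and store the result in the output area.
        acc_memory[output_key] = kernel(acc_memory[input_key])
        return None
    if request == "read":
        # Return the data so that it can be handed back to the task scheduler.
        return acc_memory[input_key]
    if request == "append":
        # Write the given data into the area that already holds the data.
        acc_memory[input_key] = acc_memory[input_key] + payload
        return None
    raise ValueError("unknown request: " + str(request))
```

For example, a kernel request followed by a read request returns the kernel output that was stored under `output_key`.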

The task scheduler 321 requests the data moving unit 332 to release the lock flag in the memory management table 35 for the input data and output data whose processing has been completed, and to set the calculated flag of the corresponding entry in the data management table 34 for the output data (step A17). The data moving unit 332 performs the requested processing.

The task scheduler 321 continues to request data on the subtask and execute the subtask until all the subtasks of the DAG requested from the program analysis unit 31 are completed (No in Step A18).

On the other hand, when the DAG processing is completed (Yes in step A18), the process returns to step A1.

Next, of the operations of the task scheduler 321 and the data scheduler 331, an operation based on information held by the subtask storage unit 36 will be described.

FIG. 18 is a sequence diagram illustrating detailed operations of the task scheduler 321 and the data scheduler 331.

Referring to FIG. 18, when the task scheduler 321 receives a subtask execution request from the program analysis unit 31, the task scheduler 321 stores the most upstream subtasks of the DAG in the executable subtask storage unit 362, and stores the other subtasks in the inexecutable subtask storage unit 361 (step B1). The task scheduler 321 notifies the data scheduler 331 that subtasks have been stored in the executable subtask storage unit 362 (step B2).

The data scheduler 331 receives, from the task scheduler 321, the notification that subtasks have been newly stored in the executable subtask storage unit 362. If there are accelerators for which the data moving unit 332 is not currently performing data input/output, the data scheduler 331 performs the “input/output start processing” for all of those accelerators (step B3).

In addition, the data scheduler 331 receives a notification from the data moving unit 332 that the preparation of the input/output data of a subtask is complete, locks the memory areas holding the input/output data partitions in the memory management table 35, puts the subtask into the execution waiting state (step B4), and notifies the task scheduler 321 that the subtask is waiting to be executed (step B5). Further, the data scheduler 331 performs the “input/output start processing” so that the accelerator that has completed the preparation of the input/output data of the subtask performs its next input/output processing (step B6).

The task scheduler 321 receives, from the data scheduler 331, the notification of the subtask whose input/output data is ready and which is waiting for execution, together with the identifier of the accelerator on which it is waiting, and requests the task execution unit 322 to execute the notified subtask on that accelerator (step B7).

Also, the task scheduler 321 receives a notification from the task execution unit 322 that the execution of the subtask has been completed and that the subtask has entered the execution complete state, and notifies the data scheduler 331 to release the locks on the input data and the output memory area of the subtask (step B8). The data scheduler 331 receives the subtask execution completion notification from the task scheduler 321 and releases the locks on the input/output data partitions of the subtask (step B9).

Further, when the execution of a subtask has been completed, the task scheduler 321 searches for subtasks that are consequently to be moved from the inexecutable subtask storage unit 361 to the accelerator executable subtask storage unit 363, and subtasks to be moved from the accelerator executable subtask storage unit 363 to the executable subtask storage unit 362, and moves them (step B10). Further, the task scheduler 321 notifies the data scheduler 331 that subtasks have been moved to the accelerator executable subtask storage unit 363 and the executable subtask storage unit 362 (step B11).

The data scheduler 331 receives, from the task scheduler 321, the notification that subtasks have been newly stored in the executable subtask storage unit 362 or the accelerator executable subtask storage unit 363 (step B11). If there are accelerators for which the data moving unit 332 is not currently performing data input/output, the data scheduler 331 performs the “input/output start processing” for all of those accelerators (step B12).
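Which storage unit a subtask belongs to can be expressed as a small classification function. The following Python sketch follows the conditions stated in the claims for the first and second storage units; the function name `storage_unit`, the state labels "complete" and "waiting", and the use of the reference numerals 361 to 363 as return values are illustrative assumptions.

```python
def storage_unit(subtask, upstream, state):
    """Return the reference numeral of the storage unit that should hold subtask.

    upstream maps a subtask to the subtasks it depends on.
    state maps a subtask to "complete", "waiting", or any other label.
    """
    ups = upstream.get(subtask, [])
    if all(state[u] == "complete" for u in ups):
        # Most upstream subtask, or every upstream subtask has been executed.
        return 362  # executable subtask storage unit
    if all(state[u] in ("complete", "waiting") for u in ups):
        # At least one upstream subtask is waiting for execution on an
        # accelerator and all remaining upstream subtasks are complete.
        return 363  # accelerator executable subtask storage unit
    return 361      # inexecutable subtask storage unit
```

Re-evaluating this function after each state change corresponds to the search and move performed in step B10.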

FIG. 19 is a flowchart illustrating the above-described “input/output start processing” (steps B3, B6, and B12 in FIG. 18) by the data scheduler 331. Referring to FIG. 19, the data scheduler 331 uses the prefetch determination unit 334 to determine the input/output processing to be requested next of the accelerator (step C1).

When the prefetch determination unit 334 decides to swap out a data partition (Yes in step C2), the data scheduler 331 selects, from among the data partitions held by the accelerator, either a data partition that will not be used as an input data partition in the processing of subtasks included in a future DAG or, among the data partitions that will be used as input data partitions, the least recently referenced data partition, and sends the data moving unit 332 an instruction to save it to the main memory 4 (step C3).

On the other hand, when the input/output processing determined by the prefetch determination unit 334 is a data partition preparation instruction (No in step C2), the data scheduler 331 uses the next subtask determination unit 336 to determine the subtask whose input/output data the accelerator is to prepare next (step C4). If the accelerator memory already holds an input data partition of the subtask determined by the next subtask determination unit 336, the data scheduler 331 locks that input data partition. Further, the data scheduler 331 requests the data moving unit 332 to prepare the input data partitions that the accelerator does not hold and to secure the output data partition (step C5).

FIG. 20 is a flowchart illustrating the operation of the prefetch determination unit 334 (step C1 in FIG. 19). Referring to FIG. 20, the prefetch determination unit 334 refers to the memory management table 35 (step D1). If the accelerator memory usage is equal to or greater than the threshold (Yes in Step D2), the prefetch determination unit 334 causes the data scheduler 331 to swap out the data partition (Step D3). On the other hand, when it is less than the threshold (No in Step D2), the prefetch determination unit 334 causes the data scheduler 331 to prepare the data partition (Step D4).
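The decision flow of FIG. 19 and FIG. 20 reduces to one threshold comparison per accelerator. The Python sketch below is a hedged illustration: `io_start`, `pick_victim`, and `pick_next_subtask` are hypothetical names, and the two callbacks stand in for the partition selection of step C3 and the next subtask determination of step C4.

```python
def io_start(used_bytes, threshold_bytes, pick_victim, pick_next_subtask):
    """Return the next I/O request for one accelerator (FIG. 19 and FIG. 20)."""
    if used_bytes >= threshold_bytes:
        # Yes in step D2: swap out a data partition (steps D3 and C3).
        return ("save", pick_victim())
    # No in step D2: prepare data for the next subtask (steps D4, C4, C5).
    return ("prepare", pick_next_subtask())
```

Usage at or above the threshold triggers a save request; usage below the threshold triggers preparation for the next subtask.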

FIG. 21 is a flowchart illustrating the operation of the next subtask determination unit 336 (step C4 in FIG. 19). Referring to FIG. 21, the next subtask determination unit 336 searches the area corresponding to the accelerator in the accelerator executable subtask storage unit 363 and the subtasks stored in the executable subtask storage unit 362, and selects one subtask (step E1).

The next subtask determination unit 336 calculates the total I/O amount required on the accelerator memory when the selected subtask is executed on the accelerator. Here, the next subtask determination unit 336 calculates the total I/O amount as
“the amount of input data to be loaded into the accelerator” + “the amount of data swapped out from the accelerator”.

The next subtask determination unit 336 regards an input data partition that the memory of the specified accelerator does not hold as a data partition that requires I/O, and counts its data amount toward “the amount of input data to be loaded into the accelerator” (step E2).

In addition, the next subtask determination unit 336 calculates the second term of the above expression, “the amount of data swapped out from the accelerator”, as
“the amount of input data to be loaded (the first term of the above expression)” + “the size of the output area to be secured on the accelerator memory” - “the free space up to the threshold of the accelerator memory at the load destination”
(step E3). As an example, if the free memory capacity up to the threshold is 1 GB, the input data to be newly loaded into the accelerator is 500 MB, and the output area to be secured is 1 GB, the second term, “the amount of data swapped out from the accelerator”, is
500 MB (input data to load) + 1 GB (output area to secure) - 1 GB (free space) = 500 MB.

When the next subtask determination unit 336 has completed the processing of steps E1 to E3 for all the subtasks stored in the area corresponding to the accelerator in the accelerator executable subtask storage unit 363 and in the executable subtask storage unit 362 (Yes in step E4), it selects the subtask with the smallest total I/O amount as the subtask that minimizes data I/O for the accelerator (step E5).
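Steps E1 to E5 can be sketched in a few lines of Python. The sketch assumes decimal units (1 GB = 1000 MB) and clamps the swap-out amount at zero when the free space is large enough; the clamp, the candidate dictionary layout, and the names `total_io` and `next_subtask` are assumptions made for illustration and are not stated in the description.

```python
MB = 10**6
GB = 10**9


def total_io(inputs, output_bytes, resident, free_bytes):
    """Total I/O amount for one candidate subtask (steps E2 and E3).

    inputs maps an input data partition name to its size in bytes.
    resident is the set of partitions already held in the accelerator memory.
    """
    # First term: input partitions that the accelerator memory does not hold.
    load = sum(size for name, size in inputs.items() if name not in resident)
    # Second term: load + output area - free space up to the threshold.
    # Clamping at zero is an assumption for the case of ample free space.
    swap_out = max(0, load + output_bytes - free_bytes)
    return load + swap_out


def next_subtask(candidates, resident, free_bytes):
    """Pick the candidate with the smallest total I/O amount (steps E4 and E5)."""
    best = min(candidates,
               key=lambda c: total_io(c["inputs"], c["output"], resident, free_bytes))
    return best["name"]
```

With the figures from the example above (500 MB to load, a 1 GB output area, and 1 GB of free space), the swap-out term is 500 MB and the total I/O amount is 1000 MB.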

According to the accelerator control device 1 of the present embodiment, while the task scheduler 321 executes subtasks, the data scheduler 331 selects, as the next task, the task that minimizes the amount of data input/output to the accelerator memory, and continues preparing data input/output for the selected task. This makes it possible to effectively utilize the I/O bandwidth between the accelerator memory and the main memory 4 while reducing the input/output of data between them. Therefore, the accelerator control device of the present embodiment makes it possible to avoid data I/O becoming a bottleneck in the processing of tasks using an accelerator having an accelerator memory, and to speed up the processing.

In this embodiment, a single piece of data is divided and held across a plurality of accelerators, the processing of the user program is divided, and the processing is distributed to the accelerators holding the respective data partitions. This reduces the cost of loading data into the accelerators and reduces the processing time in accordance with the number of accelerators used.

<Embodiment 2>
Next, an accelerator control device according to the second embodiment will be described. The accelerator control device of this embodiment has the same configuration as the accelerator control device 1 (FIGS. 8 to 21) of the first embodiment and performs the same operation, so only the difference will be described.

In the first embodiment, the task scheduler 321 receives, from the task execution unit 322, the notification that the execution of a subtask (step B7 in FIG. 18) has been completed, searches for subtasks that are consequently to be moved from the inexecutable subtask storage unit 361 to the accelerator executable subtask storage unit 363 and subtasks to be moved from the accelerator executable subtask storage unit 363 to the executable subtask storage unit 362, and moves them (step B10 in FIG. 18). In the present embodiment, by contrast, when the task scheduler 321 receives a notification from the data scheduler 331 that a subtask is in the execution waiting state (step B5 in FIG. 18), it searches for subtasks to be moved from the inexecutable subtask storage unit 361 to the accelerator executable subtask storage unit 363, and moves them. The task scheduler 321 also notifies the data scheduler 331 that subtasks have been moved to the accelerator executable subtask storage unit 363.

Furthermore, instead of the task scheduler 321, the data scheduler 331 may search for and move the subtasks to be moved from the inexecutable subtask storage unit 361 to the accelerator executable subtask storage unit 363. That is, when the data scheduler 331 has locked the input/output data partitions and a subtask has entered the execution waiting state (step B4 in FIG. 18), the data scheduler 331 may search for the subtasks that are consequently to be moved from the inexecutable subtask storage unit 361 to the accelerator executable subtask storage unit 363, and move them.

According to the present embodiment, the task scheduler 321 also adds, to the accelerator executable subtask storage unit 363, subsequent-stage subtasks that become executable when a subtask enters the “execution complete” state (see FIG. 4). As a result, at the point when a task is in the “execution waiting” state, before it enters the “execution complete” state, the data scheduler 331 can also treat the subsequent task that becomes executable upon completion of that task as a candidate for starting the preparation of input/output data. Therefore, according to the present embodiment, the data scheduler 331 can start preparing input/output data for a subsequent subtask earlier than in the first embodiment. Consequently, compared with the first embodiment, the I/O (Input/Output) bandwidth between the accelerator memory and the external memory can be used even more effectively, and the processing of tasks using an accelerator having a memory can be further sped up.

<Embodiment 3>
Next, a third embodiment will be described. In this embodiment, the operation of the accelerator control device 1 according to the first and second embodiments is performed on a computer having a CPU (Central Processing Unit) and a memory. Specifically, the CPU is caused to perform the functions of the user program 21, the DAG (Directed Acyclic Graph) creation unit 22, the program analysis unit 31, the task scheduler 321, the task execution unit 322, the data scheduler 331, and the data moving unit 332. The memory of the computer, on the other hand, is used as the data management table 34, the memory management table 35, the subtask storage unit 36, and the main memory 4. Here, the memory is a storage means in a broad sense, and includes semiconductor memory as well as hard disks and flash disks, which are generally called secondary storage. The accelerator is inserted into an I/O (Input/Output) slot of the computer. Alternatively, the accelerator and the computer can be connected using an interconnect for I/O devices.

The present invention can be applied to, for example, an application for speeding up the processing of a computing device including one or more accelerators.

It should be noted that the entire disclosure of the above patent document is incorporated herein by reference. Within the scope of the entire disclosure (including the claims) of the present invention, the embodiments can be changed and adjusted based on its basic technical concept. Further, various combinations or selections of the various disclosed elements (including each element of each claim, each element of each embodiment, each element of each drawing, etc.) are possible within the framework of the entire disclosure of the present invention. That is, the present invention of course includes various variations and modifications that could be made by those skilled in the art according to the entire disclosure, including the claims, and the technical concept. In particular, with respect to the numerical ranges described in this document, any numerical value or sub-range included in a range should be construed as being specifically described even if there is no specific description.

1, 10 Accelerator control device
3 Accelerator control unit
4 Main memory
8 Information processing device
11 Task storage unit
12 Data scheduler
13 Task scheduler
14 First storage unit
15 Second storage unit
21 User program
22 DAG creation unit
31 Program analysis unit
32 Task processing unit
33 Data management unit
34 Data management table
35 Memory management table
36 Subtask storage unit
51 to 53 Accelerators
61 to 66 Data
61-1 to 61-4, 62-1 to 62-4, 63-1 to 63-4 Data partitions
71 to 74 Tasks
71-1 to 71-4, 72-1 to 72-4 Subtasks
81 Shared memory
321 Task scheduler
322 Task execution unit
331 Data scheduler
332 Data moving unit
334 Prefetch determination unit
336 Next subtask determination unit
361 Inexecutable subtask storage unit
362 Executable subtask storage unit
363 Accelerator executable subtask storage unit
511 to 513 Processors
521 to 523 Accelerator memories
821 to 823 Accelerators

Claims

A task storage unit that holds executable tasks;
A data scheduler that selects, from the executable tasks, a task having a relatively small amount of data input to and output from a memory when executed on an accelerator having the memory, and instructs the accelerator to prepare for data input/output in the memory for the selected task;
A task scheduler that instructs the accelerator to execute the selected task, and adds, to the task storage unit, a task that becomes executable upon completion of the selected task,
The data scheduler continues, in accordance with the use state of the memory, the selection of the next task from the executable tasks held by the task storage unit and the preparation of data input/output for the selected next task.
An accelerator control device characterized by that.
The data scheduler selects, from the tasks held by the task storage unit, a task having a relatively small sum of the amount of input data to be loaded into the memory and the amount of output data to be saved from the memory to an external memory when executed on the accelerator,
The accelerator control device according to claim 1.
The data scheduler continues the selection of the next task and the preparation of data input/output for the selected next task when the usage of the memory is less than a predetermined threshold value.
The accelerator control device according to claim 1.
The task storage unit includes:
A first storage unit that holds a task that is executable and whose execution destination accelerator is not restricted; and
A second storage unit that holds a task whose execution destination accelerator is restricted,
The data scheduler selects the task having a relatively small amount of input/output data to/from the memory when executed on an accelerator from among the tasks held by the second storage unit whose execution destination is restricted to that accelerator and the tasks held by the first storage unit,
The accelerator control device according to any one of claims 1 to 3.
The first storage unit holds a most upstream task or a task for which all upstream tasks have been executed,
The second storage unit holds, as a task whose execution destination accelerator is restricted, a task for which at least one upstream task is waiting to be executed by the accelerator and the execution of all the remaining upstream tasks is completed,
The accelerator control device according to claim 4.
The task scheduler updates the tasks held by the first and/or second storage unit when the execution of the selected task is completed;
The accelerator control device according to claim 5.
The data scheduler or the task scheduler updates the tasks held by the second storage unit when preparation for data input/output for the selected task is completed.
The accelerator control device according to claim 5 or 6.
Holding an executable task in a storage unit;
Selecting, from the executable tasks, a task having a relatively small amount of data input to and output from a memory when executed on an accelerator having the memory, and instructing the accelerator to prepare for data input/output in the memory for the selected task;
Instructing the accelerator to execute the selected task, and adding, to the storage unit, a task that becomes executable upon completion of the selected task;
Selecting the next task from among the executable tasks held by the storage unit according to the use status of the memory, and continuing the preparation of data input/output for the selected next task,
An accelerator control method characterized by the above.
A task having a relatively small sum of the amount of input data to be loaded into the memory and the amount of output data to be saved from the memory to an external memory when executed on the accelerator is selected from the tasks held by the storage unit,
The accelerator control method according to claim 8.
If the usage of the memory is below a predetermined threshold, the selection of the next task and the preparation of data input/output for the selected next task are continued.
The accelerator control method according to claim 8 or 9.
A first task that is an executable task and whose execution destination accelerator is not restricted is held in the storage unit;
A second task whose execution destination accelerator is restricted is held in the storage unit, and
A task having a relatively small amount of input/output data to/from the memory when executed on the accelerator is selected from among the second tasks whose execution destination is restricted to that accelerator and the first tasks,
The accelerator control method according to any one of claims 8 to 10.
The first task is the most upstream task or a task in which all upstream tasks have been executed,
The second task is a task whose execution destination accelerator is restricted, for which at least one of the upstream tasks is waiting to be executed by the accelerator and the execution of all the remaining upstream tasks is completed,
The accelerator control method according to claim 11.
The first and/or second task held by the storage unit is updated when the execution of the selected task is completed.
The accelerator control method according to claim 12.
The second task held by the storage unit is updated when preparation for data input/output for the selected task is completed.
The accelerator control method according to claim 11 or 12.
A process of storing executable tasks in the storage unit;
A process of selecting, from the executable tasks, a task having a relatively small amount of data input to and output from a memory when executed on an accelerator having the memory, and instructing the accelerator to prepare for data input/output in the memory for the selected task;
A process of instructing the accelerator to execute the selected task, and adding, to the storage unit, a task that becomes executable upon completion of the selected task;
A process of selecting the next task from among the executable tasks held by the storage unit according to the use status of the memory, and continuing the preparation of data input/output for the selected next task,
A program that causes a computer to execute the above processes.