WO2019188180A1 - Scheduling method and scheduling device

Info

Publication number: WO2019188180A1
Application number: PCT/JP2019/009632
Authority: WO
Inventors: 雅史九里; 英樹杉本
Original assignee: 株式会社デンソー; 株式会社エヌエスアイテクス
Priority date: 2018-03-30
Filing date: 2019-03-11
Publication date: 2019-10-03
Also published as: JP2019179417A

Abstract

The present invention is provided with: a tag reference unit (141) for reading tag information indicating the content of memory access used for each of a plurality of operations at a processing node; and an allocation unit (142) for determining the processing order of the plurality of operations on the basis of the tag information.

Description

Scheduling method and scheduling apparatus

Cross-reference of related applications

This application is based on Japanese Patent Application No. 2018-068434 filed on March 30, 2018, and claims the benefit of priority thereto, the entire contents of which are incorporated herein by reference.

The present disclosure relates to a scheduling method and a scheduling device for executing a program having a graph structure composed of a plurality of processing nodes.

The invention described in Patent Document 1 below has been proposed for the purpose of improving the use efficiency of cache memory. Patent Document 1 discloses a processor-readable cache memory control program that causes a processor to execute a cache memory control process for controlling a cache memory by dividing it into a shared cache area and a dedicated cache area. The cache memory control process has a cache area allocation step and a dedicated cache area release step. In the cache area allocation step, in response to a dedicated area acquisition request that requests allocation of a dedicated cache area, the dedicated cache area is allocated when the cache effective usage degree is high and the shared cache area is allocated when it is low. The cache effective usage degree is based on the memory access frequency and on the difference between the cache hit rate when a dedicated cache area is allocated and the cache hit rate when a shared cache area is allocated. In the dedicated cache area release step, the allocation of the dedicated cache area is released in response to a dedicated area release request that requests release of the allocated dedicated cache area.

Patent Document 1: JP 2015-36873 A

In Patent Document 1, the cache memory is controlled by dividing it into a shared cache area and a dedicated cache area, so the technique cannot be applied where the cache memory is not divided into such areas. In particular, when a graph-structured program consisting of multiple processing nodes is executed, a large amount of parallel processing is performed. It is therefore necessary to utilize the entire cache memory efficiently without dividing it into areas such as a shared cache area and a dedicated cache area.

An object of the present disclosure is to enable efficient use of a cache memory even when a large amount of parallel processing is performed.

The present disclosure is a scheduling method for executing a program having a graph structure composed of a plurality of processing nodes, and includes a tag reference step of reading tag information indicating the contents of memory access used for each of a plurality of operations in the processing nodes, and an allocation step of determining a processing order of the plurality of operations based on the tag information.

The present disclosure is also a scheduling device for executing a program having a graph structure composed of a plurality of processing nodes, and includes a tag reference unit that reads tag information indicating the contents of memory access used for each of a plurality of operations in the processing nodes, and an allocation unit that determines a processing order of the plurality of operations based on the tag information.

According to the present disclosure, reading the tag information makes it possible to grasp the memory access status of each of the plurality of operations, and the processing order of the plurality of operations can therefore be determined so as to reduce rewriting of the cache memory.

FIG. 1 is a diagram for explaining parallel processing which is a premise of the present embodiment.
FIG. 2 is a diagram showing a system configuration example for executing the parallel processing shown in FIG. 1.
FIG. 3 is a diagram illustrating a configuration example of the DFP used in FIG. 2.
FIG. 4 is a diagram for explaining a functional configuration example of the compiler.
FIG. 5 is a diagram for explaining a functional configuration example of the thread scheduler.
FIG. 6 is a diagram for explaining conventional scheduling.
FIG. 7 is a diagram for explaining the state of memory access when processing is performed based on conventional scheduling.
FIG. 8 is a diagram for explaining scheduling according to the present embodiment.
FIG. 9 is a diagram for explaining the state of memory access when processing is performed based on scheduling according to the present embodiment.

Hereinafter, the present embodiment will be described with reference to the accompanying drawings. In order to facilitate the understanding of the description, the same constituent elements in the drawings will be denoted by the same reference numerals as much as possible, and redundant description will be omitted.

FIG. 1A shows a program code having a graph structure, FIG. 1B shows a thread state, and FIG. 1C shows a state of parallel processing.

As shown in FIG. 1A, the program to be processed in this embodiment has a graph structure in which data and processing are divided. This graph structure maintains the task parallelism and graph parallelism of the program.

When automatic vectorization and graph structure extraction are performed by a compiler on the program code shown in FIG. 1A, a large number of threads as shown in FIG. 1B can be generated.

The large number of threads shown in FIG. 1B can be executed in parallel as shown in FIG. 1C through dynamic register placement and thread scheduling by hardware. By dynamically allocating register resources during execution, a plurality of threads can be executed in parallel for different instruction streams.

Next, a data processing system 2, which is a system configuration example including a DFP (Data Flow Processor) 10 as an accelerator for performing dynamic register placement and thread scheduling, will be described with reference to FIG. 2.

The data processing system 2 includes a DFP 10, an event handler 20, a host CPU 21, a ROM 22, a RAM 23, an external interface 24, and a system bus 25. The host CPU 21 is an arithmetic unit that mainly performs data processing. The host CPU 21 supports the OS. The event handler 20 is a part that generates an interrupt process.

The ROM 22 is a read-only memory. The RAM 23 is a readable and writable memory. The external interface 24 is an interface for exchanging information with the outside of the data processing system 2. The system bus 25 transmits and receives information among the DFP 10, the host CPU 21, the ROM 22, the RAM 23, and the external interface 24.

The DFP 10 is positioned as an individual master provided to cope with the heavy computation load of the host CPU 21. The DFP 10 is configured to support the interrupt generated by the event handler 20.

Next, the DFP 10 will be described with reference to FIG. 3. As shown in FIG. 3, the DFP 10 includes a command unit 12, a thread scheduler 14, an execution core 16, and a memory subsystem 18.

The command unit 12 is configured to be able to communicate information with the config interface. The command unit 12 also functions as a command buffer.

The thread scheduler 14 is a part that schedules processing of a large number of threads as exemplified in FIG. 1B. The thread scheduler 14 can perform scheduling across threads.

The execution core 16 has four processing elements, PE#0, PE#1, PE#2, and PE#3. The execution core 16 has a number of pipelines that can be scheduled independently.

The memory subsystem 18 includes an arbiter 181, an L1 cache 18a, and an L2 cache 18b. The memory subsystem 18 is configured to allow information communication between the system bus interface and the ROM interface.

Subsequently, the compiler 50 will be described with reference to FIG. 4. The compiler 50 includes a locality detection unit 501 and a tag addition unit 502 as functional components.

The locality detection unit 501 is a part that detects the memory information used in each processing node constituting the graph structure. The tag addition unit 502 is a part that assigns tag information based on the memory information detected by the locality detection unit 501. The tag information indicates the contents of memory access used for each of a plurality of operations in the processing node.
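The division of labor between the two compiler units can be illustrated with a minimal sketch. It assumes that each operation is given as a simple assignment string and that a tag is simply the set of variable names the operations touch; the function names `detect_locality` and `add_tag` are hypothetical and do not appear in the disclosure, which does not specify how the compiler selects the accesses to record.

```python
import re


def detect_locality(statements):
    """Collect every variable name read or written by the statements.

    Assumption: operations are plain assignments such as "d = a + b".
    """
    names = set()
    for stmt in statements:
        names.update(re.findall(r"[A-Za-z_]\w*", stmt))
    return names


def add_tag(statements):
    """Attach a tag (sorted list of variable names) to a processing group."""
    return {"ops": list(statements), "tag": sorted(detect_locality(statements))}


group = add_tag(["d = a + b", "e = a + c"])
print("TAG:", ", ".join(group["tag"]))  # prints "TAG: a, b, c, d, e"
```

A real compiler would work on its intermediate representation rather than on source strings; the string form is used here only to keep the sketch self-contained.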

Subsequently, functional components of the thread scheduler 14 will be described with reference to FIG. 5. The thread scheduler 14 includes a tag reference unit 141 and an allocation unit 142 as functional components.

The tag reference unit 141 is a part that reads tag information indicating the contents of memory access used for each of a plurality of operations in the processing node.

The allocation unit 142 is a part that determines the processing order of a plurality of operations based on the tag information.

Before the processing of the tag reference unit 141 and the allocation unit 142 in the present embodiment is described, a conventional processing method that does not use tag information will be described for comparison with reference to FIGS. 6 and 7.

FIG. 6A shows a program for the thread 1. In the program for the thread 1, there are “d = a + b” and “e = a + c” as the processing group Gr1, and “g = b + c” as the processing group Gr2.

FIG. 6B shows the state of the memory area. In the memory area, “a”, “b”, “c”, “d”, “e”, and “f” are stored, and “g (for thread 1)”, “h (for thread 1)”, and “g (for thread 2)” are provided as storage areas for calculation results.

FIG. 6C shows the state of the cache area. The cache line 1 has a holding area for “a”, “b”, and “c”, and the cache line 2 has a holding area for “d”, “e”, and “f”.

FIG. 6D shows a program for the thread 2. In the program for the thread 2, there is “f = a + d” as the processing group Gr3 and “g = b + c” as the processing group Gr4.

In this situation, the transition of the cache area when processing group Gr1, processing group Gr2, processing group Gr3, and processing group Gr4 are processed in this order will be described with reference to FIG. 7.

FIG. 7 shows the status of the cache area. In executing the processing group Gr1, “a”, “b”, and “c” are held in the cache line 1. When the processing group Gr1 is processed, “d = a + b” and “e = a + c” are calculated, and the cache line 2 holds “d”, “e”, and “f”, including the calculation results “d” and “e”.

When processing group Gr2 is processed subsequent to processing group Gr1, cache line 2 is rewritten because it is necessary to store “g (for thread 1)”, which is the operation result of “g = b + c”.

When processing group Gr3 is processed subsequent to processing group Gr2, cache line 2 is rewritten because it is necessary to store “f” as the operation result.

When the processing group Gr4 is processed subsequent to the processing group Gr3, the cache line 2 is rewritten because it is necessary to store “g (for thread 2)” as the operation result.

An example of avoiding such frequent cache rewriting will be described with reference to FIGS. 8 and 9.

FIG. 8A shows a program for the thread 1. In the program for the thread 1, there are “d = a + b” and “e = a + c” as the processing group Gr1, and “g = b + c” as the processing group Gr2. “TAG: a, b, c, d, e” is assigned as tag information to the processing group Gr1. “TAG: c, g (thread 1)” is assigned as tag information to the processing group Gr2.

FIG. 8B shows the state of the memory area. In the memory area, “a”, “b”, “c”, “d”, “e”, and “f” are stored, and “g (for thread 1)”, “h (for thread 1)”, and “g (for thread 2)” are provided as storage areas for calculation results.

FIG. 8C shows the status of the cache area. The cache line 1 has a holding area for “a”, “b”, and “c”, and the cache line 2 has a holding area for “d”, “e”, and “f”.

FIG. 8D shows a program for the thread 2. In the program for the thread 2, there is “f = a + d” as the processing group Gr3 and “g = b + c” as the processing group Gr4. “TAG: a, d, f” is assigned as tag information to the processing group Gr3. “TAG: c, g (thread 2)” is assigned as tag information to the processing group Gr4.

Once the tag information has been assigned in this way, the processing order is set so that rewriting of the cache memory is reduced. As an example, it is determined that the processing group Gr1 is executed first. Subsequently, it is determined that the processing group Gr3 is processed next, because its tag information “TAG: a, d, f” has many elements in common with the tag information “TAG: a, b, c, d, e” of the processing group Gr1.

Subsequently, a search is made for a processing group whose tag information has many elements in common with the tag information “TAG: a, d, f” of the processing group Gr3. Since there is no such processing group, the processing group Gr2 and the processing group Gr4 are executed in their original order.
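The ordering decision described above can be sketched as a greedy selection. This is only one possible reading: it assumes that "many elements in common" means the largest set intersection with the tag of the group executed last, and that ties, including the case of no overlap at all, fall back to the original program order. The function name `order_by_tag_overlap` and the labels `g1` and `g2`, standing for “g (thread 1)” and “g (thread 2)”, are hypothetical.

```python
def order_by_tag_overlap(groups):
    """Return group names in execution order.

    groups: list of (name, tag_set) pairs in original program order.
    """
    if not groups:
        return []
    remaining = list(groups)
    current = remaining.pop(0)  # the first group keeps its position
    order = [current[0]]
    while remaining:
        # max() returns the first maximal element, so ties (including
        # zero overlap) are resolved in favor of the original order.
        nxt = max(remaining, key=lambda g: len(g[1] & current[1]))
        remaining.remove(nxt)
        order.append(nxt[0])
        current = nxt
    return order


groups = [
    ("Gr1", {"a", "b", "c", "d", "e"}),
    ("Gr2", {"c", "g1"}),
    ("Gr3", {"a", "d", "f"}),
    ("Gr4", {"c", "g2"}),
]
print(order_by_tag_overlap(groups))  # ['Gr1', 'Gr3', 'Gr2', 'Gr4']
```

With the tags of FIG. 8, Gr3 shares two elements with Gr1 while Gr2 and Gr4 share one each, so Gr3 is chosen second; neither Gr2 nor Gr4 overlaps with Gr3, so they follow in their original order.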

The transition of the cache area when processing group Gr1, processing group Gr3, processing group Gr2, and processing group Gr4 are processed in this order will be described with reference to FIG. 9.

In executing the processing group Gr1, “a”, “b”, and “c” are held in the cache line 1. When the processing group Gr1 is processed, “d = a + b” and “e = a + c” are calculated, and the cache line 2 holds “d”, “e”, and “f”, including the calculation results “d” and “e”.

When the processing group Gr3 is processed subsequent to the processing group Gr1, the data used for the processing is already stored in the cache line, so that the cache is not rewritten.

When executing the processing group Gr2 subsequent to the processing group Gr3, it is necessary to store “g (for thread 1)” which is the calculation result of “g = b + c”, so the cache line 2 is rewritten.

When the processing group Gr4 is executed subsequent to the processing group Gr2, cache rewriting does not occur because “g (for thread 2)” is secured.
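The difference between the two processing orders can be checked with a small cache simulation. The model below is an assumption made for illustration and is not taken from the disclosure: the cache has two lines, replacement is least-recently-used, each memory block fills exactly one line, and “g (for thread 1)”, “h (for thread 1)”, and “g (for thread 2)” (written `g1`, `h1`, `g2`) share one block. The accesses of each group are taken from the tag information of FIG. 8, and only replacements of a line that already holds data are counted as rewrites.

```python
from collections import OrderedDict

# Assumed mapping from variables to memory blocks (one block per cache line).
BLOCK = {"a": "A", "b": "A", "c": "A",
         "d": "B", "e": "B", "f": "B",
         "g1": "C", "h1": "C", "g2": "C"}

# Accesses per processing group, following the tag information of FIG. 8.
TAGS = {"Gr1": ["a", "b", "c", "d", "e"],
        "Gr2": ["c", "g1"],
        "Gr3": ["a", "d", "f"],
        "Gr4": ["c", "g2"]}


def count_rewrites(order, num_lines=2):
    """Count how often a cache line holding data is replaced."""
    cache = OrderedDict()  # keys are blocks, least recently used first
    rewrites = 0
    for group in order:
        for var in TAGS[group]:
            block = BLOCK[var]
            if block in cache:
                cache.move_to_end(block)  # hit: mark as most recently used
                continue
            if len(cache) >= num_lines:
                cache.popitem(last=False)  # evict the least recently used block
                rewrites += 1
            cache[block] = None
    return rewrites


print(count_rewrites(["Gr1", "Gr2", "Gr3", "Gr4"]))  # 3
print(count_rewrites(["Gr1", "Gr3", "Gr2", "Gr4"]))  # 1
```

Under these assumptions the conventional order rewrites a line at Gr2, Gr3, and Gr4, while the tag-based order rewrites a line only at Gr2, which matches the transitions described for FIG. 7 and FIG. 9.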

As described above, the present embodiment is a scheduling method for executing a program having a graph structure composed of a plurality of processing nodes, and includes a tag reference step of reading tag information indicating the contents of memory access used for each of a plurality of operations in the processing nodes, and an allocation step of determining a processing order of the plurality of operations based on the tag information.

As a device, the present embodiment is the thread scheduler 14, which serves as a scheduling device for executing a program having a graph structure composed of a plurality of processing nodes, and includes the tag reference unit 141 that reads tag information indicating the contents of memory access used for each of a plurality of operations in the processing nodes, and the allocation unit 142 that determines a processing order of the plurality of operations based on the tag information.

In this embodiment, reading the tag information makes it possible to grasp the memory access status of each of the plurality of operations, so that the processing order of the plurality of operations can be determined so as to reduce rewriting of the cache memory.

The embodiment has been described above with reference to specific examples. However, the present disclosure is not limited to these specific examples. Design modifications made as appropriate by those skilled in the art to these specific examples are also included in the scope of the present disclosure as long as they have the features of the present disclosure. The elements included in each of the specific examples described above, and their arrangement, conditions, shape, and the like, are not limited to those illustrated and can be changed as appropriate. The elements included in each of the specific examples described above can be combined as appropriate as long as no technical contradiction occurs.

Claims

A scheduling method for executing a program having a graph structure composed of a plurality of processing nodes, the method comprising:
a tag reference step of reading tag information indicating the contents of memory access used for each of a plurality of operations in the processing node; and
an allocation step of determining a processing order of the plurality of operations based on the tag information.
A scheduling apparatus for executing a program having a graph structure composed of a plurality of processing nodes, the apparatus comprising:
a tag reference unit (141) that reads tag information indicating the contents of memory access used for each of a plurality of operations in the processing node; and
an allocation unit (142) that determines a processing order of the plurality of operations based on the tag information.