WO2019188180A1 - Scheduling method and scheduling device - Google Patents

Scheduling method and scheduling device

Info

Publication number
WO2019188180A1
Authority
WO
WIPO (PCT)
Prior art keywords
processing
cache
tag
tag information
thread
Prior art date
Application number
PCT/JP2019/009632
Other languages
English (en)
Japanese (ja)
Inventor
Masafumi Kuri (雅史 九里)
Hideki Sugimoto (英樹 杉本)
Original Assignee
DENSO CORPORATION (株式会社デンソー)
NSITEXE, Inc. (株式会社エヌエスアイテクス)
Priority date
Filing date
Publication date
Application filed by DENSO CORPORATION and NSITEXE, Inc.
Publication of WO2019188180A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00 - Arrangements for software engineering
    • G06F 8/40 - Transformation of program code
    • G06F 8/41 - Compilation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 - Arrangements for program control, e.g. control units
    • G06F 9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 - Multiprogramming arrangements
    • G06F 9/48 - Program initiating; Program switching, e.g. by interrupt

Definitions

  • The present disclosure relates to a scheduling method and a scheduling device for executing a program having a graph structure composed of a plurality of processing nodes.
  • The invention described in Patent Document 1 below was proposed for the purpose of improving the use efficiency of a cache memory.
  • Patent Document 1 describes a processor-readable cache memory control program that causes a processor to execute a cache memory control process in which the cache memory is managed by dividing it into a shared cache area and dedicated cache areas.
  • That control process has a cache area allocation step which, in response to a dedicated-area acquisition request, computes an effective cache usage from the memory access frequency and the difference between the cache hit rate when a dedicated cache area is allocated and the cache hit rate when a shared cache area is allocated; the higher the effective cache usage, the more dedicated cache area is allocated, and the lower it is, the more the shared cache area is allocated.
  • The control process also has a dedicated-cache-area release step that, in response to a request to release an allocated dedicated cache area, releases the allocation of that dedicated cache area.
  • Because Patent Document 1 controls the cache memory by dividing it into a shared cache area and dedicated cache areas, it cannot be applied where the cache is not divided into such areas. In particular, when a graph-structured program consisting of multiple processing nodes is executed, a large amount of parallel processing is performed, so the entire cache memory must be used efficiently without partitioning it into a shared cache area and dedicated cache areas.
  • The present disclosure therefore aims to enable efficient use of a cache memory even when a large amount of parallel processing is performed.
  • In one aspect, the present disclosure is a scheduling method for executing a program having a graph structure composed of a plurality of processing nodes, comprising a tag reference step of reading tag information indicating the contents of the memory accesses used by each of a plurality of operations in the processing nodes, and an allocation step of determining the processing order of the plurality of operations based on the tag information.
  • In another aspect, the present disclosure is a scheduling device for executing a program having a graph structure composed of a plurality of processing nodes, comprising a tag reference unit that reads tag information indicating the contents of the memory accesses used by each of a plurality of operations in the processing nodes, and an allocation unit that determines the processing order of the plurality of operations based on the tag information.
  • According to the present disclosure, reading the tag information makes it possible to grasp the memory access pattern of each of the plurality of operations, so the processing order of the operations can be determined so as to reduce rewriting of the cache memory; a minimal illustrative sketch of this idea follows.
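  • As an illustration only, and not part of the published application: the tag reference step and allocation step above can be pictured with the Python sketch below, in which each processing group's tag information is modeled as the set of memory operands it accesses, and the allocator greedily selects the pending group whose tags overlap most with those of the group just executed. The name schedule_by_tags and the greedy rule are assumptions made for this sketch; the application does not prescribe a specific algorithm.

        # Illustrative sketch only: tag information is modeled as a set of the
        # memory operands each processing group accesses.
        def schedule_by_tags(groups):
            """groups: dict mapping group name -> set of tag entries (operands)."""
            pending = dict(groups)
            order = []
            # Tag reference step: read the tag information of each group.
            current = next(iter(pending))
            tags = pending[current]
            while pending:
                order.append(current)
                del pending[current]
                if not pending:
                    break
                # Allocation step (assumed greedy rule): pick the pending group
                # whose tags share the most entries with the group just executed,
                # so that data already resident in the cache is reused.
                current = max(pending, key=lambda g: len(pending[g] & tags))
                tags = pending[current]
            return order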
  • FIG. 1 is a diagram for explaining parallel processing which is a premise of the present embodiment.
  • FIG. 2 is a diagram showing a system configuration example for executing the parallel processing shown in FIG.
  • FIG. 3 is a diagram illustrating a configuration example of the DFP used in FIG. 2.
  • FIG. 4 is a diagram for explaining a functional configuration example of the compiler.
  • FIG. 5 is a diagram for explaining a functional configuration example of the thread scheduler.
  • FIG. 6 is a diagram for explaining conventional scheduling.
  • FIG. 7 is a diagram for explaining the state of memory access when processing is performed based on conventional scheduling.
  • FIG. 8 is a diagram for explaining scheduling according to the present embodiment.
  • FIG. 9 is a diagram for explaining the state of memory access when processing is performed based on scheduling according to the present embodiment.
  • FIG. 1A shows a program code having a graph structure.
  • FIG. 1B shows a thread state.
  • FIG. 1C shows a state of parallel processing.
  • The program to be processed in this embodiment has a graph structure in which data and processing are divided. This graph structure maintains the task parallelism and graph parallelism of the program.
  • Parallel execution as shown in FIG. 1C can be performed on a large number of threads shown in FIG. 1B by dynamic register placement and thread scheduling by hardware. By dynamically allocating register resources during execution, a plurality of threads can be executed in parallel for different instruction streams.
  • A data processing system 2, which is a system configuration example including a DFP (Data Flow Processor) 10 as an accelerator for performing dynamic register placement and thread scheduling, will be described with reference to FIG. 2.
  • The data processing system 2 includes the DFP 10, an event handler 20, a host CPU 21, a ROM 22, a RAM 23, an external interface 24, and a system bus 25.
  • The host CPU 21 is an arithmetic unit that mainly performs data processing.
  • The host CPU 21 supports the OS.
  • The event handler 20 is a part that generates interrupt processing.
  • The ROM 22 is a read-only memory.
  • The RAM 23 is a read/write memory.
  • The external interface 24 is an interface for exchanging information with the outside of the data processing system 2.
  • The system bus 25 is used to transmit and receive information among the DFP 10, the host CPU 21, the ROM 22, the RAM 23, and the external interface 24.
  • The DFP 10 is positioned as an individual master provided to cope with the heavy computation load of the host CPU 21.
  • The DFP 10 is configured to handle the interrupts generated by the event handler 20.
  • The DFP 10 includes a command unit 12, a thread scheduler 14, an execution core 16, and a memory subsystem 18.
  • The command unit 12 is configured to be able to exchange information with the config interface.
  • The command unit 12 also functions as a command buffer.
  • The thread scheduler 14 is a part that schedules the processing of a large number of threads, as exemplified in FIG. 1B.
  • The thread scheduler 14 can perform scheduling across threads.
  • The execution core 16 has four processing elements, PE#0, PE#1, PE#2, and PE#3.
  • The execution core 16 has a number of pipelines that can be scheduled independently.
  • The memory subsystem 18 includes an arbiter 181, an L1 cache 18a, and an L2 cache 18b.
  • The memory subsystem 18 is configured to be able to exchange information with the system bus interface and the ROM interface.
  • The compiler 50 includes a locality detection unit 501 and a tag assignment unit 502 as functional components.
  • The locality detection unit 501 is a part that detects the memory information used in each processing node constituting the graph structure.
  • The tag assignment unit 502 is a part that assigns tag information based on the memory information detected by the locality detection unit 501.
  • The tag information indicates the contents of the memory accesses used by each of a plurality of operations in the processing node; a sketch of this compiler-side flow follows.
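  • As a non-authoritative illustration of the flow just described: the locality detection unit 501 can be thought of as collecting the memory operands used in each processing node, and the tag assignment unit 502 as attaching that set as tag information. The names ProcessingGroup, detect_locality, and assign_tag below are hypothetical, chosen only for this sketch.

        # Illustrative sketch only; these identifiers do not come from the
        # application.
        from dataclasses import dataclass, field

        @dataclass
        class ProcessingGroup:
            name: str
            operands: list                  # memory operands the operations touch
            tag: set = field(default_factory=set)

        def detect_locality(group: ProcessingGroup) -> set:
            # Locality detection unit 501: gather the memory operands used by
            # the operations in the processing node.
            return set(group.operands)

        def assign_tag(group: ProcessingGroup) -> ProcessingGroup:
            # Tag assignment unit 502: attach the detected operand set as tag
            # information (compare "TAG: a, b, c, d, e" for Gr1 in FIG. 8A).
            group.tag = detect_locality(group)
            return group

        gr1 = assign_tag(ProcessingGroup("Gr1", ["a", "b", "c", "d", "e"]))
        print(sorted(gr1.tag))              # ['a', 'b', 'c', 'd', 'e']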
  • The thread scheduler 14 includes a tag reference unit 141 and an allocation unit 142 as functional components.
  • The tag reference unit 141 is a part that reads the tag information indicating the contents of the memory accesses used by each of a plurality of operations in the processing node.
  • The allocation unit 142 is a part that determines the processing order of the plurality of operations based on the tag information.
  • FIG. 6A shows a program for thread 1.
  • FIG. 6B shows the state of the memory area.
  • The memory area stores “a”, “b”, “c”, “d”, “e”, and “f”, and provides “g (for thread 1)”, “h (for thread 1)”, and “g (for thread 2)” as storage areas for the results of the calculation.
  • FIG. 6C shows the state of the cache area.
  • Cache line 1 has holding areas for “a”, “b”, and “c”, and cache line 2 has holding areas for “d”, “e”, and “f”.
  • FIG. 6D shows a program for thread 2.
  • FIG. 7 shows the state of the cache area when processing is performed based on the conventional scheduling.
  • “a”, “b”, and “c” are held in cache line 1.
  • FIG. 8A shows a program for thread 1.
  • “TAG: a, b, c, d, e” is assigned as tag information to the processing group Gr1.
  • “TAG: c, g (thread 1)” is assigned as tag information to the processing group Gr2.
  • FIG. 8B shows the state of the memory area.
  • The memory area stores “a”, “b”, “c”, “d”, “e”, and “f”, and provides “g (for thread 1)”, “h (for thread 1)”, and “g (for thread 2)” as storage areas for the results of the calculation.
  • FIG. 8C shows the state of the cache area.
  • Cache line 1 has holding areas for “a”, “b”, and “c”, and cache line 2 has holding areas for “d”, “e”, and “f”.
  • FIG. 8D shows a program for thread 2.
  • “TAG: a, d, f” is assigned as tag information to the processing group Gr3.
  • “TAG: c, g (thread 2)” is assigned as tag information to the processing group Gr4.
  • The processing order is set so as to reduce rewriting of the cache memory.
  • The processing group Gr1 is executed first.
  • It is then determined that the processing group Gr3, whose tag information “TAG: a, d, f” has the most elements in common with the tag information “TAG: a, b, c, d, e” of the processing group Gr1, is to be processed next, as the check below illustrates.
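  • As an illustrative check of that selection (using the tag sets from FIGS. 8A and 8D; the overlap measure is an assumption, since the application does not specify one): the overlap of each remaining group's tags with Gr1's tags is 1 for Gr2 ({c}), 2 for Gr3 ({a, d}), and 1 for Gr4 ({c}), so Gr3 is chosen.

        # g1 / g2 abbreviate "g (thread 1)" / "g (thread 2)" from FIG. 8.
        gr1_tags = {"a", "b", "c", "d", "e"}      # Gr1 is executed first
        tags = {
            "Gr2": {"c", "g1"},
            "Gr3": {"a", "d", "f"},
            "Gr4": {"c", "g2"},
        }
        overlap = {g: len(t & gr1_tags) for g, t in tags.items()}
        print(overlap)                             # {'Gr2': 1, 'Gr3': 2, 'Gr4': 1}
        print(max(overlap, key=overlap.get))       # Gr3 -> processed next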
  • As a method, the present embodiment is a scheduling method for executing a program having a graph structure composed of a plurality of processing nodes, comprising a tag reference step of reading tag information indicating the contents of the memory accesses used by each of a plurality of operations in the processing nodes, and an allocation step of determining the processing order of the plurality of operations based on the tag information.
  • As a device, the present embodiment is the thread scheduler 14, a scheduling device for executing a program having a graph structure composed of a plurality of processing nodes, comprising the tag reference unit 141, which reads tag information indicating the contents of the memory accesses used by each of a plurality of operations in the processing nodes, and the allocation unit 142, which determines the processing order of the plurality of operations based on the tag information.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Devices For Executing Special Programs (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The present invention comprises: a tag reference unit (141) for reading tag information indicating the contents of the memory access used for each of a plurality of operations at a processing node; and an allocation unit (142) for determining the processing order of the plurality of operations on the basis of the tag information.
PCT/JP2019/009632 2018-03-30 2019-03-11 Scheduling method and scheduling device WO2019188180A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2018-068434 2018-03-30
JP2018068434A JP2019179417A (ja) 2018-03-30 2018-03-30 Scheduling method and scheduling device

Publications (1)

Publication Number Publication Date
WO2019188180A1 (fr) 2019-10-03

Family

ID=68061547

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2019/009632 WO2019188180A1 (fr) 2018-03-30 2019-03-11 Scheduling method and scheduling device

Country Status (2)

Country Link
JP (1) JP2019179417A (ja)
WO (1) WO2019188180A1 (fr)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0844577A * 1994-07-26 1996-02-16 Sumisho Electron Kk Data division method and multiprocessor system
JPH10116198A * 1996-09-30 1998-05-06 Nec Corp Method for improving cache locality of programs
JP2008004082A * 2006-05-26 2008-01-10 Matsushita Electric Ind Co Ltd Compiler device, compiling method and compiler program

Also Published As

Publication number Publication date
JP2019179417A (ja) 2019-10-17

Similar Documents

Publication Publication Date Title
  • JP6294586B2 (ja) System and method for managing the combined execution of instruction threads
  • TWI537831B (zh) Multi-core processor, method for performing a process switch, method for securing a memory block, apparatus for enabling transactional processing using a multi-core device, and method for performing memory transactional processing
  • US9378069B2 Lock spin wait operation for multi-threaded applications in a multi-core computing environment
  • JP2012104140A (ja) Sharing of processor execution resources in a wait state
  • US9229765B2 Guarantee real time processing of soft real-time operating system by instructing core to enter a waiting period prior to transferring a high priority task
  • JP2009528610A (ja) Method and apparatus for dynamically resizing cache partitions based on the execution phase of tasks
  • JP2008152470A (ja) Data processing system and semiconductor integrated circuit
  • US10545890B2 Information processing device, information processing method, and program
  • JP5391422B2 (ja) Memory management method, computer system, and program
  • US20150268985A1 Low Latency Data Delivery
  • JP2009223842A (ja) Virtual machine control program and virtual machine system
  • JP4253796B2 (ja) Computer and control method
  • WO2023077875A1 (fr) Method and apparatus for executing kernels in parallel
  • WO2019188180A1 (fr) Scheduling method and scheduling device
  • JP6156379B2 (ja) Scheduling device and scheduling method
  • JP2013114538A (ja) Information processing device, information processing method, and control program
  • JP4017005B2 (ja) Arithmetic device
  • WO2019188177A1 (fr) Information processing device
  • JP7064367B2 (ja) Deadlock avoidance method and deadlock avoidance device
  • WO2019188182A1 (fr) Prefetch control device
  • JP6364827B2 (ja) Information processing device, resource access method thereof, and resource access program
  • JP2010026575A (ja) Scheduling method, scheduling device, and multiprocessor system
  • JP2013149108A (ja) Information processing device, control method therefor, and program
  • JP7080698B2 (ja) Information processing device
  • JP4878050B2 (ja) Computer and control method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19776200

Country of ref document: EP

Kind code of ref document: A1

122 Ep: pct application non-entry in european phase

Ref document number: 19776200

Country of ref document: EP

Kind code of ref document: A1