JP7217341B2

JP7217341B2 - Processor and register inheritance method

Info

Publication number: JP7217341B2
Application number: JP2021514866A
Authority: JP
Inventors: 一嘉石渡; 学根本
Original assignee: Denso Corp; NSI Texe Inc
Current assignee: Denso Corp; NSI Texe Inc
Priority date: 2019-04-18
Filing date: 2020-04-01
Publication date: 2023-02-02
Anticipated expiration: 2040-04-01
Also published as: WO2020213397A1; JPWO2020213397A1

Description

The present disclosure relates to a processor and a register inheritance method.

Cross-references to related applications

This application is based on Japanese Patent Application No. 2019-079383, filed in Japan on April 18, 2019, and claims the benefit of priority thereto. The entire contents of that patent application are incorporated herein by reference.

Conventionally, processing that places a heavy load on a CPU (Central Processing Unit) is sometimes divided into multiple threads and processed in parallel to improve response time and throughput. In heavy-load processing divided into multiple threads, each time a preceding thread finishes, its operation result is stored from a register to memory. The operation result is then reloaded from memory into a register, and processing of the subsequent thread starts (hereinafter, "load/store processing").

In this load/store processing, PEs (Processing Elements) can access the memory only one at a time. In particular, when there is only one memory, contention for memory access caused by load/store processing occurs frequently. Consequently, in heavy-load processing with many threads, load/store processing takes longer. That is, load/store processing is costly, and the more threads there are, the greater its impact on processing speed.

One conceivable way to eliminate the memory access contention caused by load/store processing is to increase the number of memories. However, this is not a fundamental solution, because it increases processor area and cost and adds complexity. For example, Patent Document 1 proposes a device for inheriting register contents in a processor in order to mitigate the load/store problem described above for scalar operations.

Patent Document 1: JP-A-2000-020326

Patent Document 1 targets scalar operations. In vector operations, unlike scalar operations, the register area used for an operation is not fixed but varies with the vector length, so a register area cannot be specified simply by register number. The scalar technique therefore cannot be applied to vector operations.

An object of the present disclosure is to provide a processor and a register inheritance method that reduce load/store processing as much as possible even in vector operations and thereby improve processing performance.

The present disclosure employs the following technical means to solve the above problem. The reference signs in parentheses in the claims are an example showing the correspondence with the specific means described in the embodiments below as one aspect, and do not limit the technical scope of the present disclosure.

To achieve the above object, a processor according to the present disclosure includes: a plurality of arithmetic units; a thread scheduler that distributes threads to the plurality of arithmetic units; a register file shared by the plurality of arithmetic units; a register controller that allocates, in the register file, the register areas used by the threads; and a management table that stores information identifying a thread and a register in association with the address of the register area to which that register is allocated. When a register used by a preceding thread is to be inherited by a subsequent thread, the register controller does not release the register area to be inherited even after the preceding thread has finished, and rewrites the management table information associated with the address of that register area to information identifying the subsequent thread and register.

With this configuration, the register controller manages registers inherited from a preceding thread by a subsequent thread using information that identifies the thread and the register. In vector operations, the register size (capacity) varies with the vector length, so a register area cannot be specified simply by register number. With the configuration of the present disclosure, however, a subsequent thread can inherit a register even when the register capacity varies with the vector length. As a result, load/store processing can be reduced as much as possible even in multithreaded vector operations, and processing performance can be improved.

FIG. 1 is a block diagram showing the configuration of a processor according to the first embodiment.
FIG. 2A is a conceptual diagram of the register file before registers are allocated to threads according to the first embodiment.
FIG. 2B is a conceptual diagram of the register file after registers are allocated to the first, second, and third threads according to the first embodiment.
FIG. 2C is a conceptual diagram of the register sizes after registers are allocated to the first thread according to the first embodiment.
FIG. 2D is a conceptual diagram of the register sizes after registers are allocated to the second thread according to the first embodiment.
FIG. 3A is a diagram showing an example of the management table used by the processor according to the first embodiment.
FIG. 3B is a diagram showing an example of the management table when the value of register number v0 is inherited from a preceding thread by a subsequent thread according to the first embodiment.
FIG. 4 is a flowchart of the register processing that the processor according to the first embodiment performs when a thread finishes.
FIG. 5 is a flowchart of the register allocation processing that the processor according to the first embodiment performs when a thread starts.
FIG. 6 is a diagram for explaining the saving and reloading of registers according to the second embodiment.
FIG. 7 is a flowchart of the register allocation processing that the processor according to the second embodiment performs when a thread starts.

The embodiments are described below with reference to the drawings. The embodiments described below are examples of carrying out the present invention and do not limit the present invention to the specific configurations described. In carrying out the present invention, a specific configuration suited to the embodiment may be adopted as appropriate.

(First embodiment)
[Processor configuration]
FIG. 1 is a block diagram showing the configuration of a processor according to the first embodiment. The processor 100 has a cache 102, a register file 104, an arithmetic unit 106, a register controller 108, a thread scheduler 110, and a local RAM 112.

The processor 100 of the first embodiment processes a large number of threads obtained by analyzing and dividing a graph-structured program. Because the threads are generated by dividing the graph structure, the output of one thread may be used as the input of a subsequent thread, so operation results must be handed over between threads. This handover must also reduce load/store processing in order to improve processing performance.

The processor 100 therefore allocates register resources dynamically to the many threads while letting registers be inherited between threads. This allows the processor 100 to assign multiple threads to the PEs 114 and execute them in parallel, even for different instruction streams, which further improves processing performance.

The cache 102 is placed between the host CPU and the arithmetic unit 106. The cache 102 may communicate with a system bus interface or a ROM interface.

The register file 104 stores operation data. Hereinafter, each individual register allocated to a thread is called a "register", and the set of registers allocated to a thread is called a "register area".

The register file 104 is described concretely with reference to FIGS. 2A to 2D, focusing on the registers used in vector operations. FIG. 2A is a conceptual diagram of the register file 104 before registers are allocated to threads. FIG. 2B is a conceptual diagram of the register file 104 after registers are allocated to the first thread t1, the second thread t2, and the third thread t3. FIG. 2C is a conceptual diagram of the register sizes after registers are allocated to the first thread t1, and FIG. 2D is a conceptual diagram of the register sizes after registers are allocated to the second thread t2.

As shown in FIG. 2A, before register areas are allocated, the register file 104 has no register area allocated to any of the first thread t1, the second thread t2, or the third thread t3.

In contrast, as shown in FIG. 2B, after register areas are allocated, the register file 104 has a contiguous register area allocated to each of the first thread t1, the second thread t2, and the third thread t3.

Unlike a scalar operation, which operates on one element at a time, a vector operation treats multiple data elements as one register and applies a single operation to all of them simultaneously. The number of data elements is called the vector length. The size of a register used for vector operations depends on this vector length. The register size is therefore determined by the maximum size of the registers allocated to the thread.

For example, the first thread t1 has a larger vector length than the second thread t2. Accordingly, the registers v0 to v31 allocated to the first thread t1 (FIG. 2C) are larger than the registers v0 to v31 allocated to the second thread t2 (FIG. 2D).
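The relationship between vector length and register size can be illustrated with a small calculation. This is a minimal sketch, not taken from the patent: the 4-byte element width and the vector lengths 16 and 8 are assumed values, and the helper names are hypothetical. Only the 32 registers v0 to v31 come from the description.

```python
NUM_VECTOR_REGISTERS = 32  # v0..v31, as in FIGS. 2C and 2D
ELEMENT_BYTES = 4          # assumption: 32-bit elements


def register_bytes(vector_length: int) -> int:
    """Size of one vector register holding `vector_length` elements."""
    return vector_length * ELEMENT_BYTES


def thread_area_bytes(vector_length: int) -> int:
    """Size of the contiguous area holding all vector registers of a thread."""
    return NUM_VECTOR_REGISTERS * register_bytes(vector_length)


t1_area = thread_area_bytes(16)  # hypothetical vector length for thread t1
t2_area = thread_area_bytes(8)   # hypothetical vector length for thread t2
```

Because the two areas differ in size, a register number alone does not determine where a register starts in the register file, which is why the description pairs register numbers with addresses.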

Returning to FIG. 1, the arithmetic unit 106 is an execution unit that executes thread processing. The arithmetic unit 106 has multiple PEs 114. Each PE 114 is an individual computing element. At least one of the PEs 114 is a vector computing element that performs vector operations.

The register controller 108 allocates register areas to threads. The management table 116 manages the registers used for thread processing. The PEs 114 refer to the management table 116 to secure registers and to access them.

The management table 116 stores information identifying a thread and a register in association with the address of the register area to which that register is allocated. Specifically, the management table 116 links a key to a register address. The key consists of a thread ID (identifier) and a register number. In other words, the management table 116 tracks where in the register file the value of each register used by the thread identified by the thread ID is stored.

FIG. 3A is a diagram showing an example of the management table 116 used by the processor 100 according to the first embodiment. As shown in FIG. 3A, the register with register number v0 of thread 0 is allocated the area starting at address 0xaaaa1234 (row 202), the register with register number v1 is allocated the area starting at address 0xbbbb1234 (row 204), and the register with register number v2 is allocated the area starting at address 0xcccc1234 (row 206).
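The table of FIG. 3A can be pictured as a mapping from a (thread ID, register number) key to a start address. The following is a minimal sketch under that reading. The Python dictionary, the `Key` alias, and the `lookup` helper are illustrative assumptions, not structures named in the patent; the addresses are the ones given for rows 202 to 206.

```python
from typing import Dict, Tuple

Key = Tuple[int, str]  # (thread ID, register number)

# Contents of the management table as described for FIG. 3A.
management_table: Dict[Key, int] = {
    (0, "v0"): 0xAAAA1234,  # row 202
    (0, "v1"): 0xBBBB1234,  # row 204
    (0, "v2"): 0xCCCC1234,  # row 206
}


def lookup(table: Dict[Key, int], thread_id: int, reg: str) -> int:
    """Return the start address of the area backing a thread's register."""
    return table[(thread_id, reg)]


address_of_v0 = lookup(management_table, 0, "v0")
```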

The register controller 108 releases registers, for example, after a thread has finished. Specifically, when a thread finishes, the register controller 108 releases the areas that the thread had secured by setting a flag on them indicating that they are usable.

In the first embodiment, the register controller 108 can also let a subsequent thread inherit a register used by a thread by not releasing the register area even after that thread has finished. Specifically, the register controller 108 sets an unusable flag on the register to be inherited so that it is not released. At the same time, the register controller 108 rewrites the management table 116 information associated with the address of the register area to be inherited to the thread ID of the subsequent thread and the register number. Because the register number is thereby linked to the register address, registers can be inherited even when the register size is variable.

The rewriting of the management table 116 is described concretely with reference to FIG. 3B. FIG. 3B is a diagram showing an example of the management table 116 when the value of register number v0 is inherited from a preceding thread by a subsequent thread according to the first embodiment.

As shown in FIG. 3B, register number v0 of the preceding thread 0 is inherited by the subsequent thread 1, so an unusable flag is set on register number v0 of thread 0 and its register area is held without being released. The key in the management table 116 is changed from "thread 0_v0" to "thread 1_v0" (row 302). As a result, the PE 114 processing thread 1 refers to address 0xaaaa1234 as register number v0 of thread 1. Because this area has been held as unusable, the value of register number v0 used by the preceding thread 0 can be read from it.

In this way, even in a vector operation, the preceding thread's operation result stored in register v0 at address 0xaaaa1234 is inherited by the subsequent thread without being saved to memory. Needless to say, the value of register v0 may be overwritten once processing of the subsequent thread 1 has started.

Register numbers v1 and v2 in FIG. 3A, on the other hand, are not inherited by the subsequent thread 1, so their allocated areas are released. Specifically, a flag indicating that the areas are usable is set.

As shown in FIG. 3B, register numbers v1 and v2 of thread 1 inherit nothing from thread 0, so new areas are allocated to them. Register number v1 of the subsequent thread 1 is allocated the area at address 0xbbbb5678 (row 304), and register number v2 is allocated the area at address 0xcccc5678 (row 306).
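The transition from FIG. 3A to FIG. 3B amounts to re-keying one entry while leaving its address untouched. The sketch below illustrates this with a plain dictionary; the `rekey` helper and the dictionary representation are assumptions for illustration, and removing the entries for v1 and v2 is one possible way to model their release.

```python
# State of the management table as described for FIG. 3A.
table = {
    (0, "v0"): 0xAAAA1234,
    (0, "v1"): 0xBBBB1234,
    (0, "v2"): 0xCCCC1234,
}


def rekey(tbl, old_key, new_key):
    """Move an entry to a new key; the address (and the data there) is unchanged."""
    tbl[new_key] = tbl.pop(old_key)


rekey(table, (0, "v0"), (1, "v0"))      # row 302: thread 1 inherits v0
del table[(0, "v1")], table[(0, "v2")]  # not inherited: areas are released
table[(1, "v1")] = 0xBBBB5678           # row 304: newly allocated area
table[(1, "v2")] = 0xCCCC5678           # row 306: newly allocated area
```

After these steps, a lookup of v0 for thread 1 yields the same address that thread 0 wrote to, so no store to memory or reload is involved in the handover.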

Returning to FIG. 1, the thread scheduler 110 assigns threads to the PEs 114. Specifically, the thread scheduler 110 passes a thread to a PE 114. The PE 114 refers to the management table 116 held by the register controller 108, secures the registers for processing the thread, and processes the thread.

When registers are to be inherited, the thread scheduler 110 also specifies to the register controller 108 the correspondence between the subsequent thread that follows the preceding thread and the register numbers that the subsequent thread inherits. On receiving this specification, the register controller 108 rewrites the management table 116 so that the subsequent thread inherits the registers used by the preceding thread.

The local RAM 112 is a readable and writable volatile memory. It stores the operation results of the arithmetic unit 106 and communicates with the system bus interface. It is also used to store all the data saved from the register file 104, as described later.

[Processor operation]
The flows showing the operation of the processor 100 are described below. FIG. 4 is a flowchart of the register processing that the processor 100 according to the first embodiment performs when a thread finishes. The flow starts when a PE 114 has processed the preceding thread.

First, the thread scheduler 110 specifies the subsequent thread to the register controller 108, and the register controller 108 determines whether any register is to be inherited by the subsequent thread (step S102).

If the register controller 108 determines that a register is to be inherited by the subsequent thread (step S102: Yes), it sets an unusable flag on the inherited register (step S104).

The register controller 108 then rewrites the key in the management table 116 from the preceding thread's key to the subsequent thread's key (step S106), sets a usable flag on the registers other than the inherited register (step S108), and the flow ends.

If, on the other hand, the thread scheduler 110 does not specify a subsequent thread to the register controller 108 and the register controller 108 determines that no register is to be inherited (step S102: No), it sets a usable flag on the registers that the finishing thread was using (step S110), and the flow ends.
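The flow of FIG. 4 can be sketched as a single function. This is an illustrative sketch, not the patented implementation: the function name `on_thread_end`, the `usable` dictionary that models the per-area flags, and the choice to drop released entries from the table are all assumptions. The step labels in the comments follow the description above.

```python
from typing import Dict, Optional, Set, Tuple

Key = Tuple[int, str]  # (thread ID, register number)


def on_thread_end(table: Dict[Key, int], usable: Dict[int, bool],
                  prev: int, nxt: Optional[int], inherited: Set[str]) -> None:
    """Handle the registers of thread `prev` when it finishes."""
    prev_keys = [k for k in table if k[0] == prev]
    if nxt is not None and inherited:                  # S102: Yes
        keep = [k for k in prev_keys if k[1] in inherited]
        drop = [k for k in prev_keys if k[1] not in inherited]
        for k in keep:
            usable[table[k]] = False                   # S104: mark unusable
        for k in keep:
            table[(nxt, k[1])] = table.pop(k)          # S106: rewrite the key
        for k in drop:
            usable[table.pop(k)] = True                # S108: release the rest
    else:                                              # S102: No
        for k in prev_keys:
            usable[table.pop(k)] = True                # S110: release all


# Hypothetical usage with two registers, of which v0 is inherited by thread 1.
table = {(0, "v0"): 0x1000, (0, "v1"): 0x2000}
usable = {0x1000: False, 0x2000: False}
on_thread_end(table, usable, prev=0, nxt=1, inherited={"v0"})
```

Keeping the flag update and the key rewrite in the same function mirrors the description, in which both are performed by the register controller 108 at the end of the preceding thread.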

FIG. 5 is a flowchart of the register allocation processing that the processor 100 according to the first embodiment performs when a thread starts. The flow starts when the thread scheduler 110 assigns the subsequent thread to a PE 114.

First, the register controller 108 allocates registers to register areas whose usable flag is set in the management table 116 (step S202). Registers inherited from the preceding thread need not be allocated here. The register controller 108 rewrites the key and address in the management table 116 to the subsequent thread's key and the address of the register that the subsequent thread uses.

The PE 114 then refers to the management table 116 and processes the subsequent thread (step S204), and the flow ends.
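The flow of FIG. 5 can be sketched in the same style. The sketch below is a simplification under stated assumptions: each usable area is treated as a fixed slot large enough for one register (the description allows sizes to vary with vector length), the lowest free address is chosen first, and an allocated area is flagged unusable. The name `on_thread_start` and the `usable` dictionary are hypothetical.

```python
from typing import Dict, List, Tuple

Key = Tuple[int, str]  # (thread ID, register number)


def on_thread_start(table: Dict[Key, int], usable: Dict[int, bool],
                    thread: int, needed: List[str]) -> Dict[str, int]:
    """Allocate areas for the registers `thread` needs and return its view."""
    for reg in needed:
        if (thread, reg) in table:
            continue  # inherited register: already mapped, nothing to allocate
        free = sorted(addr for addr, ok in usable.items() if ok)
        if not free:
            raise RuntimeError("no usable register area")
        usable[free[0]] = False            # S202: take a usable area
        table[(thread, reg)] = free[0]     # record key and address
    # S204: the PE resolves every register through the table.
    return {reg: table[(thread, reg)] for reg in needed}


# Hypothetical usage: thread 1 has inherited v0 and needs v1 and v2 as well.
table = {(1, "v0"): 0xAAAA1234}
usable = {0xAAAA1234: False, 0xBBBB5678: True, 0xCCCC5678: True}
view = on_thread_start(table, usable, 1, ["v0", "v1", "v2"])
```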

In this way, by using register inheritance, the processor 100 can pass values obtained by processing a preceding thread on to a subsequent thread. This eliminates load/store processing for the inherited data and improves the processing performance of the processor 100.

The processor 100 also manages inherited registers using register addresses in addition to register numbers. Register inheritance can therefore be used even in vector operations, where register capacity varies with the vector length.

(Second embodiment)
[Processor configuration]
Next, the processor 100 according to the second embodiment is described. Its basic configuration is the same as that of the processor 100 of the first embodiment (see FIG. 1). Unlike the first embodiment, however, the processor 100 of the second embodiment performs processing to allocate contiguous register areas when fragmentation occurs in the register file 104.

Specifically, the register controller 108 determines, for example, whether a contiguous register area can be secured in the register file 104. If it determines that a contiguous register area cannot be secured because of fragmentation, it temporarily saves the register data stored in the register file 104 to the local RAM 112.

It then reloads the data saved in the local RAM 112 into the register file 104 and allocates contiguous areas to the registers, which eliminates the fragmentation, and updates the addresses of the registers to be inherited in the management table 116.

The saving and reloading of registers is described concretely with reference to FIG. 6. In FIG. 6, the area of the register file 104 represents its capacity. The register controller 108 secures, in the register file 104, the register areas used to process threads; for example, it secures register areas for the operations of thread t1 and thread t2.

The register file 104 has enough free space in total for the registers that thread t3 needs. However, the free space is fragmented, so it may not be possible to secure a contiguous free register area in which thread t3 can be processed.

In particular, when register inheritance is repeated, register areas are released only partially, which makes fragmentation of the register areas likely. This makes it difficult for the register controller 108 to secure a contiguous register area in which a thread can be processed.

When a contiguous area cannot be secured for the registers a thread needs, the register controller 108 saves all the data stored in the register file 104 to the local RAM 112.

The register controller 108 then reloads all the data from the local RAM 112 and re-secures a register area for each thread. In doing so, it secures the register areas for thread t1 and thread t2 so as to leave as few gaps between them as possible. A contiguous register area is thereby secured for the registers needed by the subsequent thread t3.

When reloading all the data from the local RAM 112, the register controller 108 writes the correspondence between threads, registers, and register area addresses into the management table 116. Thus, even when registers are reallocated and their addresses change, a thread can still access its registers through the keys in the management table 116, so processing is unaffected. If the register file 104 lacks sufficient free capacity in the first place, the register reallocation described above is executed once a running thread completes, its register area is released, and the free capacity needed for the next thread can be secured.
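The save-and-reload step can be sketched as a compaction routine over a byte array. This is a minimal model under several assumptions: the register file is a `bytearray`, addresses are byte offsets into it, a separate `sizes` dictionary records each register's size, and registers are repacked from offset 0 in their original address order. None of these details are specified in the patent, and the names `compact`, `sizes`, and `local_ram` are hypothetical.

```python
from typing import Dict, Tuple

Key = Tuple[int, str]  # (thread ID, register number)


def compact(register_file: bytearray, table: Dict[Key, int],
            sizes: Dict[Key, int]) -> int:
    """Pack live registers contiguously; return the start of the free space."""
    # Save: copy every live register out of the register file (local RAM).
    local_ram = {key: bytes(register_file[addr:addr + sizes[key]])
                 for key, addr in table.items()}
    # Reload: write the registers back without gaps and update the table.
    next_free = 0
    for key in sorted(table, key=table.get):
        data = local_ram[key]
        register_file[next_free:next_free + len(data)] = data
        table[key] = next_free  # the key is unchanged, only the address moves
        next_free += len(data)
    return next_free


# Hypothetical usage: two 4-byte registers separated by a 4-byte hole.
rf = bytearray(16)
rf[0:4] = b"AAAA"
rf[8:12] = b"BBBB"
table = {(1, "v0"): 0, (2, "v0"): 8}
sizes = {(1, "v0"): 4, (2, "v0"): 4}
free_start = compact(rf, table, sizes)
```

Because threads reach their registers only through the table keys, the change of address for thread 2's register is invisible to the thread, which matches the statement that processing is unaffected.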

[Processor operation]
The operation of the processor 100 according to the second embodiment differs from that of the first embodiment only in the second flow, that is, the flow performed when a thread starts. Only this difference is described below.

FIG. 7 is a flowchart of the register allocation processing that the processor 100 according to the second embodiment performs when a thread starts. Unlike the flow of the first embodiment, this flow includes steps of saving all data from the register file 104 to the local RAM 112 and then reloading all the data into the register file 104 when a contiguous register area cannot be secured. The flow starts when the thread scheduler 110 assigns the subsequent thread to a PE 114.

First, the register controller 108 determines whether a contiguous register area for processing the subsequent thread can be secured (step S302).

If the register controller 108 determines that a contiguous register area for processing the subsequent thread can be secured (step S302: Yes), the PE 114 allocates registers to register areas whose usable flag is set (step S304).

After that, the register controller 108 updates the key and address in the management table 116 to the key of the subsequent thread and the address of the register that processes the subsequent thread (step S310).

On the other hand, if the register controller 108 determines that a contiguous register area for processing the subsequent thread cannot be secured (step S302: No), it saves all data from the register file 104 to the local RAM 112 (step S306), reloads all the data from the local RAM 112 into the register file 104 (step S308), and re-secures a contiguous register area for each thread.

After that, as in the Yes branch, the register controller 108 updates the key and address in the management table 116 to the key of the subsequent thread and the address of the register that processes the subsequent thread (step S310).

After the register controller 108 has secured the registers in the register area and the management table 116 has stored the registers and their addresses, the PE 114 processes the subsequent thread (step S312), and the flow ends.
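Steps S302 to S310 can be sketched as a small simulation. This is only an illustration under stated assumptions: the register file 104 is modeled as a Python list of slots, the available flags as a list of booleans, the local RAM 112 as a temporary dictionary, and the management table 116 as a dictionary from a (thread, register) key to a (start address, size) pair. The function names `find_contiguous` and `allocate`, and the example contents, are hypothetical and do not come from the disclosure.

```python
def find_contiguous(free, size):
    """Return the start index of the first run of `size` free slots, else None."""
    run = 0
    for i, is_free in enumerate(free):
        run = run + 1 if is_free else 0
        if run == size:
            return i - size + 1
    return None


def allocate(regfile, free, table, key, size):
    """Allocate `size` contiguous slots for `key`, compacting first if needed."""
    if sum(free) < size:
        return None  # too little free space overall: wait for a thread to finish

    start = find_contiguous(free, size)              # S302
    if start is None:
        # S306: save every live register area to local RAM (copies of the slots).
        local_ram = {k: regfile[a:a + n] for k, (a, n) in table.items()}
        # S308: reload the areas back-to-back so that free space becomes one run.
        cursor = 0
        for k, data in local_ram.items():
            regfile[cursor:cursor + len(data)] = data
            table[k] = (cursor, len(data))           # address changed, key unchanged
            cursor += len(data)
        for i in range(len(free)):
            free[i] = i >= cursor
        start = cursor

    for i in range(start, start + size):             # S304: claim the slots
        free[i] = False
    table[key] = (start, size)                       # S310: update the table
    return start


# Fragmented register file: three free slots, but no two of them adjacent.
regfile = ["a0", "a1", None, "b0", "b1", None, "c0", None]
free = [False, False, True, False, False, True, False, True]
table = {("A", "r0"): (0, 2), ("B", "r0"): (3, 2), ("C", "r0"): (6, 1)}

start = allocate(regfile, free, table, ("D", "r0"), 3)
```

In this example the three free slots are scattered, so the request for three contiguous slots triggers the save and reload path; afterwards the existing areas occupy slots 0 to 4 and the new area begins at slot 5. Step S312, in which the PE 114 processes the subsequent thread, is outside the scope of the sketch.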

As described above, unlike the second flow of the first embodiment, the second flow of the second embodiment includes steps that, when a contiguous register area cannot be secured, save all data from the register file 104 to the local RAM 112 and then reload all the data into the register file 104. As a result, the register controller 108 can eliminate the fragmentation of the register area, which register inheritance makes more likely, and keep the efficiency of vector operations high.

Claims

1. A processor (100) comprising:
a plurality of computing units (114);
a thread scheduler (110) that distributes threads to the plurality of computing units;
a register file (104) shared by the plurality of computing units;
a register controller (108) that allocates, in the register file, register areas used by the threads; and
a management table (116) that stores information identifying a thread and a register in association with the address of the register area to which the register is allocated,
wherein, when a register used by a previous thread is to be inherited by a subsequent thread, the register controller does not release the area of the register to be inherited even after processing of the previous thread is completed, and rewrites the information in the management table associated with the address of that area to information identifying the subsequent thread and register.

2. The processor according to claim 1, wherein the register file handles not only registers holding single data but also vector registers capable of collectively handling a plurality of data, and
the plurality of computing units include a computing unit capable of processing the vector registers as well.

3. The processor according to claim 1 or 2, wherein the register controller obtains, from the thread scheduler, information on the previous thread and register to be inherited by the subsequent thread, and rewrites the information on the previous thread and register to be inherited, stored in the management table, to information on the subsequent thread and register.

4. The processor according to any one of claims 1 to 3, wherein, when ending the processing of the previous thread, the register controller releases, out of the register areas secured for the previous thread, the register areas other than those of registers inherited by the subsequent thread, by setting a flag indicating that the area is available.

5. The processor according to any one of claims 1 to 4, wherein, when the register controller cannot secure a contiguous area in the register file for the registers used by a thread, the register controller temporarily saves the data of the registers stored in the register file to a memory, reloads the data from the memory into the register file so as to allocate a contiguous area to each register, and updates the management table.

6. A register inheritance method in a processor (100) comprising:
a plurality of computing units (114);
a thread scheduler (110) that distributes threads to the plurality of computing units;
a register file (104) shared by the plurality of computing units;
a register controller (108) that allocates, in the register file, register areas used by the threads; and
a management table (116) that stores information identifying a thread and a register in association with the address of the register area to which the register is allocated,
wherein the register controller does not release the register area to be inherited by a subsequent thread even after processing of a thread is completed, and rewrites the information in the management table associated with the address of the register area to be inherited to information identifying the subsequent thread and register.