JP2017027432A

JP2017027432A - Processor

Info

Publication number: JP2017027432A
Application number: JP2015146623A
Authority: JP
Inventors: 辰朗木曽; Tatsuro Kiso
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 2015-07-24
Filing date: 2015-07-24
Publication date: 2017-02-02

Abstract

PROBLEM TO BE SOLVED: To provide means for eliminating a performance bottleneck in a processor capable of executing, in parallel, a plurality of execution slots that follow the same instruction.
SOLUTION: Provided is a SIMD-type processor capable of parallel execution that has a plurality of execution slots so that the same instruction can be applied to a plurality of data simultaneously. When an atomic instruction included in the same instruction is executed, the processor identifies, among the execution slots corresponding to the atomic instruction, those whose designated addresses overlap, executes the instruction in one of the execution slots with overlapping addresses, and returns the result of that execution. For the remaining execution slots, the processor returns a result indicating that instruction execution has failed, without executing the instruction.
SELECTED DRAWING: Figure 7

Description

The present invention relates to a SIMD (Single Instruction Multi Data) processor capable of parallel execution that is provided with a plurality of execution slots so that the same instruction can be applied to a plurality of data simultaneously.

In recent years, various processors implementing multimedia instructions suited to image processing, audio processing, and the like have been put into practical use. Multimedia instructions are intended primarily for applications in which the same operation must be repeated over a large amount of data. SIMD (Single Instruction Multiple Data) is known as one implementation technique for realizing multimedia instructions.

In SIMD, in response to a single issued instruction, a plurality of execution slots, each receiving different input data, execute processing in synchronization with one another. That is, a SIMD processor supports the combination of a single instruction stream with a plurality of data streams.

As one example of a SIMD processor, a SIMT (Single Instruction stream Multiple Thread) processor has been proposed. In a SIMT processor, each execution unit manages, in accordance with branch instructions, whether instruction execution on its slot is enabled or disabled, and thereby carries the assigned thread through to completion. Because a plurality of threads are executed in parallel, programs can be written without regard to the degree of parallelism, even when that degree increases.

For example, US Pat. No. 7,627,723 (Patent Document 1) shows an example of a SIMT processor. In the parallel processing subsystem described in Patent Document 1, each of a plurality of ROP (Raster Operation) units is responsible for a specific address range of the memory and executes its assigned threads by accessing that address range.

Patent Document 1: US Pat. No. 7,627,723

In a SIMT processor, each execution slot executes a thread independently. The processing of each thread may include, for example, writing data obtained by that processing back to a common memory. Such processing must maintain atomicity (indivisibility). For this reason, SIMT processors provide atomic instructions, which are used for synchronization between threads.

An atomic instruction typically includes reading data from memory (read), computing the updated data (modify), and writing the updated data to memory (write). While the execution slot responsible for one thread is performing atomic processing in accordance with an atomic instruction, accesses to the same memory by the execution slots responsible for other threads are excluded. Thus, in a SIMT processor, giving an atomic instruction to a plurality of threads causes the processing of the threads that access the same memory to be serialized, and this serialization maintains atomicity.

However, serializing thread processing in order to maintain atomicity may create a performance bottleneck in a SIMT processor. An object of the present invention is therefore to provide one solution for eliminating such a performance bottleneck.

According to one aspect of the present invention, there is provided a SIMD (Single Instruction Multi Data) processor capable of parallel execution that is provided with a plurality of execution slots so that the same instruction can be applied to a plurality of data simultaneously. The same instruction includes an atomic instruction that, while maintaining atomicity, reads the data stored at the address designated for each execution slot and, when the read data satisfies a predetermined condition, writes data to the designated address. When the atomic instruction is executed, the processor identifies, among the plurality of execution slots corresponding to the atomic instruction, the execution slots whose designated addresses overlap, executes the instruction in one of the execution slots with overlapping addresses, and returns the result of that execution. For the remaining execution slots, the processor returns a result indicating that execution of the instruction has failed, without executing the instruction.
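The behavior described in this aspect can be illustrated with a minimal sketch. It assumes that the "predetermined condition" is an equality comparison against an expected value (a compare-and-swap) and that the lowest-numbered slot is the one executed for each duplicated address. The function name `simd_cas` and the dictionary-based memory are hypothetical and serve only as illustration.

```python
from typing import Dict, List


def simd_cas(memory: Dict[str, int], addrs: List[str],
             expected: List[int], new: List[int]) -> List[bool]:
    """Sketch of one SIMD compare-and-swap with duplicate-address rejection.

    Assumptions (not from the source): the condition is equality with
    `expected`, and the lowest slot number wins for each address.
    """
    # Record the first (lowest-numbered) slot that designates each address.
    owner: Dict[str, int] = {}
    for slot, addr in enumerate(addrs):
        owner.setdefault(addr, slot)

    results: List[bool] = []
    for slot, addr in enumerate(addrs):
        if owner[addr] != slot:
            # Duplicate address: report failure without touching memory.
            results.append(False)
        elif memory[addr] == expected[slot]:
            memory[addr] = new[slot]  # condition met: write the new value
            results.append(True)
        else:
            results.append(False)     # condition not met: ordinary CAS failure
    return results


memory = {"A": 0, "B": 5, "C": 7}
results = simd_cas(memory, ["A", "B", "A", "C"], [0, 5, 0, 9], [10, 11, 12, 13])
```

In this sketch slot 2 fails only because slot 0 designates the same address, while slot 3 fails because its comparison does not match. No slot waits for another, so the whole step completes in a single pass.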

Preferably, before executing the instruction in the one execution slot among the execution slots with overlapping addresses, the processor returns, for the remaining execution slots, a result indicating that execution of the instruction has failed.

Preferably, after the instruction of the one execution slot among the execution slots with overlapping addresses has been executed, the processor returns the value read from the designated address as a further result. For the remaining execution slots, if instruction execution in the one execution slot succeeded, the processor returns, as a further result, the value that the instruction of the one execution slot wrote to memory; if instruction execution in the one execution slot failed, the processor returns, as a further result, the value that the one execution slot read from memory during instruction execution.

Preferably, the processor determines which one of the execution slots with overlapping addresses is to be executed on the basis of the numbers assigned to the execution slots.

Preferably, the processor includes a plurality of processor cores and a memory shared among the plurality of processor cores.

According to the present invention, a performance bottleneck in a processor capable of executing in parallel a plurality of threads that follow the same instruction can be eliminated.

FIG. 1 is a schematic diagram showing the device configuration of a computer according to Embodiment 1.
FIG. 2 is a schematic diagram for explaining the processing procedure of an atomic instruction according to the related art.
FIG. 3 is a diagram for explaining the process of determining the execution slot in which an instruction is to be executed.
FIG. 4 is a flowchart showing the processing procedure of a lock-free algorithm using CAS instruction 1 according to the related art.
FIG. 5 is a diagram for explaining changes in the data stored in memory during execution of the lock-free algorithm shown in FIG. 4 (when the number of execution slots is 4).
FIG. 6 is a diagram for explaining another example of changes in the data stored in memory during execution of the lock-free algorithm shown in FIG. 4 (when the number of execution slots is 8).
FIG. 7 is a schematic diagram for explaining the processing procedure of CAS instruction 2 according to Embodiment 1.
FIG. 8 is a flowchart showing the processing procedure of lock-free algorithm 300 using CAS instruction 2 according to Embodiment 1.
FIG. 9 is a diagram for explaining changes in the data stored in memory during execution of the lock-free algorithm shown in FIG. 8 (when the number of execution slots is 4).
FIG. 10 is a diagram for explaining another example of changes in the data stored in memory during execution of the lock-free algorithm shown in FIG. 8 (when the number of execution slots is 8).
FIG. 11 is a schematic diagram for explaining the processing procedure of an atomic instruction according to Embodiment 2.
FIG. 12 is a flowchart showing the processing procedure of lock-free algorithm 350 using a CSW instruction according to Embodiment 2.
FIG. 13 is a schematic diagram showing the device configuration of a computer according to Embodiment 3.
FIG. 14 is a schematic diagram showing the device configuration of a computer according to Embodiment 4.
FIG. 15 is a schematic diagram showing a configuration example of the local storage shown in FIG. 14.
FIG. 16 is a flowchart for explaining the memory access processing procedure in the computer according to Embodiment 4 shown in FIG. 14.
FIG. 17 is a schematic diagram showing an example of the data structure of the access flag used in the processing procedure defined by the flowchart of FIG. 16.
FIG. 18 is a schematic diagram showing a hardware configuration related to a computer according to Embodiment 5.
FIG. 19 is a diagram showing an example of microcode for realizing atomic processing.

Embodiments of the present invention will be described in detail with reference to the drawings. The same or corresponding parts in the drawings are given the same reference numerals, and their description is not repeated.

<A. Overview>
The processor according to the present embodiment is directed to a SIMD (Single Instruction Multi Data) processor capable of parallel execution that is provided with a plurality of execution slots so that the same instruction can be applied to a plurality of data simultaneously. In such a processor, which can execute in parallel a plurality of threads that follow the same instruction, when parallel execution of threads whose access destination addresses overlap is instructed, exclusion processing between the threads is performed so that accesses to the same address do not conflict. As a result, the processing of threads with overlapping addresses is serialized. The present inventor has newly found that, when certain kinds of atomic instructions are executed, this serialization of threads is a performance bottleneck.

The inventor therefore arrived at a different solution that improves processing performance without serializing threads, even when there are a plurality of threads whose access destination addresses overlap. The new atomic instruction conceived by the present inventor is described below.

The atomic instruction according to the present embodiment may be implemented in a processor by any method. For example, in a CISC (Complex Instruction Set Computer) processor such as a CPU (Central Processing Unit) of the x86 family, it may be implemented as a single processor instruction. In RISC processors such as PowerPC (registered trademark) and SPARC (registered trademark), it may be realized by a combination of several kinds of processor instructions. In that case, the two kinds of processor instructions called Load-Link and Store-Conditional may be used.

In a GPU (Graphics Processing Unit), it may be implemented as a single processor instruction that operates in SIMD fashion. Alternatively, the processing procedure according to the present embodiment may be decomposed into and implemented as a plurality of processor instructions.

Depending on the processor instructions implemented in the processor, an optimized instruction sequence is generated from a program written in a high-level language by a compiler, an assembler, or the like. In this case, from the viewpoint of writing the program, there is no need to be aware of differences in processor architecture.

<B. Embodiment 1>
(b1: Device configuration)
First, as Embodiment 1, a configuration in which the atomic processing according to the present embodiment is implemented in a multi-core SIMD processor will be described. Typically, the processor according to Embodiment 1 is assumed to be of the SIMT type, capable of executing in parallel a plurality of threads that follow the same instruction. However, the technical scope of the present invention should be determined on the basis of the claims and does not exclude, for example, new architectures that may be put into practical use in the future.

FIG. 1 is a schematic diagram showing the device configuration of a computer 1 according to Embodiment 1. Referring to FIG. 1, the computer 1 according to Embodiment 1 includes a multi-core processor 10, an external memory 180, and an external device 190.

The multi-core processor 10 includes a plurality of processor cores 100-1 and 100-2 (hereinafter also collectively referred to as "processor cores 100"), a shared cache 110, a memory controller 120, and an IO controller 130. These components are connected via an internal bus 140 so that they can exchange data with one another. The shared cache 110 is shared between the processor cores 100-1 and 100-2. For convenience of explanation, FIG. 1 illustrates a configuration with two processor cores 100, but the number of processor cores 100 is not particularly limited.

The multi-core processor 10 (or each of the processor cores 100) is a processor having a plurality of execution slots so that a plurality of threads following the same instruction can be executed in parallel. In Embodiment 1, a SIMD instruction set is assumed to be implemented in each of the processor cores 100. The SIMD instruction set includes an atomic instruction for executing read-modify-write operations in parallel on a plurality of data addresses in the shared cache 110. With this atomic instruction, atomicity is maintained among the read-modify-write operations executed in parallel. In this specification, "read-modify-write" collectively refers to reading data from memory (read), computing the updated data (modify), and writing the updated data to memory (write). However, "read-modify-write" may also cover cases in which only some of these three operations are performed.

The shared cache 110 is a memory area that each of the processor cores 100 can access.

The memory controller 120 exchanges data with the external memory 180, to which it is connected via an external memory bus 160. The memory controller 120 is connected to the shared cache 110 via an internal bus 150. The memory controller 120 typically writes data read from the external memory 180 to the shared cache 110 and writes data read from the shared cache 110 to the external memory 180.

The IO controller 130 exchanges data with the external device 190, to which it is connected via an external device bus 170. The IO controller 130 typically writes data read from the external device 190 to the shared cache 110 and writes data read from the shared cache 110 to the external device 190.

The external memory 180 is a memory area that is slower to access than the shared cache 110 but has a larger storage capacity. The external device 190 includes various input/output devices and external storage devices.

(b2: Processing procedure of an atomic instruction according to the related art)
First, the processing procedure of an atomic instruction according to the related art in the computer 1 according to Embodiment 1 will be described. FIG. 2 is a schematic diagram for explaining the processing procedure of an atomic instruction according to the related art. A SIMD instruction 200# given to the multi-core processor 10 is applied to four execution slots 201 to 204 (slot0 to slot3). Each of the execution slots 201 to 204 is given the same atomic instruction, to be executed in parallel, and its own independent input data.

An instruction sequence including the SIMD instruction 200# is given to one or both of the processor cores 100-1 and 100-2 (see FIG. 1) according to predetermined logic.

Flag areas 211 to 214 and return value areas 221 to 224 are provided in the memory area in association with the execution slots 201 to 204, respectively. The flag areas 211 to 214 store the execution state of the instruction designated for the corresponding execution slot. The return value areas 221 to 224 store the return value obtained by executing the instruction designated for the corresponding execution slot. The flag areas 211 to 214 and the return value areas 221 to 224 may be implemented using a specific address range of the shared cache 110 or using special registers.

First, when execution of the SIMD instruction 200# starts ((1) start of instruction execution), the values stored in the flag areas 211 to 214 are initialized. In the processing procedure shown in FIG. 2, each of the flag areas 211 to 214 is set to a value meaning either "unprocessed" or "processed" for the instruction designated for the corresponding execution slot.

During initialization, the flag area corresponding to an execution slot whose instruction is designated as valid is set to a value meaning "unprocessed", and the flag area corresponding to an execution slot whose instruction is designated as invalid is set to a value meaning "processed".
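The initialization rule can be written as a short sketch. The constant names, the integer encoding of the two states, and the `valid` mask are assumptions for illustration; the source leaves the concrete encoding open.

```python
from typing import List

# Hypothetical encodings of the two flag states.
UNPROCESSED = 0
PROCESSED = 1


def init_flags(valid: List[bool]) -> List[int]:
    """Valid slots start as UNPROCESSED; invalid slots are marked PROCESSED
    so that later steps skip them."""
    return [UNPROCESSED if v else PROCESSED for v in valid]


# Example: slot 2 has no valid instruction and is therefore never processed.
flags = init_flags([True, True, False, True])
```

Marking invalid slots as "processed" means the later steps need only one test, whether a slot is still "unprocessed", rather than a separate validity check.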

For example, when the flag areas 211 to 214 are implemented using a specific address range of memory, a special value corresponding to the state (for example, "-1" or "FFFF") is written; when the flag areas 211 to 214 are implemented with special registers, the values of those registers are set to off or on according to the state.

In FIG. 2, "address A", "address B", and "address C" written in the SIMD instruction 200# denote the addresses on which the atomic instructions of the corresponding execution slots 201 to 204 operate, that is, the addresses from which the designated instructions read data. In the example shown in FIG. 2, execution slots 201 and 203 are both designated to operate on "address A".

When initialization is complete, the processor core 100 identifies, among the execution slots set to "unprocessed", the execution slots to be processed simultaneously, and determines whether the addresses accessed by those execution slots overlap ((2) address overlap check). That is, the processor core 100 determines whether any of the addresses corresponding to the execution slots are identical.

For an atomic instruction, exclusion processing is performed between execution slots so that accesses to the same address do not conflict. That is, when addresses overlap, the processor core 100 keeps only one of the execution slots with overlapping addresses "unprocessed" and temporarily disables the others (the execution slot indicated by reference numeral 242 in FIG. 2).

When considering a concrete procedure for (2) the address overlap check, the following hardware constraints must be taken into account.

As shown in FIG. 1, in the multi-core processor 10, a plurality of processor cores 100 are connected so as to share the shared cache 110. Within the multi-core processor 10, there is a constraint that only a limited number of cache lines of the shared cache 110, fixed in advance by the hardware, can be accessed in one memory cycle. Therefore, when the processor core 100 executes a SIMD instruction that performs memory operations on a plurality of addresses, such as a SIMD atomic instruction, in each memory cycle it executes in parallel only the execution slots that access the same limited number of cache lines, and it takes one or more memory cycles to complete the instruction for all execution slots. The following description assumes that one cache line can be accessed per memory cycle. Even when two or more cache lines can be accessed per memory cycle, the same processing applies: a plurality of cache lines are selected in the step that determines the cache line to access, and for each cache line only the related execution slots are regarded as enabled. The concrete procedure of the address overlap check under these constraints is described below.

The processor core 100 extracts, from the execution slots set to "unprocessed", the execution slot with the smallest slot number and sets the address corresponding to that slot as address X. The processor core 100 then compares address X with the address of each of the other "unprocessed" execution slots and temporarily disables every execution slot whose address is not on the same cache line. Further, for each address within the same cache line, it performs a process to determine the execution slot in which the instruction is to be executed.
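The cache-line filtering step described above can be sketched as follows. The 64-byte line size, the byte-address representation, and the function name `select_cache_line` are assumptions for illustration, as are the example addresses.

```python
from typing import List, Optional, Tuple

LINE_SIZE = 64  # assumed cache line size in bytes


def select_cache_line(addrs: List[int],
                      unprocessed: List[bool]) -> Tuple[Optional[int], List[bool]]:
    """Pick the cache line of the lowest-numbered unprocessed slot (address X)
    and keep only the unprocessed slots whose address lies on that line."""
    pending = [slot for slot, u in enumerate(unprocessed) if u]
    if not pending:
        return None, [False] * len(addrs)
    line = addrs[min(pending)] // LINE_SIZE  # cache line containing address X
    active = [u and addrs[slot] // LINE_SIZE == line
              for slot, u in enumerate(unprocessed)]
    return line, active


# Hypothetical addresses: slots 0 to 2 share one line, slot 3 is on another.
line, active = select_cache_line([0x10C, 0x104, 0x10C, 0x200], [True] * 4)
```

Slots filtered out here are only temporarily disabled; they remain "unprocessed" and would be picked up when their own cache line is selected in a later memory cycle.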

FIG. 3 is a diagram for explaining the process of determining the execution slot in which an instruction is to be executed.

Typically, the number of simultaneously processable addresses within the same cache line is determined by the cache line size and the data width of the atomic instruction. As shown in FIG. 3(A), for example, when the cache line size is 64 bytes (2^6) and the data width is 32 bits (= 4 bytes (2^2)), the number of simultaneously processable addresses on the same cache line is 64 / 4 = 16 (2^4). These addresses are numbered 0 to 15 in order, and a flag with as many bits as there are execution slots is provided for each number. As shown in FIG. 3(B), each bit of a flag is "1" when the corresponding execution slot accesses that address and "0" otherwise (that is, when the execution slot is disabled or does not access that address).
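The per-address flags can be sketched as one bitmask per word of the cache line, where bit i of a mask is set when execution slot i accesses that word. The byte addresses and the function name `build_address_flags` are hypothetical; the sizes follow the 64-byte / 32-bit example in the text.

```python
from typing import List

LINE_SIZE = 64                            # cache line size in bytes (2^6)
WORD_SIZE = 4                             # atomic data width in bytes (2^2)
WORDS_PER_LINE = LINE_SIZE // WORD_SIZE   # 64 / 4 = 16 addresses per line


def build_address_flags(addrs: List[int], active: List[bool]) -> List[int]:
    """Return one bitmask per word of the cache line; bit `slot` is 1 when
    that active slot accesses the word, and 0 otherwise."""
    flags = [0] * WORDS_PER_LINE
    for slot, (addr, on) in enumerate(zip(addrs, active)):
        if on:
            word = (addr % LINE_SIZE) // WORD_SIZE  # word number 0..15 in the line
            flags[word] |= 1 << slot
    return flags


# Hypothetical example: slots 0 and 2 hit word 3, slot 1 hits word 1,
# slot 3 is disabled because it lies on a different cache line.
flags = build_address_flags([0x10C, 0x104, 0x10C, 0x200],
                            [True, True, True, False])
```

A flag with more than one bit set marks an address that several slots are trying to access at once, which is exactly the overlap the check has to resolve.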

For each per-address flag defined in this way, the lowest bit number set to "1" is taken as the number of the execution slot permitted to access that address. Each execution slot temporarily disables itself if the permitted slot number determined from the flag of the address on which its instruction operates differs from its own slot number.

Referring to FIG. 3, the execution slot in which an instruction is to be executed is determined by the following procedure. First, the cache line is determined from the upper part of the address; then the addresses are classified by the lower part within the same cache line, and the execution slot that accesses each address is determined. In this case, (1) the cache line of execution slot 0 is selected, (2) execution slot 3, which is on a different cache line, is temporarily disabled, (3) for each distinct address in the cache line, the execution slots requesting access are tallied (the result is stored in the table of FIG. 3(B)), and (4) the execution slot that accesses each address is determined. In this example, execution slot 1 is permitted to access address 1 and execution slot 0 is permitted to access address 3. The remaining execution slot (slot 2) is disabled.
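Step (4) amounts to finding the lowest set bit of each non-zero flag. The sketch below reproduces the outcome stated in the text (address 1 goes to slot 1, address 3 goes to slot 0). The flag values are assumed, including the assumption that slot 2 competes with slot 0 for address 3; the source does not state which address slot 2 targets.

```python
from typing import Dict, List


def lowest_set_bit(flag: int) -> int:
    """Index of the least significant 1 bit; flag must be non-zero."""
    return (flag & -flag).bit_length() - 1


def permitted_slots(flags: List[int]) -> Dict[int, int]:
    """Map each accessed address number to the slot permitted to access it."""
    return {word: lowest_set_bit(f) for word, f in enumerate(flags) if f}


# Assumed flag table for one cache line with 16 addresses and 4 slots.
flags = [0] * 16
flags[1] = 0b0010   # slot 1 requests address 1
flags[3] = 0b0101   # slots 0 and 2 request address 3 (assumption for slot 2)

winners = permitted_slots(flags)
enabled = sorted(winners.values())  # slots that run in this memory cycle
```

Any slot whose number is not among the winners, here slot 2, is temporarily disabled for this cycle.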

When the number of execution slots that can run in parallel is small, the implementation need not generate a flag for each distinct address in the cache line; instead, it may compare the in-cache-line address bits of every pair of execution slots and eliminate duplicates.
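A rough sketch of this all-pairs alternative follows. The name `dedup_pairwise`, the `None` marker for an invalid slot, and the rule that the lower-numbered slot wins are assumptions chosen to match the lowest-bit rule used for the flags.

```python
def dedup_pairwise(offsets):
    """offsets[i] holds the in-line address bits of slot i (None if invalid).

    Returns a list of booleans; True means slot i may execute in this pass.
    """
    enabled = [off is not None for off in offsets]
    for i in range(len(offsets)):
        if offsets[i] is None:
            continue
        for j in range(i):
            # A lower-numbered slot with the same offset takes priority.
            if offsets[j] == offsets[i]:
                enabled[i] = False
                break
    return enabled
```

The comparison count grows quadratically with the slot count, which is why the approach is only attractive for narrow SIMD widths.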

In this way, processor core 100 enables exactly one of multiple accesses to the same address and temporarily invalidates the others. In the example shown in FIG. 2, "address A" is specified in both execution slot 201 and execution slot 203, so only execution slot 201 is enabled and execution slot 203 is temporarily invalidated.

Next, for the addresses specified in the execution slots that are set to "unprocessed" (that is, enabled), processor core 100 executes the atomic process (read-modify-write: reading data from memory, calculating the updated data, and writing the updated data to memory) ((3) atomic process). Processor core 100 writes the value obtained by the initial read from memory into the corresponding return value area. Processor core 100 then sets each execution slot whose specified atomic process has completed to "processed".

In the example shown in FIG. 2, processor core 100 writes success/failure result values 231, 232, and 234, which indicate that the instruction executed successfully, into return value areas 221, 222, and 224. Processor core 100 also resets flag area 213, which corresponds to the temporarily invalidated execution slot 203, to "unprocessed".

Processor core 100 then determines whether any execution slot is still set to "unprocessed" ((4) check for unprocessed execution slots). If so, the processing described above (the three processes indicated by reference numeral 240# in FIG. 2) is repeated; otherwise, execution of SIMD instruction 200# ends.

In the example shown in FIG. 2, execution slot 203 remains "unprocessed", so (5) address duplication check, (6) atomic process, and (7) check for unprocessed execution slots are executed. As a result, processor core 100 writes success flag 233 into return value area 223. Once the three processes indicated by reference numeral 240# have been repeated (twice in the example of FIG. 2) and all execution slots are set to "processed", execution of SIMD instruction 200# is complete.

As shown in FIG. 2, in the atomic instruction processing procedure according to the related art, the atomic process is executed exactly once for each enabled execution slot of SIMD instruction 200#. Because execution slots that specify the same address cannot be executed in parallel, when addresses overlap only one of those execution slots is enabled and the others are deferred; the processing is repeated until all execution slots have been processed. With this procedure, read-modify-write is performed atomically on the data at the address specified for each execution slot.
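The related-art loop (duplication check, atomic process, check for unprocessed slots) can be modelled roughly as below. This is a sequential sketch, not the hardware: `run_simd_atomic`, the dictionary used as memory, and the single-cache-line simplification are assumptions made for illustration.

```python
def run_simd_atomic(slot_addrs, memory, op):
    """Apply op(old) -> new once per slot, serialising duplicate addresses.

    Returns (return_values, passes), where passes counts how often the
    three-process loop ran. All addresses are assumed to share one cache line.
    """
    pending = set(range(len(slot_addrs)))   # slots flagged "unprocessed"
    ret = [None] * len(slot_addrs)
    passes = 0
    while pending:                          # check for unprocessed slots
        passes += 1
        seen = set()
        active = []
        for slot in sorted(pending):        # address duplication check
            if slot_addrs[slot] not in seen:
                seen.add(slot_addrs[slot])
                active.append(slot)         # lowest-numbered slot is enabled
        for slot in active:                 # atomic read-modify-write
            addr = slot_addrs[slot]
            ret[slot] = memory[addr]        # value from the initial read
            memory[addr] = op(memory[addr])
            pending.discard(slot)           # mark "processed"
    return ret, passes
```

Running it on the FIG. 2 pattern, `run_simd_atomic(["A", "B", "A", "C"], mem, lambda v: v + 1)`, takes two passes because address A appears twice.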

(b3: CAS instruction)
Below, as an example of a SIMD instruction that is processed atomically, a Compare-And-Swap instruction (hereinafter also referred to as a "CAS instruction") is described for the case where it is executed according to the processing procedure shown in FIG. 2 above.

The CAS instruction is an atomic instruction that reads the data (value) stored at a specified address and returns it as the result of the instruction, compares the read data (first data) with specified second data, and, if the two are equal, writes third data to the specified address.

When a CAS instruction is specified in an execution slot, data is read from the specified memory address, the read data is compared with the old data, and the new data is written to memory only if the two match. With the CAS instruction, the result of comparing the read data is returned as the execution result (return value); that is, the result (return value) is written to the return value area corresponding to the execution slot.

(b4: Implementation and operation of the CAS instruction according to the related art, and a lock-free algorithm using it)
The following describes the processing procedure of the CAS instruction implemented according to the related-art procedure shown in FIG. 2 (hereinafter, to distinguish it from the implementation of the present embodiment, also written "CAS instruction 1" or "atomic-compare-and-swap1" for convenience). A lock-free algorithm using CAS instruction 1 is described as well.

A lock-free algorithm is an algorithm in which processing on data shared among multiple threads (typically data stored in shared cache 110) is executed without any thread taking a lock. That is, multiple threads can read and write concurrently without corrupting the shared data. For example, a thread is never made to wait because data it needs is locked by another thread until that lock is released.

Such lock-free algorithms are used for processing such as weighted histogram calculation and registration in multiple types of queues.

FIG. 4 is a flowchart showing the processing procedure of lock-free algorithm 300# using CAS instruction 1 according to the related art. The flowchart in FIG. 4 shows part of the processing of a program executed by multi-core processor 10, with an example of the code used in the program shown alongside each step.

Referring to FIG. 4, processor core 100 prepares for the data update (step S100). Specifically, processor core 100 sets variable %input, which is used to calculate the updated data from the pre-update data stored at the specified address (code 302).

Next, processor core 100 reads the pre-update data (step S102). Specifically, processor core 100 reads the pre-update data stored at the specified address (addressX) and sets it in variable %old (code 304).

Next, processor core 100 calculates the updated data (step S104). Specifically, processor core 100 calculates the updated data (variable %new) according to the specified arithmetic expression, using variable %old and variable %input as parameters (code 306).

Next, processor core 100 attempts to write the updated data (step S106). Specifically, processor core 100 executes CAS instruction 1 (atomic-compare-and-swap1), which compares the data stored at the specified address (addressX) with the pre-update data used to calculate the updated data (that is, the value of variable %old) and, if the two match, writes the updated data (that is, the value of variable %new) to the specified address (code 308). The return value of CAS instruction 1 (atomic-compare-and-swap1) is the data that was stored at the specified address, and it is set in variable %ret. The data stored at the specified address in the current step (that is, the value of variable %ret) is then compared with the pre-update data used to calculate the updated data (that is, the value of variable %old); if the two match, variable %cond is set to "true", and otherwise variable %cond is set to "false" (code 310).

If the write of the updated data succeeded (variable %cond == true in step S106: code 312), the processing ends. If the write failed (variable %cond == false in step S106: code 314), processor core 100 refreshes the pre-update data so that the updated data can be recalculated (step S108). A failure means that another thread changed the data at the target address while the executing thread was calculating the updated data, so the pre-update data must be refreshed. Specifically, processor core 100 sets the value of variable %ret in variable %old (code 316). The processing from step S104 onward is then executed again.
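Steps S102 to S108 amount to the familiar CAS retry loop. The sketch below models it single-threaded in Python, so nothing here is truly atomic; `cas1`, `lock_free_update`, and the `interfere` hook (a stand-in for a write by another thread) are hypothetical names introduced for illustration, and the local variables mirror %old, %new, %ret, and %cond.

```python
def cas1(memory, addr, expected, new):
    """Model of CAS instruction 1: return the value read from memory and
    store `new` only when that value equals `expected`."""
    old = memory[addr]
    if old == expected:
        memory[addr] = new
    return old


def lock_free_update(memory, addr, inp, f, interfere=None):
    """Retry loop in the style of FIG. 4. Returns the number of CAS attempts."""
    old = memory[addr]                      # S102: %old
    attempts = 0
    while True:
        new = f(old, inp)                   # S104: %new = f(%old, %input)
        if interfere is not None and attempts == 0:
            interfere(memory)               # simulated write by another thread
        ret = cas1(memory, addr, old, new)  # S106: %ret
        attempts += 1
        cond = ret == old                   # %cond
        if cond:
            return attempts                 # write succeeded
        old = ret                           # S108: refresh %old and retry
```

Without interference the loop finishes in one attempt; when another writer changes the location between the read and the CAS, the first attempt fails and the loop recomputes from the value the CAS returned.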

(b5: Number of atomic processes for multiple execution slots on the same address when the related-art CAS instruction (CAS instruction 1) is applied to the lock-free algorithm)
(i) When the number of execution slots is 4
FIG. 5 is a diagram for explaining how the data stored in memory changes during execution of lock-free algorithm 300# shown in FIG. 4 (when the number of execution slots is 4).

FIG. 5 shows the case where the SIMD instruction has four execution slots. When lock-free algorithm 300# shown in FIG. 4 is executed by multiple threads, the compiler converts the processing for four threads, as a single instruction stream, into a processor instruction sequence that includes SIMD CAS instruction 1, and this sequence is executed on multi-core processor 10 (step S106 in FIG. 4). FIG. 5 shows an example of the operation when SIMD CAS instruction 1 in lock-free algorithm 300# given to multi-core processor 10 is executed.

In the example shown in FIG. 5, "address A" is specified as the target address of CAS instruction 1 in both execution slots 201 and 203 (corresponding to addressX in code 308 of FIG. 4). Consequently, the three processes indicated by reference numeral 240# in FIG. 5(A) (address duplication check, atomic process, check for unprocessed execution slots) are repeated twice.

However, when the instruction specified in execution slot 203 finishes executing, its return value (corresponding to variable %ret in code 308 of FIG. 4) is the updated data written to address A by the instruction specified in execution slot 201, which normally differs from the pre-update value. That is, in step S106 of lock-free algorithm 300# shown in FIG. 4, the write of the updated data is judged to have failed. Steps S104 and S106 shown in FIG. 4 are therefore executed again; that is, as shown in FIG. 5(B), CAS instruction 1 is executed again.

As shown in FIG. 5(B), in the second CAS instruction 1, only the instruction specified in execution slot 203 is executed among the instructions specified in SIMD instruction 200#.

As shown in FIGS. 5(A) and 5(B), when lock-free algorithm 300# shown in FIG. 4 is executed, the three processes indicated by reference numeral 240# in FIG. 5 (address duplication check, atomic process, check for unprocessed execution slots) are repeated three times in total.

(ii) When the number of execution slots is 8
FIG. 6 is a diagram for explaining another example of how the data stored in memory changes during execution of lock-free algorithm 300# shown in FIG. 4 (when the number of execution slots is 8).

As shown in FIG. 6(A), when execution of the first CAS instruction 1 (SIMD instruction 200#) starts ((1) start of instruction execution), processor core 100 performs predetermined processing such as initialization and then determines whether the addresses accessed by the simultaneously processed execution slots overlap ((2) address duplication check). In the example shown in FIG. 6(A), "address A" is specified in execution slots 201, 203, and 206, so only execution slot 201 is enabled and execution slots 203 and 206 are temporarily invalidated (see *1 in FIG. 6(A)). Likewise, "address C" is specified in both execution slots 204 and 208, so only execution slot 204 is enabled and execution slot 208 is temporarily invalidated (see *2 in FIG. 6(A)). In this state, processor core 100 executes the atomic process (CAS instruction 1) for the instructions specified in the enabled execution slots ((3) atomic process). As CAS instruction 1 executes, processor core 100 updates memory at the address on which each execution slot acts and writes the value read from memory into the return value area. It also sets the corresponding flag areas to "processed" for execution slots 201, 202, 204, 205, and 207, which executed the atomic process. Further, processor core 100 resets the flag areas corresponding to the temporarily invalidated execution slots 203, 206, and 208 to "unprocessed".

Processor core 100 then determines whether any execution slot is set to "unprocessed" ((4) check for unprocessed execution slots). Next, (5) address duplication check, (6) atomic process, and (7) check for unprocessed execution slots are executed. At this point, "address A" is specified in both execution slots 203 and 206, so only execution slot 203 is enabled and execution slot 206 is temporarily invalidated again (see *3 in FIG. 6(A)).

Even after the three processes indicated by reference numeral 240# have been repeated twice, execution slot 206 is still "unprocessed", so (8) address duplication check, (9) atomic process, and (10) check for unprocessed execution slots are executed.

Even when execution of the first CAS instruction 1 shown in FIG. 6(A) has finished, the return values corresponding to execution slots 203, 206, and 208 (corresponding to variable %ret in code 308 of FIG. 4) are updated data that was changed by other execution slots. That is, in lock-free algorithm 300# (see FIG. 4) for each of execution slots 203, 206, and 208, the write of the updated data is judged in step S106 to have failed. CAS instruction 1 is therefore executed again, as shown in FIG. 6(B).

In FIG. 6(B), "address A" is specified in both execution slots 203 and 206, so the three processes indicated by reference numeral 240# in FIG. 6(B) (address duplication check, atomic process, check for unprocessed execution slots) are repeated twice. Even when execution of the second CAS instruction 1 shown in FIG. 6(B) has finished, the return value corresponding to execution slot 206 (corresponding to variable %ret in code 308 of FIG. 4) is updated data that was changed by another execution slot. CAS instruction 1 is therefore executed yet again, as shown in FIG. 6(C).

In FIG. 6(C), in the third CAS instruction 1, the three processes indicated by reference numeral 240# (address duplication check, atomic process, check for unprocessed execution slots) are executed once for the instruction specified in execution slot 206.

As shown in FIGS. 6(A) to 6(C), when lock-free algorithm 300# shown in FIG. 4 is executed, the three processes indicated by reference numeral 240# in FIG. 6 (address duplication check, atomic process, check for unprocessed execution slots) are repeated six times in total.

The number of executions of the atomic process under the related-art atomic instruction (CAS instruction 1) is described next. For convenience, the description below assumes that all the distinct addresses lie on the same cache line and can be processed in parallel. When addresses on different cache lines are included, the number of executions is calculated for each cache line, and the sum over all cache lines is the actual number of atomic processes.

As shown in FIG. 5, when address A is duplicated across two execution slots, the first CAS instruction 1 executes the atomic process (the three processes indicated by reference numeral 240#) as many times as the maximum duplication count N (= 2). Across those repeated atomic processes, the write of the updated data succeeds for only one of the execution slots that share an address. The second CAS instruction 1 is therefore executed with a duplication count of N minus 1. As a result, the atomic process is executed twice during the first CAS instruction 1 and once during the second, for a total of three executions.

Similarly, as shown in FIG. 6, when address A is duplicated across three execution slots and address C across two, the first CAS instruction 1 executes the atomic process (the three processes indicated by reference numeral 240#) as many times as the maximum duplication count N (= 3). Across those repeated atomic processes, the write of the updated data succeeds for only one of the execution slots that share an address. Each subsequent CAS instruction 1 is therefore executed with a duplication count reduced by one from the previous execution.

As a result, the atomic process is executed three times during the first CAS instruction 1, twice during the second, and once during the third, for a total of six executions.

As described above, when lock-free algorithm 300# is executed once, the total number of executions of the atomic process is at minimum N × (N + 1) / 2, where N is the maximum duplication count of addresses contained in the same SIMD instruction 200#.
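The count N × (N + 1) / 2 is the sum N + (N − 1) + ... + 1 of the passes needed by successive CAS instruction 1 executions. The small simulation below reproduces the totals for the FIG. 5 and FIG. 6 patterns. It is a sketch under assumptions: `total_atomic_passes` is a hypothetical name, the update function is a plain increment, every address is taken to share one cache line, the lowest slot number wins, no other core interferes, and the distinct addresses labelled B, D, and E are placeholders for the non-duplicated slots.

```python
def total_atomic_passes(slot_addrs):
    """Count passes of the three-process loop over a whole lock-free update."""
    memory = {a: 0 for a in slot_addrs}
    old = [memory[a] for a in slot_addrs]       # S102 for every slot
    active = list(range(len(slot_addrs)))       # slots still retrying
    total = 0
    while active:                               # one SIMD CAS instruction 1
        pending, failed = list(active), []
        while pending:
            total += 1                          # one pass of the 240# processes
            seen, deferred = set(), []
            for s in pending:
                a = slot_addrs[s]
                if a in seen:
                    deferred.append(s)          # duplicate: wait for next pass
                    continue
                seen.add(a)
                ret = memory[a]                 # value returned by the CAS
                if ret == old[s]:
                    memory[a] = old[s] + 1      # write %new (here: %old + 1)
                else:
                    failed.append(s)
                    old[s] = ret                # S108: refresh %old
            pending = deferred
        active = sorted(failed)                 # failed slots issue the CAS again
    return total
```

`total_atomic_passes(list("ABAC"))` gives 3 (N = 2) and `total_atomic_passes(list("ABACDAEC"))` gives 6 (N = 3), in line with the totals stated for FIG. 5 and FIG. 6.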

Note that when another thread, executed independently on another processor core 100 within multi-core processor 10, accesses the same address at the same time, the number of processing iterations increases further.

(b6: Implementation and operation of the CAS-like instruction according to Embodiment 1, and a lock-free algorithm using it)
The following describes the implementation and operation of an instruction, similar to the related-art CAS instruction, that is adopted in multi-core processor 10 (or in each of processor cores 100) according to Embodiment 1. For convenience, this instruction is also written "CAS instruction 2" or "atomic-compare-and-set2". CAS instruction 2, defined in the SIMD instruction set, atomically performs the following series of operations: it reads the data (value) stored at the address specified for each execution slot, compares the read data (first data) with specified second data, and, if the two are equal, treats the instruction as having succeeded and writes third data to the specified address. It returns a value indicating success (normally 1) when the compared data were equal and a value indicating failure (normally 0) when they were not. However, when multiple execution slots access the same address, the actual atomic process is performed for only one of them; for the other execution slots the atomic process is not performed and a return value indicating failure is always returned.
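The behaviour of CAS instruction 2 across the slots of one SIMD instruction can be sketched as below. This is a sequential model only: `simd_cas2` is a hypothetical name, the dictionary stands in for memory, and the choice of the lowest-numbered slot as the one that actually executes is an assumption (the text allows either the smallest or the largest slot number).

```python
def simd_cas2(memory, addrs, expected, new):
    """Return one result per slot: 1 for success, 0 for failure.

    For each address, only the first (lowest-numbered) slot performs the
    compare-and-set. Later slots naming the same address are not executed
    and simply report failure.
    """
    results = [0] * len(addrs)
    claimed = set()
    for slot, addr in enumerate(addrs):
        if addr in claimed:
            continue                      # duplicate: no memory access, result stays 0
        claimed.add(addr)
        if memory[addr] == expected[slot]:
            memory[addr] = new[slot]      # compare succeeded: store third data
            results[slot] = 1
    return results
```

Because a duplicate slot fails immediately instead of being replayed inside the same instruction, the instruction finishes in a single pass and the retry is left to the surrounding lock-free loop.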

When SIMD instruction 200 specifying CAS instruction 2 is executed, it operates as follows.

The processing procedure of CAS instruction 2 implemented in computer 1 according to Embodiment 1 is described next. FIG. 7 is a schematic diagram for explaining that procedure. Referring to FIG. 7, SIMD instruction 200 given to multi-core processor 10 is applied to four execution slots 201 to 204 (slot0 to slot3). For each of execution slots 201 to 204, the same atomic instruction (CAS instruction 2), executed in parallel, is specified together with input data that differs from slot to slot.

First, when execution of SIMD instruction 200, in which CAS instruction 2 is specified for each execution slot, starts ((1) start of instruction execution), the values stored in flag areas 211 to 214 are initialized. In the procedure shown in FIG. 7, each of flag areas 211 to 214 holds a value meaning "unprocessed" or "processed" for the CAS instruction 2 specified in the corresponding execution slot.

During initialization, the flag area corresponding to an execution slot in which CAS instruction 2 is validly specified is set to a value meaning "unprocessed", and the flag area corresponding to an execution slot in which the instruction is specified as invalid is set to a value meaning "processed".

In FIG. 7, "address A", "address B", and "address C" written in SIMD instruction 200 denote the addresses on which CAS instruction 2 in the corresponding execution slots 201 to 204 acts. In the example shown in FIG. 7, execution slots 201 and 203 both specify processing on "address A".

When initialization is finished, processor core 100 identifies, among the execution slots set to "unprocessed", those to be processed simultaneously, and determines whether the addresses accessed by those execution slots overlap ((2) address duplication check). That is, processor core 100 identifies the execution slots, among those corresponding to CAS instruction 2, whose specified addresses overlap. In other words, processor core 100 determines whether any of the addresses corresponding to the execution slots are identical.

As described above, within multi-core processor 10 there is a constraint that only a limited number of cache lines of shared cache 110, fixed in advance by the hardware, can be accessed within one memory cycle. Therefore, when processor core 100 executes a SIMD instruction that performs memory operations on multiple addresses, such as a SIMD atomic instruction, in each memory cycle it executes in parallel only those execution slots that access that limited number of cache lines, and completing the instruction for all execution slots takes one or more memory cycles. The description below assumes that one cache line can be accessed per memory cycle. Even when two or more cache lines can be accessed per memory cycle, the same processing applies if multiple cache lines are selected in the step that decides which cache line to access and, for each selected cache line, only the related execution slots are regarded as valid. The specific procedure for the address duplication check under this constraint is described below.

When addresses overlap, the processor core 100 keeps exactly one of the execution slots with overlapping addresses as "unprocessed" and sets the other execution slots to "processed". In addition, for each execution slot set to "processed" because of address duplication, the processor core 100 writes a result (return value) indicating that execution of CAS instruction 2 has failed into the corresponding return value area.

As a specific example of the procedure for (2) the address duplication check, the following processing is executed.
The processor core 100 determines, among the execution slots with overlapping addresses, the one execution slot to be executed on the basis of the numbers assigned to the execution slots. More specifically, the processor core 100 extracts, from the execution slots set to "unprocessed", the execution slot with the smallest (or largest) slot number, and sets the address corresponding to the extracted execution slot as address X. The processor core 100 then compares address X with the address of each of the other execution slots set to "unprocessed", and temporarily disables the execution slots whose addresses are not on the same cache line as address X. Furthermore, for each address within that cache line, it performs processing to determine the execution slot that should execute the instruction.

For the per-address flag determined in this way, the smallest bit number whose value is "1" is taken as the number of the execution slot permitted to access that address.

Each execution slot looks up the number of the permitted execution slot obtained from the flag of the address that matches the target address of its instruction. If that number differs from its own slot number, the slot is set to "processed" and a result (return value) indicating that execution of the instruction has failed is written into the corresponding return value area. On the other hand, an execution slot whose address is on the same cache line and which is permitted by the flag state to access that address proceeds directly to the subsequent processing.

In this way, the processor core 100 enables exactly one of the plurality of accesses to the same address and sets the other accesses to "processed" (with execution of CAS instruction 2 having failed). In the example shown in FIG. 7, the processor core 100 writes the failure result value 233F into the return value area 223.
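The duplication check described above can be sketched in software as follows. This is only an illustrative model under stated assumptions, not the hardware logic of the embodiment: the function name `resolve_duplicates`, the 64-byte `line_size`, and the way the per-address flags are built (bit i set when slot i targets the address) are all hypothetical, since the text does not spell out how the flags are constructed.

```python
def resolve_duplicates(addresses, pending, line_size=64):
    """Return (winners, losers, deferred) slot-index lists for one memory cycle.

    addresses: target address of each slot (hypothetical integer addresses).
    pending:   set of slot indices currently marked "unprocessed".
    """
    # The lowest-numbered unprocessed slot defines address X and its cache line.
    first = min(pending)
    line = addresses[first] // line_size

    # Assumed flag layout: one bit mask per address, bit i set for slot i.
    flags = {}
    deferred = []
    for slot in sorted(pending):
        if addresses[slot] // line_size != line:
            deferred.append(slot)  # temporarily disabled, handled in a later cycle
        else:
            flags[addresses[slot]] = flags.get(addresses[slot], 0) | (1 << slot)

    winners, losers = [], []
    for slot in sorted(pending):
        if slot in deferred:
            continue
        mask = flags[addresses[slot]]
        lowest = (mask & -mask).bit_length() - 1  # index of the lowest set bit
        # Only the lowest-numbered slot for an address may access it.
        (winners if lowest == slot else losers).append(slot)
    return winners, losers, deferred
```

With four slots whose addresses are `[0x100, 0x104, 0x100, 0x108]`, slot index 2 is the only loser, which corresponds to execution slot 203 receiving the failure result value in the FIG. 7 example.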

Next, for CAS instruction 2 designated in the execution slots set to "unprocessed" (that is, enabled), the processor core 100 executes atomic processing (read-modify-write: reading data from memory, computing the updated data, and writing the updated data to memory) ((3) atomic processing). The processor core 100 writes the result (return value) indicating success, obtained from the initial read of data from memory, into the corresponding return value area. Then, for each execution slot, the processor core 100 writes a success/failure result value, which indicates success or failure according to the comparison performed during the atomic processing of CAS instruction 2, into the corresponding return value area. The processor core 100 then sets each execution slot whose designated atomic processing has completed to "processed".

In the example shown in FIG. 7, the processor core 100 writes the success/failure result values 231, 232, and 234 into the return value areas 221, 222, and 224 according to the success or failure of the atomic processing of CAS instruction 2.

As described above, for CAS instruction 2 and the input data designated in execution slots with overlapping addresses, the processor core 100 executes atomic processing in only one execution slot and returns the result of that execution. For the remaining execution slots, it returns a result indicating that execution of CAS instruction 2 has failed, without executing atomic processing. As shown in FIG. 7, in the first embodiment the processor core 100 returns the failure result for the remaining execution slots before executing the instruction of the one selected execution slot among those with overlapping addresses. Writing the failure result value first simplifies the subsequent check for unprocessed execution slots.

However, for convenience of processing, the failure result value 233F may be written into the return value area 223 at the same time as the success/failure result values 231, 232, and 234 are written into the return value areas 221, 222, and 224.

The processor core 100 then sets the temporarily disabled execution slots back to the "unprocessed" state and determines whether any execution slot set to "unprocessed" remains ((4) check for unprocessed execution slots). In the SIMD instruction 200 according to the first embodiment, when there are a plurality of execution slots for which mutually overlapping addresses are designated, only one execution slot is enabled and the other execution slots are set to "processed" (that is, disabled).

Consequently, executing a single SIMD atomic instruction requires at most one operation on any given address.

As shown in FIG. 7, in the processing procedure of CAS instruction 2 according to the first embodiment, when the execution slots included in the SIMD instruction 200 contain slots for which mutually overlapping addresses are designated, atomic processing (typically, processing that includes memory access) is executed for only one of those execution slots. For the others, a result (return value) indicating that execution of CAS instruction 2 has failed is returned without executing atomic processing. As a result, the repeated atomic processing shown in FIG. 2 is not performed. In other words, address duplication is checked before atomic processing is executed, and on the basis of that comparison a result indicating whether execution of CAS instruction 2 succeeded or failed is returned.
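The four stages of FIG. 7 (start, address duplication check, atomic processing, check for unprocessed slots) can be modelled end to end as below. This is a minimal software sketch, not the circuit of the embodiment: `simd_cas2`, the dictionary used as memory, and the 64-byte `line_size` are illustrative assumptions, and one cache line per memory cycle is assumed as in the description.

```python
def simd_cas2(memory, addrs, expected, desired, line_size=64):
    """Simulate one SIMD CAS instruction 2; returns (results, atomic_passes)."""
    n = len(addrs)
    results = [None] * n
    pending = set(range(n))  # (1) every slot starts as "unprocessed"
    passes = 0
    while pending:  # (4) repeat while unprocessed slots remain
        # The lowest-numbered pending slot selects the cache line for this cycle.
        line = addrs[min(pending)] // line_size
        active = sorted(s for s in pending if addrs[s] // line_size == line)
        seen = set()
        winners = []
        for s in active:  # (2) address duplication check
            if addrs[s] in seen:
                results[s] = False  # duplicate: fail without touching memory
            else:
                seen.add(addrs[s])
                winners.append(s)
        for s in winners:  # (3) atomic read-modify-write for the winners
            ok = memory.get(addrs[s]) == expected[s]
            if ok:
                memory[addrs[s]] = desired[s]
            results[s] = ok
        passes += 1
        pending -= set(active)  # slots on other cache lines stay pending
    return results, passes
```

For the FIG. 7 situation, where the first and third slots target the same address, the third slot returns `False` without any memory access and the whole instruction finishes in a single atomic pass.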

FIG. 8 is a flowchart showing the processing procedure of the lock-free algorithm 300 using CAS instruction 2 according to the first embodiment. The flowchart in FIG. 8 shows part of the processing of a program executed by the multi-core processor 10, and shows an example of the code used in the program in association with each step.

Referring to FIG. 8, the processor core 100 prepares for the data update (step S100). Specifically, the processor core 100 sets the variable %input, which is used to calculate the updated data from the pre-update data stored at the designated address (code 302).

Next, the processor core 100 attempts to write the updated data (step S110). Specifically, the processor core 100 executes CAS instruction 2 (atomic-compare-and-set2), which compares the data stored at the designated address (addressX) with the pre-update data used to calculate the updated data (that is, the value of the variable %old) and, if the two match, writes the updated data (that is, the value of the variable %new) to the designated address (code 318). The return value of CAS instruction 2 (atomic-compare-and-set2) is a Boolean: "true" is returned if the updated data was written successfully, and "false" otherwise. That is, if the data stored at the designated address at this step matches the pre-update data used to calculate the updated data (the value of the variable %old), the variable %cond is set to "true"; otherwise, %cond is set to "false".

If the updated data was written successfully (variable %cond == true in step S110: code 320), the processing ends. If writing the updated data failed (variable %cond == false in step S110: code 322), the processing from step S102 onward is executed again. This case means that another thread changed the data at the target address while the executing thread was calculating the updated data, so the pre-update data must be read again.

Unlike the lock-free algorithm 300# shown in FIG. 4, the lock-free algorithm 300 using CAS instruction 2 according to the first embodiment reads the pre-update data stored at the designated address, which is not atomic processing, on every iteration. This is because the return value specification of CAS instruction 2 differs from that of CAS instruction 1, and the latest data in the memory cannot be obtained as the return value. However, because reading the pre-update data is not atomic processing, it can use the cache and is much faster than atomic processing. Moreover, even if another execution slot designates the same address, executing CAS instruction 2, which is an atomic instruction, requires at most one access to that address, so the processing load of memory access does not increase.
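The retry loop of FIG. 8 can be sketched as follows. This is a hedged illustration rather than the code shown in the figure: `cas2` and `lock_free_update` are hypothetical names, memory is a plain dictionary, all addresses are assumed to share one cache line, and the step labels in the comments follow the step numbers cited in the text.

```python
def cas2(memory, addrs, expected, desired, active):
    """One SIMD CAS instruction 2 over the active slots (single cache line assumed)."""
    results, seen = {}, set()
    for s in sorted(active):
        if addrs[s] in seen:
            results[s] = False  # duplicate address: fail, no memory access
            continue
        seen.add(addrs[s])
        results[s] = memory[addrs[s]] == expected[s]
        if results[s]:
            memory[addrs[s]] = desired[s]
    return results


def lock_free_update(memory, addrs, inputs, update):
    """Retry loop in the style of FIG. 8; returns the number of CAS instruction 2 executions."""
    active = set(range(len(addrs)))
    rounds = 0
    while active:
        old = {s: memory[addrs[s]] for s in active}           # S102: plain (non-atomic) read
        new = {s: update(old[s], inputs[s]) for s in active}  # S104: compute updated data
        cond = cas2(memory, addrs, old, new, active)          # S110: attempt the write
        active = {s for s in active if not cond[s]}           # slots with %cond == false retry
        rounds += 1
    return rounds
```

Calling `lock_free_update(mem, ["A", "B", "A", "C"], [1, 1, 1, 1], lambda old, x: old + x)` with all counters initially zero takes two rounds and leaves `mem["A"] == 2`, because the duplicated slot is re-read and retried once.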

(b7: Number of atomic processing passes for a plurality of execution slots on the same address when the CAS-like instruction (CAS instruction 2) according to the first embodiment is applied to the lock-free algorithm)
For convenience, the following description assumes that all addresses, including distinct ones, lie on the same cache line and can be processed in parallel. When addresses on different cache lines are included, the number of atomic processing executions is calculated for each cache line, and the sum over all cache lines is the actual number of atomic processing executions.

(i) When the number of execution slots is 4
FIG. 9 is a diagram for explaining the changes in the data stored in memory during execution of the lock-free algorithm 300 shown in FIG. 8 (when the number of execution slots is 4).

As in FIG. 5, FIG. 9 shows the operation when the SIMD instruction 200 containing four atomic instructions is given to the multi-core processor 10.

In the example shown in FIG. 9 as well, "address A" is designated as the target address in both of the instructions designated in execution slots 201 and 203 (corresponding to addressX in code 318 of FIG. 8). In this case, the instructions designated in execution slots 201, 202, and 204 are executed in parallel, while the instruction designated in execution slot 203 is given the failure result value 233F and set to "processed". That is, for execution slot 203, "false", indicating that execution of the instruction has failed, is returned as the return value (corresponding to the variable %cond in code 318 of FIG. 8).

When the three processes indicated by reference numeral 240 in FIG. 9(A) (address duplication check, atomic processing, and check for unprocessed execution slots) have been executed once, the first execution of CAS instruction 2 is complete. However, because the return value of the instruction set in execution slot 203 is "false", the corresponding thread again reads the pre-update data, calculates the updated data, and attempts to write the updated data (steps S102, S104, and S110 in FIG. 8). Repeating the write attempt (step S110) causes CAS instruction 2 to be executed a second time, as shown in FIG. 9(B). In the second execution of CAS instruction 2 as well, the three processes indicated by reference numeral 240 are executed once.

(ii) When the number of execution slots is 8
FIG. 10 is a diagram for explaining another example of the changes in the data stored in memory during execution of the lock-free algorithm 300 shown in FIG. 8 (when the number of execution slots is 8).

In contrast, the case where CAS instruction 2 according to the first embodiment is used is described here. As in FIG. 6, FIG. 10 shows the operation when the SIMD instruction 200 is given to the multi-core processor 10.

In the example shown in FIG. 10 as well, "address A" is designated as the target address in execution slots 201, 203, and 206 (corresponding to addressX in code 318 of FIG. 8). In execution slots 204 and 208, "address C" is designated as the target address.

When the three processes indicated by reference numeral 240 in FIG. 10(A) (address duplication check, atomic processing, and check for unprocessed execution slots) have been executed once, the first execution of CAS instruction 2 is complete. Because the return values of the instructions set in execution slots 203, 206, and 208 are all "false", reading the pre-update data, calculating the updated data, and attempting to write the updated data (steps S102, S104, and S110 in FIG. 8) are executed again for them. Repeating the write attempt (step S110) causes CAS instruction 2 to be executed a second time, as shown in FIG. 10(B). In the second execution of CAS instruction 2 as well, the three processes indicated by reference numeral 240 are executed once.

Furthermore, because the return value of the instruction set in execution slot 206 is still "false", reading the pre-update data, calculating the updated data, and attempting to write the updated data (steps S102, S104, and S110 in FIG. 8) are executed once more. Repeating the write attempt (step S110) causes CAS instruction 2 to be executed a third time, as shown in FIG. 10(C). In the third execution of CAS instruction 2 as well, the three processes indicated by reference numeral 240 (address duplication check, atomic processing, and check for unprocessed execution slots) are executed once.

As shown in FIG. 9, when address A is duplicated across two execution slots, the first execution of CAS instruction 2 performs atomic processing (the three processes indicated by reference numeral 240) once. Among the execution slots with overlapping addresses, the updated data is written successfully for only one. The second execution of CAS instruction 2 therefore runs with the maximum duplication count N reduced by one. In each execution of CAS instruction 2, however, atomic processing (the three processes indicated by reference numeral 240) is executed only once. As a result, atomic processing is executed once in each of the first and second executions of CAS instruction 2, for a total of two executions.

Likewise, as shown in FIG. 10, when address A is duplicated across three execution slots and address C across two execution slots, the first execution of CAS instruction 2 performs atomic processing (the three processes indicated by reference numeral 240) once. Among the execution slots with overlapping addresses, the updated data is written successfully for only one. That is, each subsequent execution of CAS instruction 2 runs with the maximum duplication count N reduced by one each time.

As a result, atomic processing is executed once in each of the first through third executions of CAS instruction 2, for a total of three executions.

As described above, one run of the lock-free algorithm 300 results in a total of at least N executions of atomic processing, where N is the maximum duplication count of the addresses included in the same SIMD instruction 200.

Note that when another, independently executing thread accesses the same address at the same time, the number of executions increases further.

In summary, let N be the maximum duplication count of the addresses designated in the execution slots executed in parallel by the multi-core processor 10. With the lock-free algorithm 300# using CAS instruction 1 according to the related art, N × (N + 1) / 2 atomic processing passes are executed, provided there are no external writes to the data. In contrast, with the lock-free algorithm 300 using CAS instruction 2 according to the first embodiment, N atomic processing passes are executed.
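The two counts can be read as simple sums. The decomposition below is an interpretation offered for clarity, not a formula quoted from the description: it assumes that the k-th execution of CAS instruction 1 still performs one atomic pass for each of the N − k + 1 slots that remain duplicated, whereas every execution of CAS instruction 2 performs exactly one pass.

```latex
\underbrace{N + (N-1) + \dots + 1}_{\text{CAS instruction 1}}
  = \sum_{k=1}^{N} k = \frac{N(N+1)}{2},
\qquad
\underbrace{1 + 1 + \dots + 1}_{\text{CAS instruction 2, } N \text{ executions}} = N .
```

For the FIG. 10 case with N = 3, this gives 6 passes against 3.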

The time required to execute the lock-free algorithms shown in FIGS. 4 and 8 can be estimated as the sum of the time required to read the pre-update data (step S102) and the time required for the attempts to write the updated data (steps S106 and S110).

Here, let t1 be the time required to read the pre-update data (the time required to execute step S102 in FIGS. 4 and 8). The address duplication check and the check for unprocessed execution slots shown in FIGS. 2, 3, and 6 to 9 are pipelined within the multi-core processor 10, so their processing time is practically negligible. Therefore, only atomic processing needs to be considered. In the following, let t2 be the time required for atomic processing.

Accordingly, the time T1 required to execute the lock-free algorithm 300# shown in FIG. 4 can be expressed by the following equation (1).

T1 = t1 × 1 + t2 × N × (N + 1) / 2 … (1)
The time T2 required to execute the lock-free algorithm 300 shown in FIG. 8 can be expressed by the following equation (2).

T2 = (t1 + t2) × N … (2)
Here, because a read of data from memory, similar to that in normal processing, occurs in the middle of atomic processing, the relation t2 ≫ t1 holds between the time t2 required for atomic processing and the read time t1 of the pre-update data. Considering the load on the bus and other factors, atomic processing requires a far higher processing load than a normal memory access (reading data from memory). Therefore, the time t2 required for atomic processing is sufficiently large compared with the read time t1 of the pre-update data.

Examining equations (1) and (2) gives the following.
(i) When the maximum duplication count N = 1
T1 = t1 + t2 × 1 × 2 / 2 = t1 + t2
T2 = (t1 + t2) × 1 = t1 + t2 = T1
(ii) When the maximum duplication count N = 2
T1 = t1 + t2 × 2 × 3 / 2 = t1 + 3 × t2
T2 = (t1 + t2) × 2 = t1 × 2 + t2 × 2 = T1 − (t2 − t1) < T1
Therefore, because t2 > t1, T2 < T1 always holds when the maximum duplication count N ≥ 2. That is, if the addresses designated in the execution slots executed in parallel by the multi-core processor 10 overlap, the execution time of the lock-free algorithm 300 according to the present embodiment is shorter than that of the lock-free algorithm 300# according to the related art.
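Equations (1) and (2) are easy to evaluate numerically. The sketch below is illustrative only: the function names are hypothetical and the values t1 = 1 and t2 = 10 used in the example are arbitrary, chosen merely to satisfy t2 > t1. In general the difference factors as T1 − T2 = (N − 1) × (N × t2 / 2 − t1), which is zero for N = 1 and positive for N ≥ 2 whenever t2 > t1.

```python
def time_related_art(n, t1, t2):
    """Equation (1): one plain read plus n(n+1)/2 atomic passes."""
    return t1 + t2 * n * (n + 1) / 2


def time_embodiment(n, t1, t2):
    """Equation (2): n rounds, each with one plain read and one atomic pass."""
    return (t1 + t2) * n


def saving(n, t1, t2):
    """T1 - T2, which factors as (n - 1) * (n * t2 / 2 - t1)."""
    return time_related_art(n, t1, t2) - time_embodiment(n, t1, t2)
```

With t1 = 1 and t2 = 10, `saving(1, 1, 10)` is 0, `saving(2, 1, 10)` equals t2 − t1 = 9, and `saving(3, 1, 10)` is 28, so the advantage grows with the duplication count.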

As described above, using CAS instruction 2, the atomic instruction according to the first embodiment, allows the implementation of the lock-free algorithm to be further optimized. With this optimization, the same lock-free algorithm can be realized with fewer executions of atomic processing than when CAS instruction 1, the atomic instruction according to the related art, is used.

<C. Second Embodiment>
In CAS instruction 2 according to the first embodiment described above, a result indicating success or failure is returned as the return value, but a plurality of results may be returned instead. The second embodiment describes an implementation example of a CSW instruction (hereinafter also written as "atomic-compare-swap-weak") that returns a plurality of results in accordance with the external behavior specification of the "atomic::compare_exchange_weak" function defined in the C++ standard (ISO/IEC 14882:2011).

The computer on which the CSW instruction according to the second embodiment is implemented, and the multi-core processor it contains, are the same as in the first embodiment described above, so their detailed description is not repeated.

When a SIMD instruction in which CSW instructions, which are atomic instructions, are designated in the respective execution slots is given to the multi-core processor 10, it is determined whether the addresses accessed by the simultaneously processed execution slots overlap. When there are a plurality of execution slots for which mutually overlapping addresses are designated, only one execution slot is enabled and the other execution slots are disabled. That is, when execution slots with mutually overlapping addresses exist, atomic processing (typically, processing that includes memory access) is executed for only one execution slot; for the other execution slots, a result (return value) indicating that execution of the instruction has failed is returned without executing atomic processing.

For an execution slot in which atomic processing was actually executed, a result (flag value) indicating whether execution of the instruction succeeded is returned as result 1, and the data (value) read during atomic processing is returned as result 2. In this way, after executing the instruction of the one selected execution slot among the atomic instructions designated in execution slots with overlapping addresses, the multi-core processor 10 returns the value read from the designated address as a further result (result 2).

For an execution slot in which atomic processing was not executed, a result (flag value) indicating that execution of the instruction has failed is returned as result 1. Result 2 for such a slot is determined as follows: (1) if result 1 of the execution slot in which atomic processing was actually executed indicates success, the data (value) that the instruction of that executed slot wrote to memory is returned; (2) if result 1 of the executed slot indicates failure, the value that the executed slot read from memory during instruction execution (that is, the same value as result 2 of the executed slot) is returned.

In this way, for the remaining atomic instructions that are not executed among those designated in execution slots with overlapping addresses, the multi-core processor 10 returns, as a further result (result 2), the value written by the one executed atomic instruction if its execution succeeded. If execution of the one executed atomic instruction failed, the multi-core processor 10 instead returns the value read by that atomic instruction as the further result (result 2).
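The result 1 and result 2 rules for executed and non-executed slots can be condensed into a short model. This is a sketch under assumptions, not the instruction's actual implementation: `simd_csw` is a hypothetical name, memory is a plain dictionary, all addresses are assumed to lie on one cache line, and the lowest-numbered slot is taken as the one that executes.

```python
def simd_csw(memory, addrs, expected, desired):
    """Return per-slot (result1, result2) pairs for one SIMD CSW instruction."""
    outcome = {}  # address -> value that duplicate slots receive as result 2
    results = []
    for s, addr in enumerate(addrs):
        if addr in outcome:
            # Duplicate slot: no atomic processing, result 1 is always failure.
            results.append((False, outcome[addr]))
            continue
        read = memory[addr]  # value loaded during atomic processing
        ok = read == expected[s]
        if ok:
            memory[addr] = desired[s]
        # Duplicates see the written value on success, the loaded value on failure.
        outcome[addr] = desired[s] if ok else read
        results.append((ok, read))
    return results
```

A duplicate slot therefore always receives the value that memory holds after the executed slot has finished, which lets its retry loop reuse result 2 as the fresh pre-update data.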

FIG. 11 is a schematic diagram for explaining the processing procedure of the atomic instruction according to the second embodiment. Referring to FIG. 2, the SIMD instruction 200 given to the multi-core processor 10 includes four execution slots 201 to 204 (slot0 to slot3). In this example, the CSW instruction according to the second embodiment is designated in each of the execution slots 201 to 204.

When execution of the SIMD instruction 200 starts ((1) instruction execution start), the flag areas and other areas that store result 1 and result 2 for each execution slot are initialized. After initialization, the processor core 100 identifies, among the execution slots set to “unprocessed”, those that are processed simultaneously, and determines whether the addresses accessed by those execution slots overlap ((2) address duplication check). That is, the processor core 100 determines whether any of the addresses corresponding to the execution slots are identical. The processor core 100 then enables only one of the multiple accesses to the same address and sets the other accesses to “processed”. In the example shown in FIG. 7, the processor core 100 writes the failure result value 233F to the flag area corresponding to the execution slot 203. That is, the processor core 100 returns, as result 1 of the CSW instruction specified in the execution slot 203, a result indicating that execution of the instruction failed.

Next, for the instructions specified in the execution slots that are still set to “unprocessed” (that is, enabled), the processor core 100 executes the atomic process (CSW instruction) ((3) atomic processing). At this time, the processor core 100 returns, as result 2 of each CSW instruction specified in the execution slots 201, 202, and 204, the data (value) read during the atomic process.

Next, the processor core 100 returns, as result 1 of each CSW instruction specified in the execution slots 201, 202, and 204, a result (flag value) indicating whether execution of the instruction succeeded or failed, according to whether the data stored at the specified address matched the pre-update data ((4) generation of result 1).

Further, as result 2 of the CSW instruction set in the execution slot 203, which is not executed because of the address overlap, the processor core 100 returns one of the following according to the processing result of the execution slot 201, which has the same address: the data (value) written by the CSW instruction specified in the execution slot 201 (when success is returned as result 1 of the execution slot 201), or the same data (value) as result 2 of the CSW instruction specified in the execution slot 201 (when failure is returned as result 1 of the execution slot 201) ((5) generation of result 2).

Once the atomic process has been executed once in this way, execution of the SIMD instruction 200 ends.
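The five steps (1) to (5) described above can be sketched in software as follows. This is a minimal simulation under stated assumptions, not the hardware design: the function name `simd_csw`, the dictionary standing in for memory, and the rule that the lowest-numbered slot is the one executed are illustrative choices that do not come from the specification.

```python
def simd_csw(memory, addrs, olds, news):
    """Simulate one SIMD CSW instruction over all execution slots.

    Returns (result1, result2): per-slot success flags and per-slot values.
    """
    n = len(addrs)
    result1 = [False] * n                  # (1) initialise the flag areas
    result2 = [None] * n
    owner = {}                             # address -> slot that will execute
    for slot, addr in enumerate(addrs):    # (2) address duplication check
        owner.setdefault(addr, slot)       # assumption: lowest slot number wins
    for addr, slot in owner.items():       # (3) one atomic process per address
        read = memory[addr]
        result2[slot] = read               # executed slot returns what it read
        result1[slot] = (read == olds[slot])   # (4) result 1
        if result1[slot]:
            memory[addr] = news[slot]
    for slot, addr in enumerate(addrs):    # (5) result 2 for skipped slots
        lead = owner[addr]
        if slot != lead:
            result2[slot] = news[lead] if result1[lead] else result2[lead]
    return result1, result2


memory = {0x10: 5, 0x14: 7, 0x18: 9}
# slot0 and slot2 specify the same address, like slots 201 and 203 in the text
r1, r2 = simd_csw(memory, [0x10, 0x14, 0x10, 0x18], [5, 0, 5, 9], [6, 8, 60, 10])
print(r1, r2)
```

In this run, slot2 is never executed: it reports failure as result 1 and receives the value written by slot0 as result 2, because slot0 succeeded.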

FIG. 12 is a flowchart showing the processing procedure of a lock-free algorithm 350 that uses the CSW instruction according to the second embodiment. The flowchart in FIG. 12 shows part of the processing of a program executed by the multi-core processor 10 and, for each step, an example of the code used in the program.

Referring to FIG. 12, the processor core 100 prepares for the data update (step S100). Specifically, the processor core 100 sets the variable %input used to calculate the post-update data from the pre-update data stored at the specified address (code 302).

Next, the processor core 100 attempts to write the post-update data (step S120). Specifically, the processor core 100 executes the CSW instruction (atomic-compare-swap-weak), which compares the data stored at the specified address (addressX) with the pre-update data used to calculate the post-update data (that is, the value of the variable %old) and, if the two match, writes the post-update data (that is, the value of the variable %new) to the specified address (code 328). Result 1 of the CSW instruction (atomic-compare-swap-weak) is set in the variable %cond, and result 2 is set in the variable %old.

If the post-update data is written successfully (variable %cond == true in step S120: code 330), the process ends. If the write fails (variable %cond == false in step S120: code 332), the processing from step S104 onward is executed again. When the write fails, the data (value) read during the atomic process is returned as result 2, so the read of the pre-update data (step S102) does not need to be repeated. A failed CSW instruction means that another thread has changed the data at the target address, but because the pre-update data does not have to be read again, the increase in processing load associated with memory access can be suppressed.
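The retry loop of FIG. 12 might look as follows in software. This is only a sketch: `csw` is a hypothetical scalar stand-in for the CSW instruction, `lock_free_add` and its `interference` parameter are invented here to simulate a write by another thread, and the update rule (adding a constant) is an arbitrary example. The behaviour is comparable to `compare_exchange_weak` on a C++ `std::atomic`, which also hands back the observed value when it fails.

```python
def csw(memory, addr, old, new):
    """Hypothetical scalar CSW: returns (result 1, result 2)."""
    read = memory[addr]
    if read == old:
        memory[addr] = new
        return True, read
    return False, read


def lock_free_add(memory, addr, delta, interference=()):
    pending = list(interference)
    old = memory[addr]                 # the only explicit read (before the loop)
    attempts = 0
    while True:
        new = old + delta              # compute the post-update data
        if pending:                    # simulate another thread writing first
            memory[addr] = pending.pop(0)
        attempts += 1
        cond, old = csw(memory, addr, old, new)   # %cond and %old
        if cond:
            return attempts            # on failure, old already holds the new value


memory = {0x20: 100}
attempts = lock_free_add(memory, 0x20, 1, interference=[500])
print(attempts, memory[0x20])
```

The first attempt fails because the simulated other thread stored 500. The second attempt reuses the value returned as result 2 instead of reading memory again.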

Although the CSW instruction according to the second embodiment increases the circuit area required for hardware implementation, it conforms to the behavior of a function provided in a standard programming language (for example, the C++ core language). Existing programs can therefore use the CSW instruction according to the second embodiment without modification. In addition, when a lock-free algorithm such as the one shown in FIG. 12 is implemented, the memory accesses added inside the loop (reads of the pre-update data) can be reduced.

With the CSW instruction according to the second embodiment, the data (value) returned as result 2 for an execution slot in which the atomic process is not executed changes according to result 1 of the execution slot in which the atomic process was actually executed. This provides an interface in which the execution slots behave as if they were executed serially, and such an interface makes programming easier.
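One way to read the "as if executed serially" property is that the skipped slot always receives the value that memory holds once the executed slot has finished, which is what a later serial read would have observed. The following sketch checks this for both outcomes. The helper name `csw_lead_and_skipped` is hypothetical, and the two-slot scenario is an assumption made for brevity.

```python
def csw_lead_and_skipped(memory, addr, lead_old, lead_new):
    """Execute the lead slot; return (result1, result2) for lead and skipped slot."""
    read = memory[addr]
    lead_ok = read == lead_old
    if lead_ok:
        memory[addr] = lead_new
    # Skipped slot: result 1 is always failure; result 2 depends on the lead slot.
    skipped_result2 = lead_new if lead_ok else read
    return (lead_ok, read), (False, skipped_result2)


mem_success = {0x10: 5}
lead_s, skipped_s = csw_lead_and_skipped(mem_success, 0x10, 5, 6)

mem_failure = {0x10: 5}
lead_f, skipped_f = csw_lead_and_skipped(mem_failure, 0x10, 4, 6)

print(skipped_s, mem_success[0x10])
print(skipped_f, mem_failure[0x10])
```

In both cases the second element of the skipped slot's result equals the current memory content, so a retry loop can continue from it without a further read.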

<D. Embodiment 3>
Next, as a third embodiment, a configuration is described in which the atomic processing according to the present embodiment is implemented in a shared-memory (NUMA: Non-Uniform Memory Access) multiprocessor computer system.

FIG. 13 is a schematic diagram showing the device configuration of a computer 1A according to the third embodiment. Referring to FIG. 13, the computer 1A according to the third embodiment includes two multi-core processors 10-1 and 10-2, external memories 180-1 and 180-2, and external devices 190-1 and 190-2. The multi-core processor 10-1 and the multi-core processor 10-2 are connected via a processor interconnect bus 144 so that they can exchange data.

The atomic instructions according to the first embodiment (CAS instruction 2 and/or the CSW instruction) are assumed to be implemented in the multi-core processors 10-1 and 10-2 (that is, in the processor cores 100-1 to 100-4).

The multi-core processor 10-1 has processor cores 100-1 and 100-2, and the multi-core processor 10-2 has processor cores 100-3 and 100-4. The processor cores 100-1 and 100-2 can access the shared cache 110-2 in the multi-core processor 10-2 via the processor interconnect bus 144. Similarly, the processor cores 100-3 and 100-4 can access the shared cache 110-1 in the multi-core processor 10-1 via the processor interconnect bus 144.

When a memory area connected to a multi-core processor 10 other than the one executing an instruction must be accessed, the atomic process takes longer to execute. That is, if, while a processor core 100 is executing an atomic instruction, the specified address X resides in shared memory connected to a processor core 100 other than the one executing the atomic instruction (also referred to as a “remote memory area”), the access goes through the processor interconnect bus 144.

In other words, during the atomic process, data is read from the remote memory area and written to the remote memory area via the processor interconnect bus 144. As a result, hardware resources are occupied for longer, and the time required for the atomic process itself increases. The number of times the atomic process is executed therefore has an even greater effect on performance.

In other words, the performance improvement obtained by implementing CAS instruction 2 according to the first embodiment and/or the CSW instruction according to the second embodiment is more pronounced in a shared-memory multiprocessor computer system such as the one shown in FIG. 13.
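The argument can be made concrete with a rough count of atomic accesses for a single SIMD instruction. This is an illustrative model only: the function name `atomic_bus_time` is hypothetical, the per-access costs are arbitrary units rather than measured latencies, and retries issued later by slots that were reported as failed are not modelled.

```python
def atomic_bus_time(addrs, cost_per_access):
    """Return (serialized, deduplicated) cost for one SIMD atomic instruction."""
    serialized = len(addrs) * cost_per_access         # every slot executes in turn
    deduplicated = len(set(addrs)) * cost_per_access  # one slot per distinct address
    return serialized, deduplicated


addrs = [0x10, 0x10, 0x10, 0x18]       # three slots collide on one address

local = atomic_bus_time(addrs, cost_per_access=1)    # local memory, arbitrary unit
remote = atomic_bus_time(addrs, cost_per_access=5)   # remote memory, assumed 5x slower

saving_local = local[0] - local[1]
saving_remote = remote[0] - remote[1]
print(local, remote, saving_local, saving_remote)
```

The number of avoided accesses is the same in both cases, but the time saved scales with the cost of each access, which is why the benefit grows when accesses cross the interconnect.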

The rest of the configuration is the same as that of the computer 1 shown in FIG. 1, so the detailed description is not repeated.

<E. Embodiment 4>
Next, as a fourth embodiment, a configuration is described in which the atomic processing according to the present embodiment is implemented in a multiprocessor computer system that provides local storage for each processor core. A configuration with local storage for each processor core is typically applied to image processing engines (GPUs) and the like.

FIG. 14 is a schematic diagram showing the device configuration of a computer 1B according to the fourth embodiment. Referring to FIG. 14, the computer 1B according to the fourth embodiment includes a multi-core processor 10B, an external memory 180, and an external device 190.

The multi-core processor 10B includes a plurality of processor cores 100-1 and 100-2, a shared cache 110, a memory controller 120, and an IO controller 130. The processor cores 100-1 and 100-2 are connected to local storages 102-1 and 102-2 (hereinafter also collectively referred to as “local storage 102”) via storage buses 104-1 and 104-2, respectively. For convenience of explanation, FIG. 14 illustrates a configuration with two processor cores 100, but the number of processor cores 100 is not particularly limited.

In the computer 1B, a local storage 102 corresponding to each processor core 100 is implemented in hardware. The memory area of a local storage 102 can be shared only among specific threads. In other words, the memory area of a local storage 102 does not need to be shared between processor cores 100. The memory capacity required of a local storage 102 is also limited. Each local storage 102 is therefore placed in the immediate vicinity of its processor core 100.

A multi-core processor 10B such as the one shown in FIG. 14 can realize, for example, a platform similar to the local address space in OpenCL (Open Computing Language).

In a multi-core processor 10B such as the one shown in FIG. 14, it is preferable to adopt a multi-bank configuration in which each adjacent unit of data of a predetermined size (for example, every 32 bits) is stored in a different bank. With such a multi-bank configuration, parallel memory access performance equivalent to the data parallelism handled by the processor core 100 can be achieved even for random access.
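A common way to obtain such interleaving is to derive the bank index from the word index of the address. The sketch below assumes byte addressing, a 32-bit interleaving unit, and four banks. The names `WORD_BYTES`, `NUM_BANKS`, and `bank_of` are illustrative and do not appear in the specification.

```python
WORD_BYTES = 4   # 32-bit interleaving unit, following the example in the text
NUM_BANKS = 4    # assumed bank count; any count greater than one would work


def bank_of(byte_addr):
    """Map a byte address to the bank that stores its 32-bit word."""
    return (byte_addr // WORD_BYTES) % NUM_BANKS


banks = [bank_of(a) for a in (0x00, 0x04, 0x08, 0x0C, 0x10)]
print(banks)
```

Consecutive 32-bit words land in consecutive banks, so accesses to neighbouring words can proceed in parallel, while addresses that are `WORD_BYTES * NUM_BANKS` bytes apart share a bank.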

FIG. 15 is a schematic diagram showing a configuration example of the local storage 102 shown in FIG. 14. Referring to FIG. 15, each local storage 102 includes a plurality of bank memories 1021, 1022, 1023, and 1024. For convenience of explanation, FIG. 15 shows a configuration with four bank memories, but any number of bank memories greater than one may be used.

The storage bus 104 (see FIG. 14) is parallelized to a degree corresponding to the number of bank memories so that the processor core 100 can access the bank memories 1021, 1022, 1023, and 1024 in parallel. The following describes the memory access processing procedure when the atomic instruction according to the first embodiment (CAS instruction 2) is executed in the computer 1B according to the fourth embodiment shown in FIG. 14.

FIG. 16 is a flowchart for describing the memory access processing procedure in the computer 1B according to the fourth embodiment shown in FIG. 14. The processing procedure shown in FIG. 16 is executed in parallel for each bank memory. FIG. 17 is a schematic diagram showing an example of the data structure of the access flags 108 used in the processing procedure defined by the flowchart of FIG. 16.

Referring to FIG. 16, the processor core 100 updates the access flags 108, which indicate which bank memory the instruction specified in each execution slot requests access to (step S200). As shown in FIG. 17, the access flags 108 are provided for each bank memory, one per execution slot. That is, the access flags 108 can store at least as many flags as the product of the number of bank memories and the number of execution slots.
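The access flags can be pictured as a table with one row per bank memory and one column per execution slot. The following sketch builds such a table for step S200. The function name `build_access_flags`, the byte addressing, and the modulo bank mapping are assumptions made for illustration.

```python
def build_access_flags(addrs, num_banks=4, word_bytes=4):
    """Return flags[bank][slot], True where the slot requests that bank."""
    flags = [[False] * len(addrs) for _ in range(num_banks)]
    for slot, addr in enumerate(addrs):
        bank = (addr // word_bytes) % num_banks   # assumed interleaving rule
        flags[bank][slot] = True
    return flags


# four execution slots: slots 0 and 2 hit bank 0, slots 1 and 3 hit bank 1
flags = build_access_flags([0x00, 0x04, 0x10, 0x04])
total_flags = sum(len(row) for row in flags)
print(flags, total_flags)
```

With four banks and four slots the table holds sixteen flags, matching the product stated in the text, and each slot sets exactly one of them.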

Steps S202 to S216 below are executed in parallel, once for each bank memory.
The processor core 100 determines whether the access flags 108 updated in step S200 contain an execution slot that requests access to the target bank memory (step S202). That is, it determines whether any flag of the target bank memory is set to “true” or “1”. If no execution slot requests access to the target bank memory (NO in step S202), the process proceeds to step S220.

If an execution slot requests access to the target bank memory (YES in step S202), the processor core 100 selects one of the execution slots that request access to the target bank memory (typically, those whose flag is set to “true” or “1”) (step S204). Let n be the slot number of the selected execution slot.

Next, the processor core 100 determines whether any other execution slot specifies the same address as the selected execution slot (step S206). More specifically, the processor core 100 sets the address specified in the selected execution slot as address X and compares address X with the address specified in each of the other execution slots that request access to the same bank memory. That is, the comparison is made for every execution slot whose flag for the target bank memory in the access flags 108 is set to “true” or “1” and whose slot number is not n.

If another execution slot specifies the same address as the selected execution slot (YES in step S206), the processor core 100 rewrites the flag in the access flags 108 corresponding to that other execution slot to “false” or “0” (step S208). The processor core 100 also writes a result (return value) indicating that execution of the instruction failed to the return value area corresponding to that other execution slot (step S210).

If no other execution slot specifies the same address as the selected execution slot (NO in step S206), steps S208 and S210 are skipped.

Next, the processor core 100 executes CAS instruction 2 according to the first embodiment on the data stored at address X, which is specified in the execution slot with the selected slot number n, in the target bank memory (step S212). That is, the processor core 100 reads data from the specified address X of the target bank memory, compares the read data with the old data, and writes the new data to memory only when the two match. The processor core 100 then writes the result of comparing the read data with the old data to the corresponding return value area as the execution result (return value) (step S214). The processor core 100 then rewrites the flag in the access flags 108 corresponding to the execution slot with slot number n for the target bank memory to “false” or “0” (step S216). The processing from step S202 onward is then repeated.

In step S220, the number of bank memories for which it has been determined that no execution slot requests access is tracked. If no execution slot requests access for any bank memory (YES in step S220), execution of the instruction ends. That is, execution of the instruction ends when the access flags 108 no longer contain a flag set to “true” or “1” for any bank memory.
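The flowchart of FIG. 16 can be sketched as a software loop. This is a sequential simulation under stated assumptions: the banks are processed one after another rather than in parallel, the lowest-numbered flagged slot is selected in step S204, and the names `bank_index` and `cas2_banked` as well as the dictionary-per-bank memory model are hypothetical.

```python
def bank_index(addr, num_banks=4, word_bytes=4):
    """Assumed interleaving rule: consecutive 32-bit words go to consecutive banks."""
    return (addr // word_bytes) % num_banks


def cas2_banked(banks, addrs, olds, news):
    """banks: one dict per bank memory (address -> value). Returns per-slot results."""
    n = len(addrs)
    num_banks = len(banks)
    # S200: one flag per (bank, slot)
    flags = [[bank_index(addrs[s], num_banks) == b for s in range(n)]
             for b in range(num_banks)]
    ret = [None] * n
    for b in range(num_banks):             # hardware would run banks in parallel
        while any(flags[b]):               # S202: any slot still requesting?
            slot = flags[b].index(True)    # S204: pick one slot (lowest number)
            x = addrs[slot]
            for other in range(n):         # S206: same address in another slot?
                if other != slot and flags[b][other] and addrs[other] == x:
                    flags[b][other] = False    # S208: drop the duplicate
                    ret[other] = False         # S210: report failure
            read = banks[b][x]             # S212: compare and swap
            ok = read == olds[slot]
            if ok:
                banks[b][x] = news[slot]
            ret[slot] = ok                 # S214: write the return value
            flags[b][slot] = False         # S216: mark the slot as done
    return ret                             # S220: no flags remain set


banks = [{0x00: 1, 0x10: 2}, {0x04: 3}, {}, {}]
ret = cas2_banked(banks, [0x00, 0x04, 0x10, 0x04], [1, 3, 9, 3], [11, 33, 99, 44])
print(ret, banks)
```

Slots 1 and 3 specify the same address, so only slot 1 is executed and slot 3 is reported as failed without touching memory. Slots 0 and 2 share a bank but not an address, so both are executed in successive iterations.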

With the processing described above, the atomic instruction according to the first embodiment can be realized even when local storage with a multi-bank configuration is adopted. The CSW instruction according to the second embodiment can be implemented in the same way, so the detailed description is not repeated.

<F. Embodiment 5>
Next, as a fifth embodiment, a configuration is described in which the atomic processing according to the present embodiment is implemented in microcode (a kind of firmware). FIG. 18 is a schematic diagram showing a hardware configuration related to the computer according to the fifth embodiment.

In a hardware configuration such as that shown in FIGS. 14 and 15, it is preferable to provide as many atomic operation units as there are memory banks, each associated with one memory bank, so that atomic processing can be executed in parallel for the memory banks. More specifically, as shown in FIG. 18(A), atomic operation units 1001 to 1004 are provided in the processor core 100 in association with, and equal in number to, the bank memories 1021 to 1024 in the local storage 102.

On the other hand, a hardware configuration such as that shown in FIG. 18(A) has the drawback of a higher implementation cost. In that case, a single atomic operation unit 1006 is provided for the bank memories 1021 to 1024, as shown in FIG. 18(B), and a mechanism that accesses only one of the bank memories 1021 to 1024 at any given time is implemented in software.

For example, a lock mechanism for locking access to an arbitrary bank memory may be implemented. Specifically, a special register indicating the lock state of this mechanism is provided, and an instruction that accesses the special register and reads its value is implemented together with a special memory write instruction that releases the lock. Using this pair of instructions, atomic processing may be realized by locking access to a specific bank memory and then executing ordinary arithmetic instructions.
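A software model of such a lock mechanism could look like the following. Everything here is hypothetical: the class `BankLock`, its methods, and the decision to combine the register check with the data load are modelling choices for illustration and are not taken from the specification or from any real instruction set.

```python
class BankLock:
    """Hypothetical model of the special lock register for one bank memory."""

    def __init__(self):
        self.locked = False

    def load_and_lock(self, bank, addr):
        """Read instruction: returns (acquired, value) and takes the lock if free."""
        if self.locked:
            return False, None
        self.locked = True
        return True, bank[addr]

    def store_and_unlock(self, bank, addr, value):
        """Special write instruction: stores the value and releases the lock."""
        bank[addr] = value
        self.locked = False


def cas_with_lock(lock, bank, addr, old, new):
    acquired, read = lock.load_and_lock(bank, addr)
    if not acquired:
        return None                  # caller retries in a later loop iteration
    ok = read == old                 # ordinary compare instruction
    lock.store_and_unlock(bank, addr, new if ok else read)
    return ok


bank = {0x08: 3}
lock = BankLock()
first = cas_with_lock(lock, bank, 0x08, 3, 4)   # lock is free, compare succeeds
lock.locked = True                              # simulate another slot holding the lock
second = cas_with_lock(lock, bank, 0x08, 4, 5)  # cannot acquire, must wait
print(first, second, bank)
```

Only the load and the unlocking store are special. The comparison between them is an ordinary instruction, which is what allows a single shared atomic operation unit to be replaced by software sequencing.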

FIG. 19 is a diagram showing an example of microcode for realizing atomic processing. FIG. 19(A) shows an example of microcode corresponding to CAS instruction 1 according to the related art, and FIG. 19(B) shows an example of microcode corresponding to CAS instruction 2 according to the first embodiment. FIG. 19 shows code examples that use pseudo instructions based on the Fermi architecture of NVIDIA Corporation (Santa Clara, California, USA).

In the microcode 400A shown in FIG. 19(A), the loop shown in the figure is executed until the atomic process has completed for all threads. In the microcode 400A, the instruction “SEL %out, %new, %ret, P1” instructs execution of CAS instruction 1. The LDSLK instruction shown in FIG. 19(A) can lock at most one execution slot among multiple execution slots whose access-destination memory banks overlap. Consequently, an instruction specified in a memory slot other than the target memory slot must wait until the next loop iteration.

In contrast, in the microcode 400B shown in FIG. 19(B), the loop is executed using CAS instruction 2 according to the first embodiment. In the microcode 400B, the instruction “SEL %out, %new, %ret, P1” instructs execution of CAS instruction 2.

In the microcode 400B, the instruction “XXX %addrX, P0, %addr” (reference numeral 402 in FIG. 19(B)) sets, as address X (variable %addrX), the address (variable %addr) of an execution slot that is in the same memory bank as the variable %addr and for which the variable P0 == true. It serves as preprocessing for the step (reference numeral 404 in FIG. 19(B)) that determines whether any slots with overlapping memory addresses remain.

The CSW instruction according to the second embodiment can be implemented in the same way as described above, so the detailed description is not repeated.

<G. Summary>
In a typical SIMD/SIMT processor, when a CAS instruction used for complex atomic processing is executed, access conflicts are avoided by serializing the execution of instructions with overlapping addresses, as with ordinary atomic processing. This serialization increases wasted memory accesses and becomes a performance bottleneck.

The inventor of the present application identified this new problem and adopted the following solution. When multiple threads processed in parallel by hardware execute an atomic CAS instruction simultaneously, at most one thread is processed for any single address. A result indicating success is returned for a thread whose execution succeeded. The other threads are not serialized, and a result indicating failure is returned for them immediately.

By adding a state specific to SIMT processors to the conditions under which instruction execution is reported as failed, the frequency of atomic memory accesses can be reduced, and the performance bottleneck in SIMT processors can be eliminated. That is, adding such an atomic instruction improves processor performance.

The embodiments disclosed herein should be considered illustrative in all respects and not restrictive. The scope of the present invention is defined by the claims rather than by the above description, and is intended to include all modifications within the meaning and scope equivalent to the claims.

10, 10B, 10-1, 10-2 multi-core processor, 100 processor core, 102 local storage, 104 storage bus, 108 access flag, 110 shared cache, 120 memory controller, 130 IO controller, 140, 150 internal bus, 144 processor interconnect bus, 160 external memory bus, 170 external device bus, 180 external memory, 190 external device, 200, 200# SIMD instruction, 201 to 208 execution slot, 211 to 214 flag area, 221 to 224 return value area, 1001 to 1004, 1006 atomic operation unit, 1021 to 1024 bank memory.

Claims

A SIMD (Single Instruction Multi Data) processor capable of parallel execution and having a plurality of execution slots so that the same instruction can be applied to a plurality of data simultaneously, wherein
the same instruction includes an atomic instruction that, while maintaining atomicity, reads data stored at an address specified for each execution slot and writes data to the specified address when the read data satisfies a predetermined condition, and
when the atomic instruction is executed, the processor
identifies, among the execution slots corresponding to the atomic instruction, execution slots whose specified addresses overlap, and
executes the instruction of one execution slot among the execution slots with overlapping addresses and returns the result of that execution, and, for the remaining execution slots, returns a result indicating that instruction execution failed without executing the instruction.

The processor according to claim 1, wherein the processor returns the result indicating that execution of the instruction failed for the remaining execution slots before executing the instruction of the one execution slot among the execution slots with overlapping addresses.

The processor according to claim 2, wherein the processor
returns, as a further result, the value read from the specified address when the instruction of the one execution slot among the execution slots with overlapping addresses has been executed, and
for the remaining execution slots,
returns, as a further result, the value written to memory by the instruction of the one execution slot when execution of the instruction of the one execution slot succeeded, and
returns, as a further result, the value that the one execution slot read from memory during execution of the instruction when execution of the instruction of the one execution slot failed.

The processor according to claim 1, wherein the processor determines the one execution slot to be executed, among the execution slots with overlapping addresses, based on a number assigned to each execution slot.

The processor according to any one of claims 1 to 4, wherein the processor includes a plurality of processor cores and a memory shared among the plurality of processor cores.